JP7339151B2 - Speech synthesizer, speech synthesis program and speech synthesis method

Info

Publication number: JP7339151B2
Application number: JP2019231876A
Authority: JP
Inventors: 駿介後藤; 弘太郎大西; 健太郎橘; 紘一郎森
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2023-09-05
Anticipated expiration: 2039-12-23
Also published as: JP2021099454A

Description

The present invention relates to a speech synthesizer, a speech synthesis program, and a speech synthesis method for synthesizing speech from images.

Speech synthesizers that synthesize speech by inputting information such as characters and phonemes into a speech synthesis model are known. Also known is a speech synthesizer that, in addition to characters and phonemes, inputs into the speech synthesis model speaker characteristics derived from speech uttered by a speaker, and thereby synthesizes speech that sounds as if that speaker had uttered the given characters or phonemes (Non-Patent Document 1).

A technique has also been disclosed that extracts features of a face image of a target speaker based on subjective evaluation and, using a statistical model, generates speech that sounds as if that speaker had uttered it according to those features (Non-Patent Document 2).

Non-Patent Document 1: "Transfer Learning from Speaker Verification to Multi-speaker Text-To-Speech Synthesis": https://arxiv.org/abs/1806.04558

Non-Patent Document 2: "A Comparative Study of Statistical Conversion of Face to Voice Based on Their Subjective Impressions": https://www.isca-speech.org/archive/Interspeech_2018/pdfs/2005.pdf

However, speech synthesizers that can objectively and automatically synthesize speech sounding as if a target speaker had uttered it, from face information of that person and text information, have not been sufficiently studied.

One aspect of the present invention is a speech synthesizer constructed by machine learning using a data set in which image data of an object, audio data of speech uttered by the object, and content information indicating the content of the audio data are associated with one another, the speech synthesizer comprising: an image encoder that receives image data as input and outputs a feature vector for the image data; and a speech synthesis unit that receives as input the feature vector generated by the image encoder and content information indicating the content of the speech to be generated, and synthesizes and outputs speech that sounds as if the object shown in the image data had uttered the content corresponding to the content information, wherein the image encoder is machine-learned, using a speech encoder that has been machine-learned to output, when audio data is input, a feature vector indicating the object associated with that audio data, such that the feature vector output when image data of an object is input matches the feature vector output from the speech encoder when the audio data associated with that image data is input, and wherein the speech synthesis unit is machine-learned such that, when the feature vector output from the speech encoder in response to audio data of an object and the content information associated with that audio data are input, the audio data of the speech it synthesizes and outputs matches the audio data that was input to the speech encoder.

Another aspect of the present invention is a speech synthesis program that uses a data set in which image data of an object, audio data of speech uttered by the object, and content information indicating the content of the audio data are associated with one another, the program causing a computer to function as: an image encoder that receives image data as input and outputs a feature vector for the image data; and a speech synthesis unit that receives as input the feature vector generated by the image encoder and content information indicating the content of the speech to be generated, and synthesizes and outputs speech that sounds as if the object shown in the image data had uttered the content corresponding to the content information, wherein the image encoder is machine-learned, using a speech encoder that has been machine-learned to output, when audio data is input, a feature vector indicating the object associated with that audio data, such that the feature vector output when image data of an object is input matches the feature vector output from the speech encoder when the audio data associated with that image data is input, and wherein the speech synthesis unit is machine-learned such that, when the feature vector output from the speech encoder in response to audio data of an object and the content information associated with that audio data are input, the audio data of the speech it synthesizes and outputs matches the audio data that was input to the speech encoder.

Another aspect of the present invention is a speech synthesis method that uses a data set in which image data of an object, audio data of speech uttered by the object, and content information indicating the content of the audio data are associated with one another, the method synthesizing speech using: an image encoder that receives image data as input and outputs a feature vector for the image data; and a speech synthesis unit that receives as input the feature vector generated by the image encoder and content information indicating the content of the speech to be generated, and synthesizes and outputs speech that sounds as if the object shown in the image data had uttered the content corresponding to the content information, wherein the image encoder is machine-learned, using a speech encoder that has been machine-learned to output, when audio data is input, a feature vector indicating the object associated with that audio data, such that the feature vector output when image data of an object is input matches the feature vector output from the speech encoder when the audio data associated with that image data is input, and wherein the speech synthesis unit is machine-learned such that, when the feature vector output from the speech encoder in response to audio data of an object and the content information associated with that audio data are input, the audio data of the speech it synthesizes and outputs matches the audio data that was input to the speech encoder.

Here, it is preferable that the object is a person, and that the speech synthesis unit receives as input the feature vector generated by the image encoder and content information indicating the content of the speech to be generated, and synthesizes and outputs speech that sounds as if the person shown in the image data had uttered the content corresponding to the content information.

It is also preferable that the audio data used for machine learning of the speech synthesis unit is cleaner than the audio data used for machine learning of the speech encoder.

It is also preferable that the audio data used for machine learning of the speech synthesis unit is audio data included in VCTK or LibriTTS, and that the audio data used for machine learning of the speech encoder is VoxCeleb2 audio data extracted from video sites.

It is also preferable that the image encoder is machine-learned so as to reduce the difference between the average of the feature vectors output from the speech encoder when a plurality of audio data associated with the same object are input and the feature vector output when image data of that object is input.

It is also preferable that the speech encoder is machine-learned so as to reduce the difference between the average of the feature vectors output when a plurality of audio data associated with the same object are input and the feature vector output when other audio data associated with that object is input.

One object of the embodiments of the present invention is to provide a speech synthesizer, a speech synthesis program, and a speech synthesis method for synthesizing speech from an image. Other objects of the embodiments of the present invention will become apparent by reference to the specification as a whole.

FIG. 1 is a diagram showing the configuration of a speech synthesizer according to an embodiment of the present invention.

FIG. 2 is a flowchart showing a speech synthesis method according to the embodiment of the present invention.

FIG. 3 is a diagram showing the speech synthesis model of the speech synthesizer according to the embodiment of the present invention.

FIG. 4 is a diagram for explaining face image data in the embodiment of the present invention.

FIG. 5 is a diagram for explaining a method of constructing the speech encoder in the embodiment of the present invention.

FIG. 6 is a diagram for explaining a method of constructing the image encoder in the embodiment of the present invention.

FIG. 7 is a diagram for explaining a method of constructing the multi-speaker TTS in the embodiment of the present invention.

As shown in FIG. 1, a speech synthesizer 100 according to the embodiment of the present invention includes a processing unit 10, a storage unit 12, an input unit 14, an output unit 16, and a communication unit 18.

The speech synthesizer 100 can be implemented on a general-purpose computer. The processing unit 10 includes a CPU and the like, and centrally controls processing in the speech synthesizer 100. The processing unit 10 performs the speech synthesis processing of the present embodiment by executing a speech synthesis program stored in the storage unit 12. The storage unit 12 stores information required for speech synthesis processing, such as the speech synthesis models used in that processing (speech encoder, image encoder, and multi-speaker TTS (Text-To-Speech)) and the face image data, audio data, and text data needed to generate the models. The storage unit 12 can be, for example, a semiconductor memory or a hard disk. The storage unit 12 may be provided inside the speech synthesizer 100, or externally so that the processing unit 10 can access it over a wired or wireless information network. The input unit 14 includes means for inputting information to the speech synthesizer 100. The output unit 16 includes means for displaying information processed by the speech synthesizer 100. The communication unit 18 includes an interface for exchanging information with external devices (such as servers). The communication unit 18 enables communication with external devices by, for example, connecting to an information communication network such as the Internet.

[Construction of the speech synthesizer]
A method of constructing the speech synthesizer according to the present embodiment is described below with reference to the flowchart of FIG. 2. The speech synthesizer 100 is constructed by executing the speech synthesis program to perform machine learning for the speech synthesis models (speech encoder, multi-speaker TTS, and image encoder). The speech synthesizer 100 can then be used to synthesize speech automatically based on the speech synthesis model.

As shown in FIG. 3, the speech synthesis model of the speech synthesizer 100 includes an image encoder 102 and a multi-speaker TTS 104. The speech synthesizer 100 is constructed by machine learning that combines a speech encoder, an image encoder, and a multi-speaker TTS.

A set of text data, face image data, and audio data is used to construct the speech synthesis model of the present embodiment. The text data represents the content of a speaker's utterance as characters or phonemes, and is used as content information indicating the content of the speech generated by the speech synthesizer 100. The face image data is an image showing the speaker's face. The audio data is the speech corresponding to the characters or phonemes in the text data. Here the audio data is speech uttered by a speaker, but it may also include sound emitted by some other object. For machine learning of the speech synthesis model, the face image data of a speaker and the audio data corresponding to text data uttered by that speaker are used as a set.

In the present embodiment, VoxCeleb2 and VGGFace2 were used as the combination of face image data and audio data. VoxCeleb2 is a data set of utterances (audio data) by more than 6000 celebrities, extracted from video sites. VoxCeleb2 provides a data set in which face image data and audio data correspond for speakers of diverse genders and nationalities. However, because face image data cut out from the videos has low resolution, in the present embodiment face image data was taken from VGGFace2, an image data set containing the same persons as VoxCeleb2, and used in combination with the VoxCeleb2 audio data.

The audio data may be used after being downsampled to a sampling frequency of 16 kHz, for example. The sampling frequency is not limited to this, however, and other sampling frequencies may be used.

FIG. 4 compares a speaker's face image cut out from a video in VoxCeleb2 with the corresponding speaker's face image in VGGFace2. As FIG. 4 shows, the face images in VGGFace2 have higher resolution than those in VoxCeleb2.

VCTK and LibriTTS were used as the data sets of text data and audio data. VCTK contains more than 90,000 utterances by more than 100 speakers. LibriTTS contains more than 18,000 utterances by more than 800 speakers. The audio data in both VCTK and LibriTTS is clean speech with less background noise than the audio data in VoxCeleb2.

Although the present embodiment uses text data that expresses the content of speech as characters, the invention is not limited to this. Instead of, or in addition to, text data, data that represents the content of speech in another way may be used. For example, phoneme data that represents the content of speech as phonemes may be used, or video data that represents the content of speech through changes in the speaker's facial expression. Audio data may also be used instead of text data.

Likewise, although the present embodiment uses face image data of real persons to represent speakers, the invention is not limited to this. For example, face image data of a character in an animation or the like, or face image data of a three-dimensional human model, may be used instead of face image data of a real person.

In step S10, the speech encoder is constructed by machine learning. Through the processing in this step, the speech synthesizer 100 functions as speech encoder construction means. As shown in FIG. 5, the speech encoder 106 is constructed, by machine learning using a data set of a plurality of audio data associated with each speaker, so that when audio data is input it outputs a speaker feature vector indicating the speaker who uttered that audio data.

Specifically, a plurality of audio data from the same speaker are treated as a mini-batch, and machine learning is performed so that the speaker feature vector output when one piece of audio data in the mini-batch is input to the speech encoder 106 approaches the average of the speaker feature vectors output when the other audio data in the mini-batch are input to the speech encoder 106.
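The mini-batch objective described above can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: the function names are hypothetical, and the use of "1 minus cosine similarity" as the distance to minimize is an assumption, since the paragraph only says the embedding should approach the average of the others.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def leave_one_out_targets(minibatch):
    """For each embedding, the mean of the OTHER embeddings of the same speaker."""
    targets = []
    for i in range(len(minibatch)):
        others = minibatch[:i] + minibatch[i + 1:]
        targets.append(mean_vector(others))
    return targets


def minibatch_loss(minibatch):
    """Assumed distance: mean of (1 - cosine similarity) to the leave-one-out target."""
    targets = leave_one_out_targets(minibatch)
    total = sum(1.0 - cosine_similarity(e, t) for e, t in zip(minibatch, targets))
    return total / len(minibatch)
```

A mini-batch here is a list of at least two speaker feature vectors from one speaker; the loss is zero when all of them point in the same direction and grows as they spread apart.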

In the present embodiment, it is preferable to use VoxCeleb2 audio data extracted from video sites for machine learning of the speech encoder 106. The VoxCeleb2 audio data includes recordings mixed with background noise or background music (BGM) as well as clean recordings without them. Training the speech encoder 106 on audio data that contains noise and BGM yields a speech encoder 106 whose output speaker feature vectors depend less on how clean the audio is than if it were trained only on clean audio.

For example, the window length, hop length, and FFT length for the audio data may be 400 samples (25 ms), 160 samples (10 ms), and 512 samples, respectively, with a Hann window as the window function. The input may be a 40-dimensional log-Mel spectrogram 160 frames long, and the speaker feature vector output from the speech encoder 106 may be a 256-dimensional vector. The hidden layers of the neural network forming the speech encoder 106 may consist of a three-layer LSTM (Long Short-Term Memory) with 768 dimensions, combined with a linear layer that maps the final frame from 768 to 256 dimensions. The output may be L2-normalized before the error function is computed. The learning rate may be 10^-5, with Adam as the optimizer. The configuration of the speech encoder 106 is not limited to these conditions, however; any configuration that, given audio data, outputs an appropriate speaker feature vector indicating the speaker who uttered it may be used.
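The framing figures above can be checked with a little arithmetic. The sketch below assumes the 16 kHz sampling frequency mentioned earlier and no padding at the signal edges; the constant and function names are illustrative, and the periodic form of the Hann window is an assumption, since the text does not say which form is used.

```python
import math

SAMPLE_RATE = 16_000  # Hz, the example sampling frequency from the text
WINDOW = 400          # samples per analysis window
HOP = 160             # samples between successive windows
N_FFT = 512           # FFT length
N_FRAMES = 160        # frames per input spectrogram


def samples_to_ms(n_samples, sample_rate=SAMPLE_RATE):
    """Duration of n_samples in milliseconds."""
    return 1000.0 * n_samples / sample_rate


def samples_needed(n_frames, window=WINDOW, hop=HOP):
    """Samples required to produce n_frames full windows without padding."""
    return (n_frames - 1) * hop + window


def n_frequency_bins(n_fft=N_FFT):
    """Number of non-redundant bins of a real-input FFT."""
    return n_fft // 2 + 1


def hann_window(length):
    """Periodic Hann window (assumed form)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]
```

Under these assumptions a 160-frame input covers 25,840 samples, or about 1.6 seconds of audio, and each frame has 257 frequency bins before the 40-band Mel filter bank is applied.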

In step S12, the image encoder 102 is constructed by machine learning. Through the processing in this step, the speech synthesizer 100 functions as image encoder construction means. The image encoder 102 is constructed so that it outputs an appropriate speaker feature vector when face image data of a speaker is input. That is, the image encoder 102 functions as an image encoder that, given face image data of a speaker, outputs a feature vector corresponding to that speaker.

In the present embodiment, as shown in FIG. 6, machine learning is performed so that, when a pair (data set) of audio data and face image data associated with the same speaker is input to the speech encoder 106 and the image encoder 102 respectively, the speaker feature vector output from the speech encoder 106 and the speaker feature vector output from the image encoder 102 match as closely as possible.

For example, the VoxCeleb2 audio data of each speaker is input to the speech encoder 106 constructed by machine learning in step S10, and the average of the resulting speaker feature vectors is used as a teacher vector, which is combined with the face image data of that speaker to form a data set for supervised learning. The image encoder 102 is then machine-learned so that the speaker feature vector it outputs when face image data is input comes as close as possible to the teacher vector (average speaker feature vector) of the speaker corresponding to that face image data, and as far as possible from the teacher vectors of other speakers. More specifically, machine learning may be performed so as to minimize, for example, a GE2E (generalized end-to-end) loss in its softmax form.
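The teacher-vector scheme above can be sketched as follows. This is an illustrative simplification, not the patent's exact loss: the function names are hypothetical, the fixed `scale` stands in for the learnable scale and bias that GE2E normally uses, and the teacher vectors are L2-normalized here as an assumption.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def teacher_vectors(speech_embeddings_by_speaker):
    """Average the speech-encoder outputs of each speaker into one teacher vector."""
    return {
        speaker: l2_normalize(mean_vector(embeddings))
        for speaker, embeddings in speech_embeddings_by_speaker.items()
    }


def softmax_loss(image_embedding, speaker, teachers, scale=10.0):
    """Cross-entropy over scaled cosine similarities to every teacher vector.

    Small when the image embedding is close to its own speaker's teacher
    vector and far from the teacher vectors of the other speakers.
    """
    e = l2_normalize(image_embedding)
    logits = {
        spk: scale * sum(x * y for x, y in zip(e, t))
        for spk, t in teachers.items()
    }
    m = max(logits.values())  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[speaker]
```

Because the teacher vectors come from an already trained speech encoder, they stay fixed while the image encoder is trained; only the image embedding moves.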

The output of the image encoder 102 may be, for example, a 256-dimensional feature vector. The neural network forming the image encoder 102 may be, for example, a convolutional neural network such as VGG19. The error function may be a Supervised GE2E loss, in which the centroids that GE2E (an error function for speaker features) computes on the fly are replaced with the feature vectors learned by the speech encoder.

Here, for machine learning of the image encoder 102, it is preferable to use the face image data for each speaker from VGGFace2 rather than the speaker's face image data extracted from video sites in VoxCeleb2. That is, the audio data used for machine learning of the image encoder 102 is preferably each speaker's audio data extracted from video sites in VoxCeleb2, and the face image data used for machine learning of the image encoder 102 is preferably the corresponding speaker's face image data in VGGFace2. This is because, as shown in FIG. 4, the face images in VGGFace2 have higher resolution than those in VoxCeleb2. An image encoder 102 that generates speaker feature vectors capturing the features of the face image more appropriately can thus be obtained. The face image data may be, for example, image data of 160 x 160 pixels.

In step S14, the multi-speaker TTS 104 is constructed by machine learning. Through the processing in this step, the speech synthesizer 100 functions as multi-speaker TTS construction means. The multi-speaker TTS 104 is constructed so that, when a speaker feature vector indicating the characteristics of a speaker and text data indicating the content of the speech to be synthesized are input, it synthesizes and outputs speech that sounds as if that speaker had uttered the content of the text data. That is, the multi-speaker TTS 104 functions as the speech synthesis unit (speech synthesizer) that receives a feature vector and text data as input and synthesizes and outputs speech corresponding to them.

In the present embodiment, as shown in FIG. 7, the multi-speaker TTS 104 is constructed by machine learning using, as training data, a data set of audio data and the text data corresponding to that audio data. The audio data in the training data is input to the speech encoder 106 constructed in step S10, and the speaker feature vector that the speech encoder 106 outputs for that audio data is input to the multi-speaker TTS 104. The text data corresponding to the audio data input to the speech encoder 106 is also input to the multi-speaker TTS 104. The text data may first be converted into linguistic features (for example, part of speech, reading, number of moras, and accent type or accent phrase) using existing text analysis means before being input to the multi-speaker TTS 104. Given these inputs, machine learning is performed so that the multi-speaker TTS 104 outputs the same audio data as was input to the speech encoder 106.
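The data flow of this training step can be sketched with placeholder components. Everything named here is hypothetical: `speech_encoder` and `tts` are stand-in callables rather than real models, audio is represented as a plain list of numbers, and the squared error is used only as a simple example of a reconstruction loss.

```python
def squared_error(predicted, target):
    """Sum of squared differences between two equally long sequences."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def make_training_example(audio, text, speech_encoder):
    """Pair the text with the speaker vector derived from the SAME audio.

    The audio itself becomes the reconstruction target.
    """
    speaker_vector = speech_encoder(audio)
    return {"text": text, "speaker_vector": speaker_vector, "target_audio": audio}


def training_loss(example, tts):
    """Loss for one example: how far the TTS output is from the original audio."""
    predicted = tts(example["text"], example["speaker_vector"])
    return squared_error(predicted, example["target_audio"])
```

The point of the arrangement is that no face images are needed at this stage: the speaker vector comes from the speech encoder, so clean text-and-audio corpora can be used on their own.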

For example, the multi-speaker TTS 104 combines duration estimation and acoustic feature estimation, and a bidirectional LSTM is applied to both. The input to duration estimation is the concatenation of the linguistic features of each phoneme and the feature vector of each utterance. The output of duration estimation is the number of frames corresponding to the phoneme duration. The input to acoustic feature estimation is the concatenation of the linguistic features of each frame and the feature vector of each utterance. The output of acoustic feature estimation is the acoustic features of each frame (F0, indicating the pitch of the voice; the spectral envelope, for example the mel-cepstrum, indicating the characteristics of the vocal tract; and an aperiodicity index indicating how breathy or hoarse the voice is). The squared error may be used as the error function for both the duration estimation and acoustic feature estimation models.
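How the two stages connect can be illustrated by the step between them: the estimated durations turn phoneme-level features into frame-level inputs, each concatenated with the utterance-level speaker vector. The sketch below is a simplification under stated assumptions; the function name is hypothetical, and real systems usually also add positional information within each phoneme, which is omitted here.

```python
def expand_to_frames(phoneme_features, durations, speaker_vector):
    """Repeat each phoneme's features for its number of frames and
    append the utterance-level speaker vector to every frame."""
    if len(phoneme_features) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames = []
    for features, n_frames in zip(phoneme_features, durations):
        for _ in range(n_frames):
            frames.append(list(features) + list(speaker_vector))
    return frames
```

For instance, two phonemes with durations of 2 and 1 frames yield three frame-level input vectors, all carrying the same speaker vector.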

Here, it is preferable to use VCTK and LibriTTS as the data sets of text data and audio data. Because the audio data in the VCTK and LibriTTS data sets is clean speech with less noise than the audio data in VoxCeleb2, a multi-speaker TTS 104 that synthesizes and outputs cleaner speech can be constructed.

In step S16, the speech synthesis model of the speech synthesizer 100 is constructed. Through the processing in this step, the speech synthesizer 100 functions as a speech synthesis model construction means. The speech synthesis model is formed by combining the image encoder 102 constructed in step S12 with the multi-speaker TTS 104 constructed in step S14. That is, as shown in FIG. 3, text data and the speaker feature vector output from the image encoder 102 are input to the multi-speaker TTS 104, which synthesizes speech according to the input text data and speaker feature vector. In this way, a speech synthesizer 100 can be configured that synthesizes and outputs speech as if the speaker corresponding to the face image data input to the image encoder 102 had uttered the content of the text data.

[Speech synthesis processing]
The processing by which the speech synthesizer 100 synthesizes speech is described below with reference to FIG. 3. To synthesize speech, face image data of the person who is to be the speaker is input to the image encoder 102 of the speech synthesizer 100. The image encoder 102 generates and outputs a speaker feature vector corresponding to the input face image data, and this speaker feature vector is input to the multi-speaker TTS 104. Text data indicating the content of the speech to be synthesized is also input to the multi-speaker TTS 104. The multi-speaker TTS 104 then synthesizes speech according to the input text data and speaker feature vector. In this way, speech is synthesized and output as if the speaker corresponding to the face image data input to the image encoder 102 had uttered the content of the text data.
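The data flow at synthesis time can be sketched as a small wrapper that chains the two components. The class name, the callable signatures and the two stand-in components below are all hypothetical and exist only so the sketch runs; real implementations would be the trained image encoder 102 and multi-speaker TTS 104.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class SpeechSynthesizer:
    """Chains an image encoder and a multi-speaker TTS (both supplied as callables)."""
    image_encoder: Callable[[np.ndarray], np.ndarray]
    multi_speaker_tts: Callable[[str, np.ndarray], np.ndarray]

    def synthesize(self, face_image: np.ndarray, text: str) -> np.ndarray:
        speaker_vec = self.image_encoder(face_image)      # face image -> speaker feature vector
        return self.multi_speaker_tts(text, speaker_vec)  # text + vector -> synthesized output

def dummy_image_encoder(face_image):
    """Stand-in: unit-length mean colour of an (H, W, C) image."""
    v = face_image.reshape(-1, face_image.shape[-1]).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def dummy_tts(text, speaker_vec):
    """Stand-in: one output 'frame' per character, each equal to the speaker vector."""
    return np.tile(speaker_vec, (len(text), 1))

synth = SpeechSynthesizer(dummy_image_encoder, dummy_tts)
face_a = np.ones((4, 4, 3)) * np.array([1.0, 2.0, 3.0])
face_b = np.ones((4, 4, 3)) * np.array([3.0, 2.0, 1.0])
out_a = synth.synthesize(face_a, "hello")
out_b = synth.synthesize(face_b, "hello")
```

Because the speaker identity enters only through the feature vector, the same text produces different output for different face images, which is the behaviour the stand-ins are meant to make visible.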

As described above, the speech synthesizer 100 according to this embodiment takes text data and face image data of a speaker as input and can synthesize and output appropriate speech as if it had been uttered by that speaker.

However, the scope of application of this embodiment is not limited to human speech synthesis. For example, a speech synthesizer 100 that synthesizes animal voices can be constructed by machine learning using image data of an animal, the voice uttered by the animal, and information indicating the content of that voice. Likewise, a speech synthesizer 100 that synthesizes the sound of a moving body can be constructed by machine learning using image data of a moving body such as an automobile or a train, the sound generated by the moving body, and video information showing how the moving body moves, which indicates the content of that sound.

The speech synthesizer 100 may also be constructed using speech data in place of text data. In this case, speech is synthesized and output as if the person in the face image data input to the image encoder 102 had uttered the content of the input speech data. The speech synthesizer 100 can therefore be used like a voice changer that modifies input data and outputs it.

In the speech synthesizer 100 according to this embodiment, all components are implemented in a single device, but the components may instead be implemented in different devices or by different executing entities. For example, some of the components may be distributed across a plurality of computers.

10 processing unit, 12 storage unit, 14 input unit, 16 output unit, 18 communication unit, 100 speech synthesizer, 102 image encoder, 104 multi-speaker TTS, 106 speech encoder.

Claims

A speech synthesizer constructed by machine learning using a data set in which image data of an object, audio data of a sound emitted by the object, and content information indicating the content of the audio data are associated with one another, the speech synthesizer comprising:
an image encoder that receives image data as input and outputs a feature vector for the image data; and
a speech synthesis unit that receives as input the feature vector generated by the image encoder and content information indicating the content of the sound to be generated, and synthesizes and outputs a sound as if the object indicated by the image data had emitted the content corresponding to the content information,
wherein
the image encoder is machine-learned, using a speech encoder that has been machine-learned to receive audio data as input and output a feature vector indicating the object associated with that audio data, so that the feature vector output by the image encoder when image data of an object is input matches the feature vector output from the speech encoder when the audio data associated with that image data is input, and
the speech synthesis unit is machine-learned so that the audio data of the sound it synthesizes and outputs, when the feature vector output from the speech encoder for input audio data of an object and the content information associated with that audio data are input, matches the audio data input to the speech encoder.

The speech synthesizer according to claim 1, wherein
the object is a person, and
the speech synthesis unit receives as input the feature vector generated by the image encoder and content information indicating the content of the speech to be generated, and synthesizes and outputs speech as if the person indicated by the image data had uttered the content corresponding to the content information.

The speech synthesizer according to claim 1 or 2, wherein
the audio data used for machine learning of the speech synthesis unit is cleaner than the audio data used for machine learning of the speech encoder.

The speech synthesizer according to any one of claims 1 to 3, wherein
the image encoder is machine-learned so as to reduce the difference between the average of the feature vectors output from the speech encoder when a plurality of audio data associated with the same object are input and the feature vector output by the image encoder when image data of that object is input.

The speech synthesizer according to any one of claims 1 to 4, wherein
the speech encoder is machine-learned so as to reduce the difference between the average of the feature vectors output when a plurality of audio data associated with the same object are input and the feature vector output when other audio data associated with that object is input.

A speech synthesis program using a data set in which image data of an object, audio data of a sound emitted by the object, and content information indicating the content of the audio data are associated with one another, the program causing a computer to function as:
an image encoder that receives image data as input and outputs a feature vector for the image data; and
a speech synthesis unit that receives as input the feature vector generated by the image encoder and content information indicating the content of the sound to be generated, and synthesizes and outputs a sound as if the object indicated by the image data had emitted the content corresponding to the content information,
wherein
the image encoder is machine-learned, using a speech encoder that has been machine-learned to receive audio data as input and output a feature vector indicating the object associated with that audio data, so that the feature vector output by the image encoder when image data of an object is input matches the feature vector output from the speech encoder when the audio data associated with that image data is input, and
the speech synthesis unit is machine-learned so that the audio data of the sound it synthesizes and outputs, when the feature vector output from the speech encoder for input audio data of an object and the content information associated with that audio data are input, matches the audio data input to the speech encoder.

A speech synthesis method using a data set in which image data of an object, audio data of a sound emitted by the object, and content information indicating the content of the audio data are associated with one another, the method comprising synthesizing a sound using:
an image encoder that receives image data as input and outputs a feature vector for the image data; and
a speech synthesis unit that receives as input the feature vector generated by the image encoder and content information indicating the content of the sound to be generated, and synthesizes and outputs a sound as if the object indicated by the image data had emitted the content corresponding to the content information,
wherein
the image encoder is machine-learned, using a speech encoder that has been machine-learned to receive audio data as input and output a feature vector indicating the object associated with that audio data, so that the feature vector output by the image encoder when image data of an object is input matches the feature vector output from the speech encoder when the audio data associated with that image data is input, and
the speech synthesis unit is machine-learned so that the audio data of the sound it synthesizes and outputs, when the feature vector output from the speech encoder for input audio data of an object and the content information associated with that audio data are input, matches the audio data input to the speech encoder.