JP2021148942A

JP2021148942A - Voice quality conversion system and voice quality conversion method

Info

Publication number: JP2021148942A
Application number: JP2020048518A
Authority: JP
Inventors: 慶華孫; Keika Son
Original assignee: Hitachi Solutions Technology Ltd
Current assignee: Hitachi Solutions Technology Ltd
Priority date: 2020-03-19
Filing date: 2020-03-19
Publication date: 2021-09-27
Anticipated expiration: 2040-03-19
Also published as: JP7406418B2

Abstract

To provide a voice quality conversion system and a voice quality conversion method capable of stably performing voice quality conversion with high sound quality. SOLUTION: A voice quality conversion system includes a voice quality conversion data creation device and a voice quality conversion device. In PPG conversion model learning, the voice quality conversion data creation device generates a dictionary with prosody information from a word dictionary, applies morphological analysis to text included in a voice corpus, generates a phoneme sequence with prosody information on the basis of the result of the morphological analysis and the dictionary with prosody information, generates an acoustic model from the phoneme sequence with prosody information and a voice feature amount output as a result of feature amount analysis of voice information included in the voice corpus, trains the acoustic model, and generates a PPG conversion model that includes the prosody information. It also generates a voice parameter generation model by using the PPG conversion model that includes the prosody information. The voice quality conversion device generates voice parameters for input voice and performs voice quality conversion on the basis of the voice parameter generation model. SELECTED DRAWING: Figure 5

Description

The present invention relates to a voice quality conversion system and a voice quality conversion method, and more particularly to a voice quality conversion system and a voice quality conversion method that enable stable, high-quality conversion when converting the voice quality of speech.

In recent years, against the background of dramatic improvements in technologies such as speech recognition, machine translation, and dialogue generation, the practical application of AI-based voice communication, such as speech translation, voice dialogue services, and service robots, has progressed rapidly. Among these technologies, voice conversion (VC) is attracting attention as one of the important ones. Voice conversion is a technology that edits the utterance of one speaker (the source speaker) so that it sounds like the voice of another speaker (the target speaker), without changing the content of the utterance or the manner of speaking.

In recent years, companies have developed prototypes of service robots one after another, and PoC (proof of concept) trials are being carried out. Speech recognition and speech synthesis are indispensable for such service robots. However, in real environments (especially airports, train stations, and the like), speech recognition accuracy is poor and the dialogue success rate is very low. As a result, actual deployment of service robots is postponed and real data cannot be accumulated, which is one of the factors hindering growth of the service robot market. Therefore, in order to improve the quality of customer service by robots, a hybrid voice dialogue service in which automatic response and operator response work together is being conceived, in parallel with research on improving the accuracy of speech recognition and intent understanding.

To realize this concept, the automatic response voice generated by TTS (Text To Speech) and the operator's natural voice must be switched seamlessly, so a voice quality conversion technology that converts the operator's voice into the robot's voice is indispensable.

Regarding such voice quality conversion technology, for example, Non-Patent Document 1 discusses performing voice quality conversion using phonetic posteriorgrams (PPG), that is, phoneme posterior probabilities.

L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, "PHONETIC POSTERIORGRAMS FOR MANY-TO-ONE VOICE CONVERSION WITHOUT PARALLEL DATA TRAINING," Multimedia and Expo (ICME), 2016.

Conventional voice quality conversion technology has had problems such as conversion performance degrading markedly depending on the recording environment of the input voice, and the technology described in Non-Patent Document 1 is intended to solve such problems. The technology described in Non-Patent Document 1 attempts to realize stable voice quality conversion by removing the speaker characteristics and recording-environment sound from the input voice: using an acoustic model trained for speech recognition, it converts the voice feature amounts into a PPG that contains only information related to the utterance content.

However, a PPG generated from an acoustic model used for Japanese speech recognition consists of the posterior probabilities of Japanese phonemes, and is considered not to contain information unrelated to articulatory structure, such as prosodic information and aperiodic component information. It is therefore difficult to estimate the fundamental frequency (F0) from a Japanese PPG alone. (F0 is one of the acoustic features, defined as the reciprocal of the interval of the impulse train of the voice; it corresponds to the pitch of the voice.) In conventional research, only the MCEP (mel-cepstral coefficients; the mel-cepstrum samples the low-frequency region finely in accordance with human auditory characteristics), which relate to vocal tract structure, are converted via the PPG, while prosodic information (pitch and the like) is converted by linear transformation. The voice is then reconstructed from the separately converted parameters. However, it is generally known that using separately generated voice parameters in voice quality conversion tends to degrade the quality of the converted voice. In particular, because F0 extraction is very unstable, stable voice quality conversion could not be achieved.
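The text does not say which linear transformation the conventional approach uses for pitch. As an illustration only, the following minimal sketch assumes one widely used form, a mean-variance transformation in the log-F0 domain. The function names `logf0_stats` and `convert_f0` are hypothetical and do not come from the source.

```python
import math

def logf0_stats(f0_values):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)

def convert_f0(f0_values, src_stats, tgt_stats):
    """Map source F0 onto the target speaker's log-F0 distribution.

    Assumption: a mean-variance linear transform in the log domain.
    Unvoiced frames (F0 <= 0) are passed through as 0.0.
    """
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    converted = []
    for f in f0_values:
        if f <= 0:
            converted.append(0.0)
        else:
            z = (math.log(f) - src_mean) / src_std
            converted.append(math.exp(z * tgt_std + tgt_mean))
    return converted

# Toy data: the source speaker is one octave below the target speaker.
source_f0 = [100.0, 0.0, 200.0]
target_f0 = [200.0, 400.0]
converted = convert_f0(source_f0, logf0_stats(source_f0), logf0_stats(target_f0))
```

Because this transform depends on F0 values extracted frame by frame, any instability in F0 extraction propagates directly into the converted pitch, which is the weakness the passage above points out.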

An object of the present invention is to provide a voice quality conversion system and a voice quality conversion method that enable stable, high-quality conversion when converting the voice quality of speech.

The voice quality conversion system of the present invention is preferably a voice quality conversion system in which an information processing device outputs synthesized voice obtained by converting the voice quality of input voice, and comprises a voice quality conversion data creation device and a voice quality conversion device. The voice quality conversion data creation device comprises a PPG (phoneme posterior probability) conversion model learning unit that receives a word dictionary and a voice corpus associating text with voice information and generates a PPG conversion model, and a voice parameter generation model learning unit that receives the voice corpus and the PPG conversion model and generates a voice parameter generation model. The PPG conversion model learning unit generates a dictionary with prosody information from the word dictionary, performs morphological analysis on the text included in the voice corpus, generates a phoneme sequence with prosody information on the basis of the result of the morphological analysis and the dictionary with prosody information, generates an acoustic model from the phoneme sequence with prosody information and the voice feature amount output as a result of feature amount analysis of the voice information included in the voice corpus, trains the acoustic model, and generates the PPG conversion model. The voice parameter generation model learning unit generates, from the PPG conversion model and the voice corpus, a PPG corresponding to the PPG conversion model, extracts voice parameters from the voice information included in the voice corpus, learns from the generated PPG and the voice parameters, and generates the voice parameter generation model. The voice quality conversion device receives the input voice, the PPG conversion model generated by the PPG conversion model learning unit, and the voice parameter generation model generated by the voice parameter generation model learning unit; generates a PPG on the basis of the PPG conversion model and the voice feature amount output as a result of feature amount analysis of the voice information included in the voice corpus; generates voice parameters on the basis of the generated PPG and the voice parameter generation model; generates a voice waveform from the voice parameters; and outputs it as output voice.

According to the present invention, it is possible to provide a voice quality conversion system and a voice quality conversion method that enable stable, high-quality conversion when converting the voice quality of speech.

A diagram showing the configuration and data flow of the voice quality conversion system.
A hardware configuration diagram of the voice quality conversion device.
A hardware configuration diagram of a voice quality conversion system consisting of a client-server system.
A diagram showing the functional configuration and data flow of the voice quality conversion data creation device.
A diagram showing the functional configuration and data flow of the PPG conversion model learning unit.
A table showing the result of morphological analysis of the text 「これは橋です。」 ("This is a bridge.").
A table showing the result of morphological analysis of the text 「これは箸です。」 ("These are chopsticks.").
A diagram showing the functional configuration and data flow of the voice model learning unit.
A diagram showing the functional configuration and data flow of the voice quality conversion device.
A diagram showing the processing and data flow of a voice feature amount learning system using generative adversarial network learning.
A diagram showing the processing flow and data flow for constructing a multilingual speech synthesis system using a voice corpus converted by voice quality conversion.
A diagram showing the processing flow and data flow for constructing a system that synthesizes speech for input text using a voice corpus converted by voice quality conversion.

Hereinafter, embodiments of the present invention will be described with reference to FIGS. 1 to 11.

[Embodiment 1]
First, the configuration of the voice quality conversion system will be described with reference to FIGS. 1 and 3.
As shown in FIG. 1, a typical voice quality conversion system comprises a voice quality conversion data creation device 200 and a voice quality conversion device 100. The voice quality conversion data creation device 200 is a device that generates voice quality conversion data 20 from a voice corpus 10. The voice quality conversion device 100 is a device that uses the voice quality conversion data 20 to convert input voice 30 into synthesized voice 40 having a desired voice quality and outputs it.

The voice corpus 10 is data in which voice files and text are associated with each other. The voice quality conversion data 20 consists of PPG conversion model 1 to PPG conversion model N and a voice parameter generation model (details will be described later).

The following description focuses on the functional configuration of each device and the processing it performs; these functional units may be realized as hardware or as software programs.

In the following description, a Japanese voice corpus is used as the example for learning, but processing is also possible with a voice corpus in another natural language or one in which multiple languages are mixed. In that case, however, programs and data corresponding to that language must be used.

Furthermore, the following description assumes that, for example, a DNN (Deep Neural Network) is used as the voice quality conversion method, but other statistics-based methods may be used.

Next, the hardware configuration of the voice quality conversion device will be described with reference to FIG. 2.
The voice quality conversion device 100 can be realized by a general information processing device. As shown in FIG. 2, it comprises, for example, an auxiliary storage device 101, a voice input I/F (InterFace) 102, a CPU 103, a main memory 104, and a voice output I/F 105, which are connected by a bus 107.

The CPU 103 controls each part of the voice quality conversion device 100, loads the necessary programs into the main memory 104, and executes them.
The main memory 104 is usually composed of volatile memory such as RAM, and stores the programs executed by the CPU 103 and the data they refer to.

The voice input I/F 102 is an interface that is connected to a microphone or the like and is used to input voice signals.
The voice output I/F 105 is an interface that is connected to a speaker or the like and is used to output voice signals.

Voice input and output may also take the form of encoded voice data, such as WAVE files or MP3 files.

The auxiliary storage device 101 is a storage device with a large storage capacity, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive).
Although not illustrated, the auxiliary storage device 101 has installed on it a feature amount analysis program, a PPG extraction program, a merge program, a voice parameter generation program, and a waveform generation program, which are the programs for executing the functions of the voice quality conversion device 100 of the present embodiment.

The feature amount analysis program, PPG extraction program, merge program, voice parameter generation program, and waveform generation program execute the functions of the feature amount analysis unit, PPG extraction unit, merge unit, voice parameter generation unit, and waveform generation unit, respectively. The processing of these functional units will be described in detail later.

The auxiliary storage device 101 also stores the various data used by the voice quality conversion device 100. As described later, these data include voice feature amounts, PPGs, the voice parameter generation model, and voice parameters.

Similarly, the voice quality conversion data creation device 200 can be realized by an information processing device having the same configuration as the voice quality conversion device 100 of FIG. 2.

Although not illustrated, the auxiliary storage device 101 of the voice quality conversion data creation device 200 has installed on it a PPG conversion model learning program and a voice parameter generation model learning program, which are the programs for executing the functions of the voice quality conversion data creation device 200 of the present embodiment.

The PPG conversion model learning program and the voice parameter generation model learning program execute the functions of the PPG conversion model learning unit and the voice parameter generation model learning unit, respectively. The processing of these functional units will be described in detail later.

The auxiliary storage device 101 also stores the various data used by the voice quality conversion data creation device 200. As described later, these data include the word dictionary, the voice corpus, and the voice parameter generation model.

The voice quality conversion device 100 is incorporated as a voice quality conversion unit in a device such as a car navigation device, a mobile phone, or a personal computer. Accordingly, each piece of hardware shown in FIG. 2 may be realized by the device in which the voice quality conversion device 100 is incorporated, or may be provided separately from the device in which the voice quality conversion data creation device 200 and the voice quality conversion device 100 are incorporated.

All the functions related to voice quality conversion may be realized on just one or two devices, but they can also be realized, as in the modification shown in FIG. 3, by a system in which a server 300 and client terminals 400 (denoted 400A and 400B in FIG. 3) are interconnected via a network 5.

In this case, the client terminal 400 has a voice input I/F 402, a voice output I/F 405, and a communication I/F 406. The client terminal 400 may accept the voice, the server 300 may take charge of some or all of the functions related to voice quality conversion, and the necessary data may be exchanged between the communication I/F 306 of the server 300 and the communication I/F 406 of the client terminal 400.

Next, the functions and processing of the voice quality conversion system according to Embodiment 1 will be described with reference to FIGS. 4 to 8.
First, the functional configuration and data flow of the voice quality conversion data creation device will be described with reference to FIG. 4.
As shown in FIG. 4, the voice quality conversion data creation device 200 has, as its functional configuration, a PPG conversion model learning unit 210 and a voice parameter generation model learning unit 220.

The PPG conversion model learning unit 210 is a functional unit that generates a PPG conversion model 700 (details will be described later) by learning from a word dictionary 500 and a voice corpus 510.

As defined in Non-Patent Document 1, a PPG (phoneme posterior probability) is the posterior probability of each phoneme class in a given utterance (in Non-Patent Document 1, a matrix representing the posterior probabilities over time versus phoneme class).
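The matrix form of a PPG can be illustrated with a small sketch. It assumes that an acoustic model produces one score per phoneme class for each time frame and that the scores are normalized with a softmax. The class inventory, the scores, and the use of softmax are illustrative assumptions, not details given in the source.

```python
import math

def softmax(scores):
    """Normalize one frame of class scores into posterior probabilities."""
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def posteriorgram(frame_scores):
    """Build a time x phoneme-class matrix: one row of posteriors per frame."""
    return [softmax(frame) for frame in frame_scores]

# Hypothetical phoneme classes and per-frame scores (3 frames x 4 classes).
PHONEME_CLASSES = ["a", "i", "sh", "h"]
frame_scores = [
    [2.0, 0.1, 0.0, 1.0],
    [0.0, 3.0, 0.5, 0.2],
    [0.3, 0.2, 2.5, 0.1],
]
ppg = posteriorgram(frame_scores)
for row in ppg:
    print(" ".join(f"{p:.2f}" for p in row))
```

Each row sums to 1, so a row can be read as the model's belief about which phoneme class is being uttered in that frame.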

The voice parameter generation model learning unit 220 is a functional unit that generates a voice parameter generation model from a generation voice corpus 520 (which has the same data structure as the voice corpus 510) and the PPG conversion model 700.

Next, the functions and data flow of the PPG conversion model learning unit will be described in detail with reference to FIG. 5.
As described above, the PPG conversion model learning unit 210 generates the PPG conversion model 700 by learning from the word dictionary 500 and the voice corpus 510. The PPG conversion model 700 is a model that converts the input voice 30 into a PPG containing utterance content and utterance style information. As shown in FIG. 5, the PPG conversion model learning unit 210 is composed of the following sub-functional units: a morphological analysis unit 211, a language model learning unit 212, a feature amount analysis unit 213, a dictionary reading extension unit 214, a morpheme sequence extension unit 215, a phoneme sequence & acoustic model learning unit 216, and a language-model-aware acoustic model learning unit 217.

The morphological analysis unit 211, language model learning unit 212, feature amount analysis unit 213, dictionary reading extension unit 214, morpheme sequence extension unit 215, phoneme sequence & acoustic model learning unit 216, and language-model-aware acoustic model learning unit 217 can be realized by executing, as subroutines of the PPG conversion model learning program, a morphological analysis program, a language model learning program, a feature amount analysis program, a dictionary reading extension program, a morpheme sequence extension program, a phoneme sequence & acoustic model learning program, and a language-model-aware acoustic model learning program, respectively.

The morphological analysis unit 211 is a functional unit that divides text into morphemes using the word dictionary 500 prepared in advance. A morpheme is the smallest unit of expression that carries meaning in linguistics. To realize the function of the morphological analysis unit 211, commonly used OSS (Open Source Software) morphological analysis tools such as MeCab and ChaSen can be used.

It is assumed that the word dictionary 500 always provides readings in the language concerned. The morphological analysis unit 211 then generates a morpheme sequence with reading information 600 for the input text.

Although the description of this embodiment states that a "word dictionary" is input to the morphological analysis unit 211, the unit of the dictionary need not be a word; it may be a phrase or a sentence.

As an example, the result of morphological analysis of the text 「これは箸です。」 ("These are chopsticks.") is as shown in FIG. 6A, while the result of morphological analysis of the text 「これは橋です。」 ("This is a bridge.") is as shown in FIG. 6B.

The language model learning unit 212 is a functional unit that performs language model learning using the morpheme sequence with reading information 600 generated by the morphological analysis unit 211 and creates a language model. In language model learning, a language model generally called N-gram is often used. N-gram is a technique that divides an arbitrary character string or document into units of N consecutive characters. In recent years, language models using an RNN (Recurrent Neural Network) have also come into use.
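The N-gram splitting described above can be sketched as follows. This shows only the division of a string into runs of N consecutive characters, with counts of each run; it is not the language model training procedure itself, which the source does not detail. The function name `char_ngrams` is hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = char_ngrams("これは橋です", 2)
print(bigrams)  # ['これ', 'れは', 'は橋', '橋で', 'です']

# Counting the n-grams over a corpus is the usual starting point for
# estimating N-gram probabilities.
counts = Counter(char_ngrams("これは橋です。これは箸です。", 2))
```

In the embodiment the model is trained on morpheme sequences with reading information rather than on raw characters, so the units in practice would be morphemes; the sliding-window idea is the same.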

The feature amount analysis unit 213 is a functional unit that extracts feature amounts from the voice included in the voice corpus 520. As shown in FIG. 5, the voice corpus 520 is data in which utterance text 521 and voice 522 are associated one-to-one. MFCC is the most commonly used voice feature, but some studies also use prosodic information such as LF0. MFCC (Mel Frequency Cepstral Coefficients) are features obtained by weighting the low-order components of the log cepstrum (which expresses frequency characteristics originating from the vocal tract) in consideration of human frequency perception. LF0 is the logarithm of the fundamental frequency F0.
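Two of the quantities mentioned here can be made concrete with a short sketch: the mel warping that underlies MFCC, and LF0 as the logarithm of F0. The source does not state which mel formula is used. The formula below is one widely used definition, and the handling of unvoiced frames is an assumption made for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel (a common formula; assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def log_f0(f0_hz, unvoiced_value=None):
    """LF0: natural log of F0 per frame; unvoiced frames (F0 <= 0) get a placeholder."""
    return [math.log(f) if f > 0 else unvoiced_value for f in f0_hz]

# Equal steps in Hz shrink on the mel scale as frequency rises, which is
# how low frequencies end up represented more finely.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(4100.0) - hz_to_mel(4000.0)

lf0 = log_f0([120.0, 0.0, 240.0])
```

Here `low_step` is larger than `high_step`, reflecting the finer resolution at low frequencies.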

Any feature amount may be used in the present embodiment, but at a minimum it must contain articulatory information and prosodic information. That is, when MFCC is used, using only the low-order dimensions leaves out prosodic information, so it is recommended to use all dimensions (all 40 dimensions in the case of 16 kHz voice).

Here, articulation means producing various vowels and consonants by using the shape and movement of the organs above the larynx to control the flow of air in the vocal organs, to change how the voice generated in the vocal organs resonates, or to generate or add new sounds. Prosody refers to phonetic properties that appear in utterances and cannot be predicted from the ordinary written record of the language, such as intonation or tone, stress, length, and rhythm.

The dictionary reading extension unit 214 adds prosody symbols to the reading information (phoneme information) attached to the word dictionary 500 prepared in advance for morphological analysis, extending it into phonemes with prosody information, and generates a phoneme dictionary with prosody information 602. In Japanese, syllables are sometimes given as the reading information, so hereinafter the term "phoneme" may also refer to a "syllable". In linguistics, a phoneme is a set of sounds regarded as the same within a particular language, and a syllable is a kind of segmental unit that divides continuous speech sounds.

These phonemes with prosody information need to match the characteristics of each language, and the prosodic information that carries linguistic information must be defined. For example, in a pitch-accent language such as Japanese, the relative position of F0 between syllables is an important cue for distinguishing accents, so every vowel can be tagged with "H", meaning high pitch, or "L", meaning low pitch. In a tone language such as Chinese, on the other hand, the F0 pattern within a syllable plays an important role in understanding meaning, so vowel phonemes can be tagged with tone symbols, namely the digits 1 to 5 (the neutral tone and the first to fourth tones, in the case of so-called Putonghua). Furthermore, in consideration of accent changes (tone sandhi in Chinese), registering multiple prosodic patterns for the same word makes it possible to capture the actual prosodic variation of speech accurately.

一例としては、単語「橋」に対して、拡張前は、「表記＝橋；読み＝／ハ／＋／シ／」となっているとして、拡張後は、上記の日本語の場合の韻律シンボルを付加し、「表記＝橋；読み１＝／ハＬ／＋／シＨ／：読み２＝／ハＨ／＋／シＨ／」に拡張し、すべての話しうるアクセント型をリストする。一方、単語「箸」に対しては、拡張前は「表記＝箸；読み＝／ハ／＋／シ／」となっていることに対して、拡張後は「表記＝箸；読み１＝／ハＨ／＋／シＬ／：読み２＝／ハＨ／＋／シＨ／」に拡張する。 As an example, suppose the entry for the word 橋 ("bridge") before expansion is "notation = 橋; reading = /ha/ + /shi/". After expansion, the Japanese prosodic symbols described above are added and the entry becomes "notation = 橋; reading 1 = /haL/ + /shiH/; reading 2 = /haH/ + /shiH/", listing every accent type with which the word can be spoken. Likewise, the entry for the word 箸 ("chopsticks") is "notation = 箸; reading = /ha/ + /shi/" before expansion and becomes "notation = 箸; reading 1 = /haH/ + /shiL/; reading 2 = /haH/ + /shiH/" after expansion.
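The dictionary expansion described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Entry` class, the `ACCENT_PATTERNS` table, and the `expand_entry` function are hypothetical names that do not appear in the patent, and the romanized syllables stand in for the kana readings. The H/L patterns are the ones listed for 橋 and 箸 in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entry:
    surface: str                 # written form, e.g. "橋"
    syllables: Tuple[str, ...]   # plain reading without prosody, e.g. ("ha", "shi")


# Hypothetical accent table: one H/L symbol per syllable, one tuple per accent type.
ACCENT_PATTERNS: Dict[str, List[Tuple[str, ...]]] = {
    "橋": [("L", "H"), ("H", "H")],
    "箸": [("H", "L"), ("H", "H")],
}


def expand_entry(entry: Entry,
                 accent_patterns: Dict[str, List[Tuple[str, ...]]]) -> List[Tuple[str, ...]]:
    """Return every prosody-tagged reading registered for one dictionary entry."""
    readings = []
    for pattern in accent_patterns.get(entry.surface, []):
        if len(pattern) != len(entry.syllables):
            raise ValueError("accent pattern length must match syllable count")
        # Attach the prosodic symbol to each syllable: "ha" + "L" -> "haL".
        readings.append(tuple(s + p for s, p in zip(entry.syllables, pattern)))
    return readings


bridge_readings = expand_entry(Entry("橋", ("ha", "shi")), ACCENT_PATTERNS)
chopstick_readings = expand_entry(Entry("箸", ("ha", "shi")), ACCENT_PATTERNS)
print(bridge_readings)
print(chopstick_readings)
```

Both words share the plain reading /ha/ + /shi/, so only the attached prosodic symbols tell their first readings apart.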

すなわち、従来では、単語辞書５００から音素配列を生成するのみであったが、本実施形態では、韻律情報付き辞書６０２により、韻律情報付き音素配列６０３を生成する。そのため、従来では音素配列だけでは一意に特定できない同音異義語に対しても、アクセントの違いによって、特定することができるようになる。すなわち、韻律情報付き音素を導入することにより、音声認識時に韻律情報を考慮することとなり、音響モデルの出力であるＰＰＧには韻律情報が含まれることになる。 That is, conventionally, only the phoneme sequence is generated from the word dictionary 500, but in the present embodiment, the phoneme sequence 603 with prosody information is generated by the dictionary 602 with prosody information. Therefore, even homonyms that cannot be uniquely identified by phoneme arrangements in the past can be identified by the difference in accent. That is, by introducing a phoneme with prosody information, the prosody information is taken into consideration at the time of speech recognition, and the PPG which is the output of the acoustic model includes the prosody information.

形態素配列拡張部２１５は、形態素の読みを複数に展開し、すべての読みうるパターンを用意し、韻律情報付き音素配列６０３に変換する機能部である。 The morpheme sequence expansion unit 215 is a functional unit that expands the reading of each morpheme into multiple readings, prepares every possible reading pattern, and converts them into phoneme sequences 603 with prosody information.

例えば、「これは橋です。」に対して、「／コＬ／＋／レＨ／＋／ワＨ／＋／ハＬ／＋／シＨ／＋／デＨ／＋／スＬ／」や「／コＬ／＋／レＨ／＋／ワＨ／＋／ハＨ／＋／シＨ／＋／デＨ／＋／スＬ／」に展開される。音素配列の数は、各単語に登録されている全読み数の組み合わせとなる。 For example, the sentence これは橋です ("This is a bridge.") is expanded into "/koL/ + /reH/ + /waH/ + /haL/ + /shiH/ + /deH/ + /suL/" and "/koL/ + /reH/ + /waH/ + /haH/ + /shiH/ + /deH/ + /suL/". The number of phoneme sequences equals the number of combinations of all the readings registered for each word.
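The combinatorial expansion can be illustrated with a Cartesian product over the readings of each word. The sketch below is an assumption-laden illustration, not the patent's implementation: `PROSODY_DICT` and `expand_sentence` are hypothetical names, and the morpheme segmentation is supplied by hand rather than by a morphological analyzer.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical prosody-tagged dictionary: each word maps to its registered readings.
PROSODY_DICT: Dict[str, List[Tuple[str, ...]]] = {
    "これ": [("koL", "reH")],
    "は": [("waH",)],
    "橋": [("haL", "shiH"), ("haH", "shiH")],
    "です": [("deH", "suL")],
}


def expand_sentence(morphemes: List[str],
                    prosody_dict: Dict[str, List[Tuple[str, ...]]]) -> List[Tuple[str, ...]]:
    """Return every prosody-tagged phoneme sequence the morpheme list can be read as."""
    per_word = [prosody_dict[m] for m in morphemes]
    # One candidate per combination of readings; flatten each combination into one sequence.
    return [tuple(s for reading in combo for s in reading)
            for combo in product(*per_word)]


candidates = expand_sentence(["これ", "は", "橋", "です"], PROSODY_DICT)
for seq in candidates:
    print(" + ".join(f"/{s}/" for s in seq))
```

Here the count is 1 × 1 × 2 × 1 = 2 candidates, matching the two sequences given in the text; a sentence with several multi-reading words grows multiplicatively.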

音素配列決定＆音響モデル学習部２１６は、形態素配列拡張部２１５が生成した複数の韻律情報付き音素配列６０３から、最も確率の高い組み合わせを決定したうえ、各音素の特徴（音声特徴量であるＭＦＣＣの平均と分散）を計算し、音響モデル６２０を生成する機能部である。一般的に、最適系列の決定にＨＭＭ（Hidden Markov Model：隠れマルコフモデル）がよく使われているが、音響モデルの学習では、ＤＮＮを用いることが主流となっている。 The phoneme sequence determination & acoustic model learning unit 216 is a functional unit that determines the most probable combination from the plurality of phoneme sequences 603 with prosody information generated by the morpheme sequence expansion unit 215, calculates the characteristics of each phoneme (the mean and variance of the MFCC voice features), and generates the acoustic model 620. An HMM (Hidden Markov Model) is commonly used to determine the optimal sequence, whereas DNNs have become the mainstream for training acoustic models.
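As a rough illustration of selecting the most probable candidate from per-phoneme means and variances, the sketch below scores each candidate sequence with diagonal Gaussian log-likelihoods. It makes strong simplifying assumptions that are not in the patent: one feature frame per phoneme (no HMM alignment), toy one-dimensional vectors standing in for MFCC frames, and hypothetical function names throughout.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vec = List[float]
Stats = Dict[str, Tuple[Vec, Vec]]  # phoneme -> (mean, variance)


def fit_gaussians(aligned: Sequence[Tuple[str, Vec]], var_floor: float = 1e-3) -> Stats:
    """Estimate the per-phoneme mean and variance of the feature vectors."""
    grouped: Dict[str, List[Vec]] = defaultdict(list)
    for phoneme, vec in aligned:
        grouped[phoneme].append(vec)
    stats: Stats = {}
    for phoneme, vecs in grouped.items():
        dim = len(vecs[0])
        mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        var = [max(sum((v[d] - mean[d]) ** 2 for v in vecs) / len(vecs), var_floor)
               for d in range(dim)]
        stats[phoneme] = (mean, var)
    return stats


def log_likelihood(vec: Vec, mean: Vec, var: Vec) -> float:
    """Log density of vec under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
               for x, m, s in zip(vec, mean, var))


def best_sequence(candidates, frames: Sequence[Vec], stats: Stats):
    """Pick the candidate whose phonemes best explain the frames (one frame per phoneme)."""
    def score(seq):
        return sum(log_likelihood(f, *stats[p]) for p, f in zip(seq, frames))
    return max(candidates, key=score)


# Toy training data: low-pitch syllables near 1.0, high-pitch syllables near 3.0.
training = [("haL", [1.0]), ("haL", [1.2]), ("haH", [3.0]), ("haH", [2.8]),
            ("shiH", [3.0]), ("shiH", [3.2]), ("shiL", [1.0]), ("shiL", [0.8])]
stats = fit_gaussians(training)
chosen = best_sequence([("haL", "shiH"), ("haH", "shiH")], [[1.1], [2.9]], stats)
print(chosen)
```

A real system would replace the one-frame-per-phoneme shortcut with an alignment step (for example HMM-based), as the text notes.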

言語モデル考慮音響モデル学習部２１３は、言語モデル学習部２１２が生成した言語モデル６１０と、音素配列決定＆音響モデル学習部２１０が生成した音響モデル６２０を用いて、音声コーパス５２０に対してエラー率最小化の基準で再学習を行い、ＰＰＧ変換モデル７００を生成する機能部である。このように学習したＰＰＧ変換モデル７００は、言語情報の伝達に必要な韻律情報を表現できるため、言語の特徴によって、表現できる韻律情報が異なる。 The language-model-aware acoustic model learning unit 213 is a functional unit that uses the language model 610 generated by the language model learning unit 212 and the acoustic model 620 generated by the phoneme sequence determination & acoustic model learning unit 210 to retrain on the voice corpus 520 under a minimum-error-rate criterion and generate the PPG conversion model 700. Because the PPG conversion model 700 trained in this way can express the prosodic information needed to convey linguistic information, the prosodic information it can express differs depending on the characteristics of the language.

例えば、高低アクセント言語（音節間のＦ０相対位置が単語の区別に寄与する言語）である日本語なら広域（複数シラブルにまたいだ範囲）のＦ０変動、声調言語（音節内のＦ０パターンの形状の違いが単語の区別に寄与する言語）である中国語なら局所的な（音節内の）Ｆ０変動をとらえることができる。それに対して、強弱アクセント言語である英語では音の強弱を表現することができると考えられる。 For example, for Japanese, a pitch-accent language (a language in which the relative position of F0 between syllables helps distinguish words), the model can capture wide-range F0 variation (spanning multiple syllables); for Chinese, a tone language (a language in which differences in the shape of the F0 pattern within a syllable help distinguish words), it can capture local (within-syllable) F0 variation. For English, a stress-accent language, the model is expected to be able to express the strength of sounds.

次に、図７を用いて音声パラメータ生成モデル学習部の機能とデータフローの詳細について説明する。 Next, the functions and data flow of the voice parameter generation model learning unit will be described in detail with reference to FIG. 7.
音声パラメータ生成モデル学習部２２０では、上述のように、生成用音声コーパス５２０とＰＰＧ変換モデル７００より、音声パラメータ生成モデル１０００を生成する機能部である。 As described above, the voice parameter generation model learning unit 220 is a functional unit that generates the voice parameter generation model 1000 from the generation voice corpus 520 and the PPG conversion model 700.

音声パラメータ生成モデル学習部２２０は、図７に示されるように、ＰＰＧ抽出部２２１、ＰＰＧマージ部２２２、音声パラメータ抽出部２２３、音声モデル学習部２２４、特徴量解析部２２５のサブ機能部で構成されている。 As shown in FIG. 7, the voice parameter generation model learning unit 220 consists of the following sub-function units: a PPG extraction unit 221, a PPG merge unit 222, a voice parameter extraction unit 223, a voice model learning unit 224, and a feature amount analysis unit 225.

ＰＰＧ抽出部２２１、ＰＰＧマージ部２２２、音声パラメータ抽出部２２３、モデル学習部２２４、特徴量解析部２２５は、それぞれ、音声パラメータ生成モデル学習プログラムのサブルーチンとして、ＰＰＧ抽出プログラム、ＰＰＧマージプログラム、音声パラメータ抽出プログラム、音声モデル学習プログラム、特徴量解析プログラムを実行することにより実現することができる。 The PPG extraction unit 221, the PPG merge unit 222, the voice parameter extraction unit 223, the model learning unit 224, and the feature amount analysis unit 225 can be realized by executing, as subroutines of the voice parameter generation model learning program, a PPG extraction program, a PPG merge program, a voice parameter extraction program, a voice model learning program, and a feature amount analysis program, respectively.

ＰＰＧ抽出部２２１は、図５で説明したＰＰＧ変換モデル学習部２１０で得られたＰＰＧ変換モデル７００（図７では、ＰＰＧ変換モデル１：７００−１〜ＰＰＧ変換モデルＮ：７００−Ｎと表記）を用いて、生成用音声コーパス５１１から特徴量解析部２２５により取り出された音声特徴量６４０に対して、ＰＰＧ８００（図７では、ＰＰＧ１：８００−１〜ＰＰＧＮ：８００−Ｎと表記）を抽出する機能部である。なお、特徴量解析部２２５は、図５に示した特徴量解析部２１３と同様である。ここで、複数のＰＰＧ変換モデルを用いることによって、正確な韻律表現が可能となる。具体的には、音節をまたいでゆっくり変化するＦ０の動きを表現できる日本語ＰＰＧと、音節内の局所的なＦ０変化をとらえられる中国語ＰＰＧとを組み合わせることにより、Ｆ０パターンを充実して表現することができる。すなわち、複数の特徴の異なる言語を組み合わせることによって、入力音声のどの特徴を出力音声に残したいのかを、デザインすることができる。この点で、本発明者の実証では、日本語、中国語、英語の３言語を用いることにより、発音の強弱や発話のイントネーションを精度よく再現できることを確認することができた。 The PPG extraction unit 221 is a functional unit that uses the PPG conversion models 700 obtained by the PPG conversion model learning unit 210 described with reference to FIG. 5 (shown in FIG. 7 as PPG conversion model 1: 700-1 to PPG conversion model N: 700-N) to extract PPGs 800 (shown in FIG. 7 as PPG 1: 800-1 to PPG N: 800-N) from the voice feature amount 640 that the feature amount analysis unit 225 extracts from the generation voice corpus 511. The feature amount analysis unit 225 is the same as the feature amount analysis unit 213 shown in FIG. 5. Using a plurality of PPG conversion models enables accurate prosodic expression. Specifically, combining a Japanese PPG, which can express F0 movement that changes slowly across syllables, with a Chinese PPG, which can capture local F0 changes within a syllable, allows the F0 pattern to be expressed more fully. In other words, by combining a plurality of languages with different characteristics, one can design which characteristics of the input voice are to be retained in the output voice. In this regard, the inventor's experiments confirmed that using the three languages Japanese, Chinese, and English makes it possible to reproduce the strength of pronunciation and the intonation of utterances accurately.

ＰＰＧマージ部２２２は、ＰＰＧ抽出部２２１から得られた複数のＰＰＧ８００を一つのベクトルにマージする機能部である。ここでは、単に複数のベクトルをつなげ合わせて、次元数の大きなベクトルにすることも考えられるが、ＡｕｔｏＥｎｃｏｄｅｒなどの次元圧縮技術を使って、小さいベクトルに圧縮することもできる。 The PPG merge unit 222 is a functional unit that merges the plurality of PPGs 800 obtained from the PPG extraction unit 221 into a single vector. The vectors may simply be concatenated into one higher-dimensional vector, or they may be compressed into a smaller vector using a dimensionality reduction technique such as an autoencoder.
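The simpler of the two merge strategies, frame-wise concatenation, might look like the following sketch. The function name `merge_ppgs` and the toy posterior values are hypothetical; the autoencoder-based compression mentioned in the text is not shown.

```python
from typing import List

Frame = List[float]


def merge_ppgs(ppgs: List[List[Frame]]) -> List[Frame]:
    """Concatenate per-language PPGs frame by frame into one wider vector per frame."""
    n_frames = len(ppgs[0])
    if any(len(p) != n_frames for p in ppgs):
        raise ValueError("all PPGs must cover the same number of frames")
    return [[x for p in ppgs for x in p[t]] for t in range(n_frames)]


# Toy posteriors: 2 frames, 3 Japanese classes and 2 Chinese classes per frame.
ppg_ja = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
ppg_zh = [[0.6, 0.4], [0.3, 0.7]]
merged = merge_ppgs([ppg_ja, ppg_zh])
print(merged)
```

With concatenation the merged dimension is the sum of the per-language class counts (here 3 + 2 = 5), which is why a compression step may become attractive as more languages are added.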

音声パラメータ抽出部２２３は、生成用音声コーパス５１１の音声から音声合成用の音声パラメータ２２３を抽出する。この部分は、一般的に、音声合成にも使われている技術であり、ＳｔｒａｉｇｈｔやＷｏｒｌｄなどのＯＳＳを利用すれば、高品質な合成音声を得ることができる。 The voice parameter extraction unit 223 extracts the voice parameter 223 for voice synthesis from the voice of the generation voice corpus 511. This is a technique that is also commonly used in speech synthesis, and high-quality synthetic speech can be obtained by using OSS such as STRAIGHT or WORLD.

音声モデル学習部２２４は、同じ音声から抽出され、ＰＰＧマージ部２２２によりマージされたＰＰＧ８００と音声パラメータ抽出部２２３が抽出した音声パラメータ９００に対して、変換用ＤＮＮの学習により、音声パラメータ生成モデル１０００を生成する。すなわち、入力がＰＰＧ８００と音声パラメータ９００であり、その出力として、入力した音声パラメータ９００の音声パラメータ生成モデル１０００が得られる。一般的に音声のような時系列信号に対しては、Ｂｉ−ＬＳＴＭ（Bidirectional Long Short Term Memory：双方向長期短期記憶）を用いたほうがより高い性能が得られる。 The voice model learning unit 224 generates the voice parameter generation model 1000 by training a conversion DNN on the PPG 800 merged by the PPG merge unit 222 and the voice parameter 900 extracted by the voice parameter extraction unit 223, both of which are derived from the same voice. That is, the inputs are the PPG 800 and the voice parameter 900, and the output is the voice parameter generation model 1000 for the input voice parameter 900. For time-series signals such as voice, a Bi-LSTM (Bidirectional Long Short-Term Memory) generally gives higher performance.

次に、図８を用いて声質変換装置の機能とデータフローについて説明する。 Next, the functions and data flow of the voice quality conversion device will be described with reference to FIG. 8.
声質変換装置１００は、図８に示されるように、特徴量解析部１１０、ＰＰＧ抽出部１１１、ＰＰＧマージ部１１２、音声パラメータ生成部１１３、波形生成部１１４を有している。 As shown in FIG. 8, the voice quality conversion device 100 includes a feature amount analysis unit 110, a PPG extraction unit 111, a PPG merge unit 112, a voice parameter generation unit 113, and a waveform generation unit 114.

特徴量解析部１１０は、図５に示した声質変換用データ作成装置２００のＰＰＧ変換モデル学習部２１０の特徴量解析部２１３と同じ機能構成部である。ＰＰＧ抽出部１１１とＰＰＧマージ部１１２は、図７に示した声質変換用データ作成装置２００の音声パラメータ生成モデル学習部２１０のＰＰＧ抽出部２２１とＰＰＧマージ部２２２と、それぞれ同じ機能構成部である。 The feature amount analysis unit 110 is the same functional component as the feature amount analysis unit 213 of the PPG conversion model learning unit 210 of the voice quality conversion data creation device 200 shown in FIG. 5. The PPG extraction unit 111 and the PPG merge unit 112 are the same functional components as the PPG extraction unit 221 and the PPG merge unit 222, respectively, of the voice parameter generation model learning unit 210 of the voice quality conversion data creation device 200 shown in FIG. 7.

音声パラメータ生成部１１３は、声質変換用データ作成装置２００の音声パラメータ生成モデル学習部２１０で得られた音声パラメータ生成モデル１０００と、入力音声３０とＰＰＧ変換モデル７００から得られたＰＰＧ８００を入力し、音声パラメータ９００を生成する。音声パラメータ９００は、例えば、音声の高さに相当する基本周波数、音色に相当するスペクトル包絡、有声音のかすれに相当する非周期性指標（Aperiodicity）がある。 The voice parameter generation unit 113 takes as input the voice parameter generation model 1000 obtained by the voice parameter generation model learning unit 210 of the voice quality conversion data creation device 200 and the PPG 800 obtained from the input voice 30 and the PPG conversion model 700, and generates the voice parameter 900. The voice parameter 900 includes, for example, the fundamental frequency, which corresponds to the pitch of the voice; the spectral envelope, which corresponds to the timbre; and the aperiodicity index, which corresponds to the hoarseness of voiced sounds.

波形生成部（ボコーダーともいう）１１４は、生成された音声パラメータ９００を用いて、音声波形を生成し、変換した音声を出力音声４０として出力する。 The waveform generation unit (also referred to as a vocoder) 114 generates a voice waveform using the generated voice parameter 900, and outputs the converted voice as the output voice 40.
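The data flow through units 110 to 114 can be summarized as a chain of function calls. The sketch below only shows how the stages connect: the `VoiceConverter` class is a hypothetical name, and the lambda stubs are placeholders with no acoustic meaning, standing in for trained models and a real vocoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]


@dataclass
class VoiceConverter:
    analyze: Callable[[Sequence[float]], List[Frame]]        # feature amount analysis (110)
    ppg_models: List[Callable[[List[Frame]], List[Frame]]]   # PPG conversion models (700-1..N)
    generate_params: Callable[[List[Frame]], List[Frame]]    # voice parameter generation model (1000)
    vocoder: Callable[[List[Frame]], List[float]]            # waveform generation (114)

    def convert(self, waveform: Sequence[float]) -> List[float]:
        features = self.analyze(waveform)
        # PPG extraction (111): one PPG per language model.
        ppgs = [model(features) for model in self.ppg_models]
        # PPG merge (112): frame-wise concatenation.
        merged = [[x for p in ppgs for x in p[t]] for t in range(len(features))]
        # Voice parameter generation (113), then waveform generation (114).
        return self.vocoder(self.generate_params(merged))


# Placeholder stages, for shape checking only.
converter = VoiceConverter(
    analyze=lambda w: [list(w[i:i + 2]) for i in range(0, len(w), 2)],
    ppg_models=[lambda feats: [[0.5, 0.5] for _ in feats],
                lambda feats: [[1.0] for _ in feats]],
    generate_params=lambda merged: [[sum(f)] for f in merged],
    vocoder=lambda params: [p[0] for p in params for _ in range(2)],
)
output = converter.convert([0.0, 0.1, 0.2, 0.3])
print(output)
```

Keeping each stage behind a callable mirrors the text's point that the conversion device reuses the same analysis, extraction, and merge components as the training side.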

本実施形態の声質変換システムによれば、韻律情報を含んだＰＰＧ変換モデルによって、ＰＰＧを生成し、そのＰＰＧを用いて音声パラメータを生成する。したがって、その音声パラメータによった声質変換により、話者の言語特有の韻律が考慮され、安定して高い音質の音質変換が可能となる。 According to the voice quality conversion system of the present embodiment, PPG is generated by a PPG conversion model including prosodic information, and voice parameters are generated using the PPG. Therefore, the voice quality conversion based on the voice parameters takes into consideration the prosody peculiar to the speaker's language, and enables stable and high-quality sound conversion.

〔実施形態２〕 [Embodiment 2]
以下、本発明の実施形態２を、図９を用いて説明する。 Hereinafter, Embodiment 2 of the present invention will be described with reference to FIG. 9.
本実施形態では、実施形態１に示した声質変換装置を用いた応用の一つとして、敵対的生成ネットワーク（ＧＡＮ：Generative adversarial network）学習を用いた音声特徴量学習システムの処理とそのデータフローを説明する。 In the present embodiment, as one application of the voice quality conversion device shown in Embodiment 1, the processing and data flow of a voice feature learning system that uses generative adversarial network (GAN) learning are described.

本実施形態の音声特徴量学習システムは、敵対的生成ネットワークに基づいた学習を行うものであり、ＧｅｎｅｒａｔｏｒＴｒａｉｎｉｎｇＳｔａｇｅとＤｉｓｃｒｉｍｉｎａｔｏｒＴｒａｉｎｉｎｇＳｔａｇｅの二つの段階よりなる。 The voice feature learning system of the present embodiment performs learning based on a generative adversarial network and consists of two stages: a Generator Training Stage and a Discriminator Training Stage.

本実施形態の音声特徴量学習システムは、異なる発話者から収録した多言語音声コーパス２０１０から、統一した声質の音声特徴量を抽出することができるシステムである。これにより最適化された音声特徴量を用いて、マルチリンガル音声合成システムを構築することができる。 The voice feature learning system of the present embodiment is a system capable of extracting voice features with a unified voice quality from a multilingual voice corpus 2010 recorded from different speakers. This makes it possible to construct a multilingual speech synthesis system using the optimized speech features.

先ず、音声特徴量学習システムでは、言語が異なり、声質も異なる多言語音声コーパス２０１０（図９では、音声コーパス２０１０−１，音声コーパス２０１０−２，…と表記）から、実施形態１の声質変換装置１００による声質変換処理１５０を実行し、各々の言語に対して、目標の声質Ｚを有する多言語音声コーパス２０６０（図９では、音声コーパス２０６０−１，音声コーパス２０６０−２，…と表記）を生成する。 First, in the voice feature learning system, the voice quality conversion process 150 by the voice quality conversion device 100 of Embodiment 1 is applied to the multilingual voice corpus 2010, whose languages and voice qualities differ (shown in FIG. 9 as voice corpus 2010-1, voice corpus 2010-2, ...), to generate, for each language, a multilingual voice corpus 2060 having the target voice quality Z (shown in FIG. 9 as voice corpus 2060-1, voice corpus 2060-2, ...).

一方、音声特徴量学習システムでは、ＧｅｎｅｒａｔｏｒＴｒａｉｎｉｎｇＳｔａｇｅで、多言語音声コーパス２０１０から言語特徴量解析処理２００２により、言語特徴量２０２０を生成し、音声特徴量解析処理２００３により、収録音声の音声特徴量２０３０を生成する。 Meanwhile, in the Generator Training Stage, the voice feature learning system generates the language feature amount 2020 from the multilingual voice corpus 2010 by the language feature analysis process 2002, and generates the voice feature amount 2030 of the recorded voice by the voice feature analysis process 2003.

次に、その言語特徴量２０２０と収録音声の音声特徴量２０３０を入力してモデル学習処理２０００を行い、Ｇｅｎｅｒａｔｏｒ（生成ネットワーク）による処理２００１によって結果を出力する。Ｇｅｎｅｒａｔｏｒによる処理２００１では、ノイズを含んだデータによる合成音声特徴量２０４０を出力する。いわば、次に説明するＤｉｓｃｒｉｍｉｎａｔｏｒ（識別ネットワーク）をだますようなデータを生成する。 Next, the language feature amount 2020 and the voice feature amount 2030 of the recorded voice are input to perform the model learning process 2000, and the result is output by the process 2001 by the Generator (generation network). In the processing 2001 by the Generator, the synthetic speech feature amount 2040 based on the data including noise is output. So to speak, it generates data that deceives the Discriminator (identification network) described below.

また、ＤｉｓｃｒｉｍｉｎａｔｏｒＴｒａｉｎｉｎｇＳｔａｇｅでは、目標の声質Ｚを有する多言語音声コーパス２０６０から、言語特徴量解析処理２００６により、変換後音声の音声特徴量２０５０を生成する。そして、Ｇｅｎｅｒａｔｏｒ２００１が生成した合成音声の特徴量２０４０と変換後音声の音声特徴量２０５０を入力してモデル学習処理２００５を実行し、その結果をＤｉｓｃｒｉｍｉｎａｔｏｒに出力する。Ｄｉｓｃｒｉｍｉｎａｔｏｒによる処理２００４では、その真偽を識別し、真偽を判定するラベルを生成して、ＧｅｎｅｒａｔｏｒＴｒａｉｎｉｎｇＳｔａｇｅのモデル学習処理２０００にフィードバックする。これによって、Ｇｅｎｅｒａｔｏｒによる処理２００１によって、自然でかつ声質Ｚの話者の声質に近い音声特徴量を生成することができる。 Further, in the Discriminator Training Stage, the voice feature amount 2050 of the converted voice is generated from the multilingual voice corpus 2060 having the target voice quality Z by the language feature amount analysis process 2006. Then, the feature amount 2040 of the synthetic voice generated by the Generator 2001 and the voice feature amount 2050 of the converted voice are input to execute the model learning process 2005, and the result is output to the Discriminator. In the process 2004 by the Discriminator, the authenticity is identified, a label for determining the authenticity is generated, and the label is fed back to the model learning process 2000 of the Generator Training Stage. Thereby, the processing 2001 by the Generator can generate a voice feature amount that is natural and close to the voice quality of the speaker of voice quality Z.

すなわち、このＧｅｎｅｒａｔｏｒＴｒａｉｎｉｎｇＳｔａｇｅと、ＤｉｓｃｒｉｍｉｎａｔｏｒＴｒａｉｎｉｎｇＳｔａｇｅを反復し、お互いの学習処理をループ処理させることにより、高い音質を維持したターゲットの声質Ｚの合成音声の音声特徴量を得ることができ、その音声特徴量を用いて複数の言語をサポートするターゲット話者の多言語音声合成システムを構築可能となる。 That is, by repeating the Generator Training Stage and the Discriminator Training Stage so that the two learning processes run in a loop, voice features of synthetic voice with the target voice quality Z can be obtained while maintaining high sound quality, and those voice features can be used to construct a multilingual speech synthesis system for the target speaker that supports a plurality of languages.
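The alternation between the two stages can be illustrated with a deliberately tiny GAN on scalar "features". Everything in this sketch is an assumption made for illustration: the linear generator, the logistic discriminator, the hand-derived gradients, the non-saturating generator objective, and the name `train_toy_gan` are not taken from the patent, whose models operate on real voice and language features.

```python
import math
import random
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_toy_gan(real_samples: List[float], steps: int = 200,
                  lr: float = 0.05, seed: int = 0) -> Tuple[Tuple[float, float],
                                                            Tuple[float, float]]:
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator D(y) = sigmoid(w*y + c)
    for _ in range(steps):
        # Generator stage: ascend log D(G(z)), using the discriminator's judgement as feedback.
        z = rng.gauss(0.0, 1.0)
        d_fake = sigmoid(w * (a * z + b) + c)
        a += lr * (1.0 - d_fake) * w * z
        b += lr * (1.0 - d_fake) * w
        # Discriminator stage: ascend log D(real) + log(1 - D(fake)).
        real = rng.choice(real_samples)
        fake = a * rng.gauss(0.0, 1.0) + b
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        w += lr * ((1.0 - d_real) * real - d_fake * fake)
        c += lr * ((1.0 - d_real) - d_fake)
    return (a, b), (w, c)


# "Real" samples play the role of features from the converted (target voice quality) corpus.
generator, discriminator = train_toy_gan([2.8, 3.0, 3.2])
print(generator, discriminator)
```

The point of the sketch is only the loop structure: each iteration updates the generator against the current discriminator and then updates the discriminator against the current generator.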

本実施形態においては、声質変換装置１００に用いるＰＰＧは、多言語音声コーパス２０１０に含まれる全ての言語のＰＰＧをマージするものを利用することが望ましい。例えば、日中英の３言語バイリンガル音声合成システムを構築する場合は、図８に示したＰＰＧ抽出部において、日本語ＰＰＧ、中国語ＰＰＧ、英語ＰＰＧを抽出することが望ましい。 In the present embodiment, the PPG used by the voice quality conversion device 100 is preferably one that merges the PPGs of all the languages included in the multilingual voice corpus 2010. For example, when constructing a speech synthesis system for the three languages Japanese, Chinese, and English, it is desirable that the PPG extraction unit shown in FIG. 8 extract a Japanese PPG, a Chinese PPG, and an English PPG.

なお、本実施形態の説明には、入力とする多言語音声コーパス２０１０には複数言語と複数の声質を含まれているものとして説明したが、単一言語単一声質でもよい。その場合には、音声合成の声質カスタマイズの効果を得ることができる。 In the description of the present embodiment, it is assumed that the input multilingual voice corpus 2010 includes a plurality of languages and a plurality of voice qualities, but a single language single voice quality may be used. In that case, the effect of voice quality customization of voice synthesis can be obtained.

〔実施形態３〕 [Embodiment 3]
以下、本発明に係る実施形態３を、図１０を用いて説明する。 Hereinafter, Embodiment 3 according to the present invention will be described with reference to FIG. 10.
本実施形態では、実施形態１の声質変換装置１００を用いた応用の一つとして、マルチリンガル音声合成システムを生成する処理について説明する。 In the present embodiment, a process of generating a multilingual speech synthesis system is described as one application of the voice quality conversion device 100 of Embodiment 1.
先ず、言語が異なり、声質も異なる多言語音声コーパス３０１０（図１０では、音声コーパス３０１０−１，音声コーパス３０１０−２，…と表記）から、実施形態１の声質変換装置１００による声質変換処理１５０を実行し、各々の言語に対して、目標の声質Ｚを有する多言語音声コーパス３０２０（図９では、音声コーパス３０２０−１，音声コーパス３０２０−２，…と表記）を生成する。 First, the voice quality conversion process 150 by the voice quality conversion device 100 of Embodiment 1 is applied to the multilingual voice corpus 3010, whose languages and voice qualities differ (shown in FIG. 10 as voice corpus 3010-1, voice corpus 3010-2, ...), to generate, for each language, a multilingual voice corpus 3020 having the target voice quality Z (shown in FIG. 9 as voice corpus 3020-1, voice corpus 3020-2, ...).

次に、多言語音声コーパス３０２０より、言語特徴量解析処理３００１により、言語特徴量３０３０を生成し、音声特徴量解析処理３００２により、収録音声の音声特徴量３０４０を生成する。 Next, from the multilingual voice corpus 3020, the language feature amount 3030 is generated by the language feature analysis process 3001, and the voice feature amount 3040 of the recorded voice is generated by the voice feature analysis process 3002.
そして、それらを入力とする合成システム構築処理３０００により、マルチリンガル音声合成システム３０５０を構築する。 Then, the multilingual speech synthesis system 3050 is constructed by the synthesis system construction process 3000, which takes them as input.

この音声合成システムを構築処理では、合成システム構築処理３０００の合成手法によらず、音声合成システムを構築することができる。 In the construction process of this speech synthesis system, the speech synthesis system can be constructed regardless of the synthesis method of the synthesis system construction process 3000.

〔実施形態４〕 [Embodiment 4]
以下、本発明に係る実施形態４を、図１１を用いて説明する。 Hereinafter, Embodiment 4 according to the present invention will be described with reference to FIG. 11.
図１１は、声質変換により変換された音声コーパスを用いて、入力テキストに対する音声合成をするシステムを構築する処理の流れとデータフローを示す図である。 FIG. 11 is a diagram showing the process flow and data flow for constructing a system that synthesizes speech for input text using a voice corpus converted by voice quality conversion.
本実施形態の処理は、音声合成システム構築処理と音声合成処理の二段階よりなる。 The process of the present embodiment consists of two stages: a speech synthesis system construction process and a speech synthesis process.

先ず、音声合成システム構築処理では、言語が異なり、声質も異なる多言語音声コーパス４０１０（図１０では、音声コーパス４０１０−１，音声コーパス４０１０−２，…と表記）から、言語特徴量解析処理４００１により、言語特徴量４０２０を生成し、音声特徴量解析処理４００２により、収録音声の音声特徴量４０３０を生成する。 First, in the speech synthesis system construction process, from the multilingual voice corpus 4010, whose languages and voice qualities differ (shown in FIG. 10 as voice corpus 4010-1, voice corpus 4010-2, ...), the language feature amount 4020 is generated by the language feature analysis process 4001, and the voice feature amount 4030 of the recorded voice is generated by the voice feature analysis process 4002.
そして、それらを入力とする合成システム構築処理４０００により、マルチリンガル音声合成システム５０００を構築する。 Then, the multilingual speech synthesis system 5000 is constructed by the synthesis system construction process 4000, which takes them as input.

次に、音声合成処理では、音声合成システム５０００に入力テキスト５０１０を入力して、合成音声５０２０を生成し、それを入力して、実施形態１の声質変換装置１００の声質変換処理１５０により、目標とする声質Ｚの合成音声５０３０を出力する。 Next, in the speech synthesis process, the input text 5010 is input to the speech synthesis system 5000 to generate the synthetic voice 5020, which is then input to the voice quality conversion process 150 of the voice quality conversion device 100 of Embodiment 1 to output the synthetic voice 5030 with the target voice quality Z.

本実施形態では、言語が異なり、声質も異なる多言語音声コーパス４０１０から、入力テキストに対応した統一した声質Ｚの声質の合成音声を出力することができる。 In the present embodiment, it is possible to output a synthetic voice of voice quality Z having a unified voice quality corresponding to the input text from the multilingual voice corpus 4010 having different languages and voice qualities.

Claims

A voice quality conversion system that uses an information processing device to output synthetic voice obtained by converting the voice quality of input voice, the system comprising:
a voice quality conversion data creation device; and
a voice quality conversion device,
wherein the voice quality conversion data creation device comprises:
a PPG (phoneme posterior probability) conversion model learning unit that takes as input a word dictionary and a voice corpus in which text is associated with voice information, and generates a PPG conversion model; and
a voice parameter generation model learning unit that takes as input the voice corpus and the PPG conversion model, and generates a voice parameter generation model,
wherein the PPG conversion model learning unit:
generates a dictionary with prosody information from the word dictionary;
morphologically analyzes the text included in the voice corpus, and generates a phoneme sequence with prosody information based on the result of the morphological analysis and the dictionary with prosody information;
generates an acoustic model from the phoneme sequence with prosody information and the voice feature amount output as a result of feature amount analysis of the voice information included in the voice corpus; and
trains the acoustic model to generate the PPG conversion model,
wherein the voice parameter generation model learning unit:
generates, from the PPG conversion model and the voice corpus, a PPG corresponding to the PPG conversion model;
extracts voice parameters from the voice information included in the voice corpus; and
generates the voice parameter generation model by learning from the generated PPG and the voice parameters, and
wherein the voice quality conversion device:
takes as input the input voice, the PPG conversion model generated by the PPG conversion model learning unit, and the voice parameter generation model generated by the voice parameter generation model learning unit;
generates a PPG based on the PPG conversion model and the voice feature amount output as a result of feature amount analysis of the voice information included in the voice corpus;
generates voice parameters based on the generated PPG and the voice parameter generation model; and
generates a voice waveform based on the voice parameters and outputs it as output voice.

The voice quality conversion system according to claim 1, wherein there are a plurality of PPG conversion models generated by the PPG conversion model learning unit for each language.

The voice quality conversion system according to claim 2, wherein a prosodic symbol is added based on the characteristics of each language in generating the phoneme sequence with prosody information in the PPG conversion model learning unit.

The voice quality conversion system according to claim 1, wherein the PPG conversion model learning unit morphologically analyzes the text of the voice corpus, performs language model learning to generate a language model, and
generates the PPG conversion model by learning from the acoustic model and the language model.

A voice quality conversion method performed by a voice quality conversion system that uses an information processing device to output synthetic voice obtained by converting the voice quality of input voice,
the voice quality conversion system comprising a voice quality conversion data creation device and a voice quality conversion device,
the method comprising:
a PPG (phoneme posterior probability) conversion model learning step in which the voice quality conversion data creation device takes as input a word dictionary and a voice corpus in which text is associated with voice information, and generates a PPG conversion model; and
a voice parameter generation model learning step in which the voice quality conversion data creation device takes as input the voice corpus and the PPG conversion model, and generates a voice parameter generation model,
wherein the PPG conversion model learning step includes:
a step in which the voice quality conversion data creation device generates a dictionary with prosody information from the word dictionary;
a step in which the voice quality conversion data creation device morphologically analyzes the text included in the voice corpus and generates a phoneme sequence with prosody information based on the result of the morphological analysis and the dictionary with prosody information;
a step in which the voice quality conversion data creation device generates an acoustic model from the phoneme sequence with prosody information and the voice feature amount output as a result of feature amount analysis of the voice information included in the voice corpus; and
a step in which the voice quality conversion data creation device trains the acoustic model and generates the PPG conversion model,
wherein the voice parameter generation model learning step includes:
a step in which the voice quality conversion data creation device generates, from the PPG conversion model and the voice corpus, a PPG corresponding to the PPG conversion model;
a step in which the voice quality conversion data creation device extracts voice parameters from the voice information included in the voice corpus; and
a step in which the voice quality conversion data creation device learns from the generated PPG and the voice parameters to generate the voice parameter generation model, and
wherein the method further comprises:
a step in which the voice quality conversion device takes as input the input voice, the PPG conversion model generated in the PPG conversion model learning step, and the voice parameter generation model generated in the voice parameter generation model learning step;
a step in which the voice quality conversion device generates a PPG based on the PPG conversion model and the voice feature amount output as a result of feature amount analysis of the voice information included in the voice corpus;
a step in which the voice quality conversion device generates voice parameters based on the generated PPG and the voice parameter generation model; and
a step in which the voice quality conversion device generates a voice waveform based on the voice parameters and outputs it as output voice.

The voice quality conversion method according to claim 5, wherein there are a plurality of PPG conversion models generated by the PPG conversion model learning step for each language.

The voice quality conversion method according to claim 6, wherein a prosodic symbol is added based on the characteristics of each language in generating the phoneme sequence with prosody information in the PPG conversion model learning step.

The voice quality conversion method according to claim 5, wherein the PPG conversion model learning step includes a step of morphologically analyzing the text of the voice corpus, a step of performing language model learning based on the result of the morphological analysis to generate a language model, and
a step of generating the PPG conversion model by learning from the acoustic model and the language model.