JP2023171108A

JP2023171108A - Voice conversion device, voice conversion method and program

Info

Publication number: JP2023171108A
Application number: JP2022083351A
Authority: JP
Inventors: 勇祐井島; Yusuke Ijima; 大輔齋藤; Daisuke Saito
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2023-12-01

Abstract

PROBLEM TO BE SOLVED: To convert only vocalization skill while preserving speaker characteristics by changing only the dynamic feature amount of a voice feature amount. SOLUTION: A voice conversion device (1) includes: a model learning unit (11) for learning a voice conversion model which converts a voice feature amount of a conversion source speaker into a voice feature amount of a conversion object speaker; a voice conversion unit (12) for inputting the voice feature amount of the conversion source speaker to the trained voice conversion model and converting it into the voice feature amount of the conversion object speaker; a dynamic feature amount conversion unit (13) for converting the voice feature amount of the conversion source speaker into a converted voice feature amount by using the dynamic feature amount of the voice feature amount of the conversion source speaker and the dynamic feature amount of the voice feature amount of the conversion object speaker; and a voice waveform generation unit (14) for generating a voice waveform from the converted voice feature amount. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a voice conversion device, a voice conversion method, and a program that convert the vocalization skill of an input speaker's voice.

Vocalization skill differs greatly between vocalization experts, such as announcers and voice actors, and amateurs. In the present disclosure, vocalization skill refers to an index of how easy a speaker's uttered voice is to listen to. For example, in-station and in-building public address announcements uttered by amateurs can be hard to hear, so a technique is needed that converts only the vocalization skill without changing the speaker characteristics of the uttered voice. In the present disclosure, speaker characteristics refer collectively to acoustic features, typified by the spectrum contained in the voice, and prosodic features, typified by pitch and speech rhythm.

FIG. 6 is a block diagram showing a configuration example of a conventional voice conversion device. Conventionally, voice (voice quality) conversion is a technique for converting the voice feature amount of an input conversion source speaker into the voice feature amount of a desired conversion target speaker. As shown in FIG. 6, this conversion is performed by inputting the voice feature amount of the conversion source speaker into a voice conversion model trained with a voice conversion algorithm. For example, Non-Patent Document 1 describes a voice conversion algorithm that converts voice between any two speakers using vector quantization, and Non-Patent Document 2 describes one that does so using an artificial neural network (ANN). The algorithms disclosed in Non-Patent Documents 1 and 2 require the voices of the two speakers to be parallel data (voices in which the two speakers utter the same utterances). In contrast, Non-Patent Document 3 describes a voice conversion algorithm using a VAE (variational autoencoder), which can use voices of the two speakers that are not parallel data.

Next, as shown in FIG. 6, a voice waveform is generated from the voice feature amount of the conversion target speaker using a speech synthesis algorithm. Non-Patent Document 4 describes a speech synthesis algorithm using a Mel-Log Spectrum Approximation (MLSA) filter.

Further, as algorithms that can be used in the present disclosure, Non-Patent Document 5 describes a parameter generation algorithm using dynamic features, and Non-Patent Document 6 describes a multiple-regression Gaussian mixture model.

Non-Patent Document 1: Abe, Masanobu, et al. "Voice conversion through vector quantization." Journal of the Acoustical Society of Japan (E) 11.2 (1990): 71-76.
Non-Patent Document 2: Desai, Srinivas, et al. "Spectral mapping using artificial neural networks for voice conversion." IEEE Transactions on Audio, Speech, and Language Processing 18.5 (2010): 954-964.
Non-Patent Document 3: Hsu, Chin-Cheng, et al. "Voice conversion from non-parallel corpora using variational auto-encoder." Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016 Asia-Pacific. IEEE, 2016.
Non-Patent Document 4: Satoshi Imai and two others, "Mel log spectrum approximation (MLSA) filter for speech synthesis," IEICE Transactions A, Vol. J66-A, No. 2, pp. 122-129, Feb. 1983.
Non-Patent Document 5: Takashi Masuko and three others, "HMM-based speech synthesis using dynamic features," IEICE Transactions, vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.
Non-Patent Document 6: Kumi Ota, "Voice quality conversion and control method based on a multiple-regression Gaussian mixture model," Master's thesis, Nara Institute of Science and Technology, 2008.

However, a conventional voice conversion device using the voice conversion algorithm disclosed in Non-Patent Document 1 or Non-Patent Document 2 converts even the speaker characteristics of the conversion source speaker into those of the conversion target speaker. It therefore cannot convert only the vocalization skill while preserving the speaker characteristics of the conversion source speaker.

Therefore, the present disclosure regards vocalization skill as the clarity of articulation, and focuses on a technique that converts only the vocalization skill by converting only the temporal variation of the voice feature amount of an amateur speaker's voice into that of an expert.

An object of the present disclosure, made in view of these circumstances, is to provide a voice conversion device, a voice conversion method, and a program that convert only the vocalization skill while preserving the speaker characteristics, by converting only the temporal variation (dynamic feature amount) of the voice feature amount.

In order to solve the above problem, a voice conversion device according to the present embodiment is a voice conversion device that converts the dynamic feature amount of a speaker's voice feature amount, and includes: a model learning unit that learns a voice conversion model for converting the voice feature amount of a conversion source speaker into the voice feature amount of a conversion target speaker; a voice conversion unit that inputs the voice feature amount of the conversion source speaker into the trained voice conversion model and converts it into the voice feature amount of the conversion target speaker; a dynamic feature amount conversion unit that converts the voice feature amount of the conversion source speaker into a post-conversion voice feature amount by using the dynamic feature amount of the voice feature amount of the conversion source speaker and the dynamic feature amount of the voice feature amount of the conversion target speaker; and a voice waveform generation unit that generates a voice waveform from the post-conversion voice feature amount.

In order to solve the above problem, a voice conversion method according to the present embodiment is a voice conversion method for converting the dynamic feature amount of a speaker's voice feature amount, and includes the following steps performed by a voice conversion device: learning a voice conversion model that converts the voice feature amount of a conversion source speaker into the voice feature amount of a conversion target speaker; inputting the voice feature amount of the conversion source speaker into the trained voice conversion model and converting it into the voice feature amount of the conversion target speaker; receiving the dynamic feature amount of the voice feature amount of the conversion source speaker and the dynamic feature amount of the voice feature amount of the conversion target speaker, and converting the voice feature amount of the conversion source speaker into a post-conversion voice feature amount; and generating a voice waveform from the post-conversion voice feature amount.

In order to solve the above problem, a program according to the present embodiment causes a computer to function as the above voice conversion device.

According to the present disclosure, converting only the temporal variation (dynamic feature amount) of the voice feature amount makes it possible to convert only the vocalization skill while preserving the speaker characteristics.

FIG. 1 is a block diagram showing a configuration example of a voice conversion device according to a first embodiment.
FIG. 2 is a flowchart showing an example of a voice conversion method executed by the voice conversion device according to the first embodiment.
FIG. 3 is a block diagram showing a configuration example of a voice conversion device according to a second embodiment.
FIG. 4 is a flowchart showing an example of a voice conversion method executed by the voice conversion device according to the second embodiment.
FIG. 5 is a block diagram showing a schematic configuration of a computer functioning as a voice conversion device.
FIG. 6 is a block diagram showing a configuration example of a conventional voice conversion device.

Hereinafter, embodiments for carrying out the present invention are described in detail with reference to the drawings. The present invention is not limited to the following embodiments and can be implemented with various modifications within the scope of its gist.

(First embodiment)
FIG. 1 is a block diagram showing a configuration example of a voice conversion device 1 according to the first embodiment. As shown in FIG. 1, the voice conversion device 1 according to the first embodiment includes a model learning unit 11, a voice conversion unit 12, a dynamic feature amount conversion unit 13, and a voice waveform generation unit 14. The voice conversion device 1 converts the dynamic feature amount of a speaker's voice feature amount. The model learning unit 11, the voice conversion unit 12, the dynamic feature amount conversion unit 13, and the voice waveform generation unit 14 constitute a control arithmetic circuit (controller). The control arithmetic circuit may be configured with dedicated hardware such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array), with a processor, or with both.

The model learning unit 11 learns a voice conversion model that converts the voice feature amount of the conversion source speaker, stored in advance in the voice storage unit 15, into the voice feature amount of the conversion target speaker. As the learning algorithm, the model learning unit 11 may use the voice conversion algorithm using vector quantization described in Non-Patent Document 1, the voice conversion algorithm using an artificial neural network (ANN) described in Non-Patent Document 2, or the voice conversion algorithm using a VAE (variational autoencoder) described in Non-Patent Document 3.

The voice handled by the model learning unit 11 is held in the voice storage unit 15 as voice feature amounts (pitch parameters such as the fundamental frequency, and spectral parameters such as the cepstrum and mel-cepstrum) obtained by applying a Fourier transform, signal processing, and the like to the voice signal. In the present disclosure, the voice feature amount obtained in this way (generally also called a static feature amount) is assumed to include not only the static feature amount but also a dynamic feature amount that captures, at each time, the temporal variation from one frame (voice frame) before to one frame after. When the voice conversion algorithm described in Non-Patent Document 1 or Non-Patent Document 2 is used, the voices must be parallel data (voices in which two speakers utter the same utterances). In addition, the voices of the two speakers must be time-aligned in advance by DP matching (DTW; Dynamic Time Warping) or the like. In contrast, when the voice conversion algorithm described in Non-Patent Document 3 is used, the voices need not be parallel data, and no time alignment is required.
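The time alignment mentioned above can be illustrated with a minimal DTW sketch in plain Python. The function name `dtw_path`, the Euclidean frame distance, and the three-way step pattern are assumptions chosen for illustration; the source does not specify which DTW variant is used.

```python
import math


def dtw_path(src, tgt):
    """Align two sequences of feature vectors with basic DTW.

    src, tgt: lists of frames, each frame a list of floats.
    Returns a list of (src_index, tgt_index) pairs from start to end.
    """
    n, m = len(src), len(tgt)

    def dist(a, b):
        # Euclidean distance between two frames (an assumed local cost).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(src[i - 1], tgt[j - 1]) + best

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves, key=lambda move: move[0])
    path.reverse()
    return path
```

Each returned pair links a source frame to a target frame, so parallel utterances of different lengths can be turned into frame-level training pairs.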

The voice conversion unit 12 inputs the voice feature amount 21 of the conversion source speaker into the trained voice conversion model 11a generated by the model learning unit 11, and converts it into the voice feature amount 22 of the conversion target speaker.

The dynamic feature amount conversion unit 13 converts the voice feature amount 21 of the conversion source speaker into the post-conversion voice feature amount 23 by using the dynamic feature amount of the voice feature amount 21 of the conversion source speaker and the dynamic feature amount of the voice feature amount 22 of the conversion target speaker.
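As a minimal sketch of what a dynamic feature amount can look like, the following computes a per-frame delta from the frame before to the frame after. The function name `delta_features`, the window coefficients (-0.5, 0, 0.5), and the edge handling (repeating the boundary frame) are assumptions; the source only states that the dynamic feature captures the variation from one frame before to one frame after.

```python
def delta_features(static):
    """Compute delta (dynamic) features from static features.

    static: list of frames, each frame a list of floats.
    delta[t] = 0.5 * (static[t + 1] - static[t - 1]); at the edges the
    boundary frame is repeated (an assumed convention).
    """
    num_frames = len(static)
    deltas = []
    for t in range(num_frames):
        prev_frame = static[max(t - 1, 0)]
        next_frame = static[min(t + 1, num_frames - 1)]
        deltas.append([0.5 * (b - a) for a, b in zip(prev_frame, next_frame)])
    return deltas
```

The same function would be applied to both the voice feature amount 21 and the voice feature amount 22 so that their dynamic feature amounts are comparable frame by frame.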

The dynamic feature amount conversion unit 13 may generate the post-conversion dynamic feature amount by replacing the dynamic feature amount of the voice feature amount 21 of the conversion source speaker with the dynamic feature amount of the voice feature amount 22 of the conversion target speaker, that is, by treating the latter as the dynamic feature amount of the voice feature amount 21. Alternatively, it may generate the post-conversion dynamic feature amount by computing, for each voice frame, a weighted sum of the two dynamic feature amounts. In the latter case, the weights specify how strongly the conversion emphasizes the vocalization skill of the conversion target speaker. Thereafter, by means of, for example, the parameter generation algorithm using dynamic features described in Non-Patent Document 5, the voice feature amount 21 of the conversion source speaker is converted, using the post-conversion dynamic feature amount, into the post-conversion voice feature amount 23, which reflects the dynamic feature amount of the voice feature amount 22 of the conversion target speaker.
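The weighted sum and the subsequent regeneration of static features can be sketched as follows for a single feature dimension. This is a simplified least-squares stand-in, not the algorithm of Non-Patent Document 5: it assumes unit variances, a single trade-off weight `lam`, and the delta window (-0.5, 0, 0.5). All function names are hypothetical.

```python
def blend_deltas(src_delta, tgt_delta, weight):
    """Per-frame weighted sum; weight=1.0 fully adopts the target dynamics."""
    return [(1.0 - weight) * s + weight * g for s, g in zip(src_delta, tgt_delta)]


def delta_matrix(num_frames):
    """Matrix D such that D @ y gives the deltas of a 1-D trajectory y."""
    d = [[0.0] * num_frames for _ in range(num_frames)]
    for t in range(num_frames):
        d[t][max(t - 1, 0)] -= 0.5
        d[t][min(t + 1, num_frames - 1)] += 0.5
    return d


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def generate_static(src_static, new_delta, lam=1.0):
    """Find y minimizing |y - src_static|^2 + lam * |D y - new_delta|^2.

    Solves the normal equations (I + lam * D^T D) y = c + lam * D^T d.
    """
    n = len(src_static)
    d = delta_matrix(n)
    a = [
        [
            (1.0 if i == j else 0.0)
            + lam * sum(d[t][i] * d[t][j] for t in range(n))
            for j in range(n)
        ]
        for i in range(n)
    ]
    b = [
        src_static[i] + lam * sum(d[t][i] * new_delta[t] for t in range(n))
        for i in range(n)
    ]
    return solve(a, b)
```

A larger `lam` pulls the trajectory toward the target dynamics, while `lam=0.0` returns the source static features unchanged; in practice each feature dimension would be processed independently.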

The voice waveform generation unit 14 generates a voice waveform 24 from the post-conversion voice feature amount 23. The voice waveform generation unit 14 may generate the voice waveform 24 with a speech synthesis algorithm using, for example, the Mel-Log Spectrum Approximation (MLSA) filter described in Non-Patent Document 4.

The voice storage unit 15 records (holds), as voice feature amounts, the voices uttered by the two speakers subject to conversion, and outputs the voice feature amounts to the model learning unit 11 in response to a request from the model learning unit 11.

FIG. 2 is a flowchart showing an example of a voice conversion method executed by the voice conversion device 1 according to the first embodiment.

In step S101, the model learning unit 11 learns the voice conversion model 11a, which converts the voice feature amount 21 of the conversion source speaker into the voice feature amount 22 of the conversion target speaker.

In step S102, the voice conversion unit 12 inputs the voice feature amount 21 of the conversion source speaker into the trained voice conversion model 11a and converts it into the voice feature amount 22 of the conversion target speaker.

In step S103, the dynamic feature amount conversion unit 13 converts the voice feature amount 21 of the conversion source speaker into the post-conversion voice feature amount 23 by using the dynamic feature amount of the voice feature amount 21 of the conversion source speaker and the dynamic feature amount of the voice feature amount 22 of the conversion target speaker.

In step S104, the voice waveform generation unit 14 generates the voice waveform 24 from the post-conversion voice feature amount 23.
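The data flow of steps S102 to S104 can be summarized in a short sketch. Every callable passed in is a hypothetical placeholder (the source does not define a programming interface); the point is only that step S103 starts from the source speaker's own features and borrows just the dynamics of the converted features.

```python
def convert_vocal_skill(src_features, conversion_model, extract_delta,
                        merge_dynamic, synthesize):
    """Run S102-S104 with caller-supplied (hypothetical) components."""
    # S102: full conversion toward the target speaker (used only for dynamics).
    tgt_features = conversion_model(src_features)
    # S103: take dynamics from both, then rebuild features from the source.
    src_delta = extract_delta(src_features)
    tgt_delta = extract_delta(tgt_features)
    converted = merge_dynamic(src_features, src_delta, tgt_delta)
    # S104: waveform generation from the post-conversion features.
    return synthesize(converted)
```

Training (step S101) is assumed to have produced `conversion_model` beforehand.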

Unlike the conventional techniques described in Non-Patent Documents 1 to 3, which convert the entire voice feature amount, the voice conversion device 1 according to the present embodiment converts only the dynamic feature amount (temporal variation) of the voice feature amount. The voice conversion device 1 can therefore convert only the vocalization skill, such as the clarity of articulation, without changing the speaker characteristics of the voice. Furthermore, by taking a vocalization amateur as the conversion source speaker and a vocalization expert such as an announcer or voice actor as the conversion target speaker, the vocalization skill of the amateur can be brought closer to that of the expert.

(Second embodiment)
FIG. 3 is a block diagram showing a configuration example of a voice conversion device 1' according to the second embodiment. As shown in FIG. 3, the voice conversion device 1' according to the second embodiment includes a model learning unit 11', a voice conversion unit 12', a dynamic feature amount conversion unit 13, and a voice waveform generation unit 14. The voice conversion device 1' converts the dynamic feature amount of a speaker's voice feature amount. Compared with the voice conversion device 1 according to the first embodiment, the voice conversion device 1' differs in the functions of the model learning unit 11' and the voice conversion unit 12', while the functions of the dynamic feature amount conversion unit 13 and the voice waveform generation unit 14 are the same. Components identical to those of the first embodiment are given the same reference numbers, and their description is omitted as appropriate. The model learning unit 11', the voice conversion unit 12', the dynamic feature amount conversion unit 13, and the voice waveform generation unit 14 constitute a control arithmetic circuit (controller). The control arithmetic circuit may be configured with dedicated hardware such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array), with a processor, or with both.

The model learning unit 11' receives the voice feature amounts of a plurality of speakers and the vocalization skill assigned to each speaker, and learns a plurality of voice conversion models, each of which converts the voice feature amount of one speaker arbitrarily designated as the conversion source speaker into the voice feature amount of one of the other speakers, designated as conversion target speakers. Of these models, the model learning unit 11' retains the single voice conversion model 11a' that converts the voice feature amount 21 of the conversion source speaker into the voice feature amount 22 of the one conversion target speaker whose vocalization skill matches an arbitrarily set target vocalization skill 25. For example, when the voice feature amounts of voices uttered by ten speakers and the vocalization skill assigned to each of the ten speakers are input, the model learning unit 11' learns nine voice conversion models that convert the voice feature amount of one arbitrarily designated conversion source speaker into the voice feature amounts of the other nine conversion target speakers. It then retains only the one voice conversion model 11a' that converts the voice feature amount 21 of that conversion source speaker into the voice feature amount 22 of the one conversion target speaker, among the nine, whose vocalization skill matches the separately set target vocalization skill 25. The learning algorithm may use the multiple-regression Gaussian mixture model described in Non-Patent Document 6. Non-Patent Document 6 proposes, as an extension of conventional voice conversion, a technique for converting to an arbitrary voice quality (for example, from a deep voice to a thin voice). In the present embodiment, training with vocalization skill assigned in place of voice quality enables conversion to an arbitrary vocalization skill.
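The choice of which model to retain can be sketched as a lookup over the skill scores. The source says only that the retained target speaker's skill "matches" the target vocalization skill 25; interpreting that as the nearest score is an assumption, and the function name and dictionary layout are hypothetical.

```python
def select_target_speaker(skill_by_speaker, source_id, target_skill):
    """Pick the non-source speaker whose skill score is nearest the target.

    skill_by_speaker: dict mapping speaker id -> numeric skill score.
    Ties are resolved in favor of the earlier dictionary entry.
    """
    candidates = {
        speaker: skill
        for speaker, skill in skill_by_speaker.items()
        if speaker != source_id
    }
    return min(candidates, key=lambda s: abs(candidates[s] - target_skill))
```

With ten speakers this examines the nine candidates other than the source, mirroring the example in the text; the model trained for the selected speaker would be the one kept as 11a'.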

The voice conversion unit 12' inputs the voice feature amount 21 of the conversion source speaker and the target vocalization skill 25 into the voice conversion model 11a' trained by the model learning unit 11', and converts the voice feature amount 21 of the conversion source speaker into the voice feature amount 22 of the conversion target speaker whose vocalization skill matches the target vocalization skill 25.

The dynamic feature amount conversion unit 13 and the voice waveform generation unit 14 of the voice conversion device 1' are identical to those of the voice conversion device 1 according to the first embodiment. The dynamic feature amount conversion unit 13 converts the voice feature amount 21 of the conversion source speaker into the post-conversion voice feature amount 23 by using the dynamic feature amount of the voice feature amount 21 of the conversion source speaker and the dynamic feature amount of the voice feature amount 22 of the conversion target speaker. The voice waveform generation unit 14 generates the voice waveform 24 from the post-conversion voice feature amount 23.

Whereas the voice storage unit 15 according to the first embodiment records the voice feature amounts of voices uttered by two speakers, the voice storage unit 15' records the voice feature amounts of voices uttered by a larger number of speakers (for example, ten) together with the vocalization skill assigned to each speaker. The vocalization skill is preferably a numerical representation of a subjective score assigned to each speaker by evaluators through listening (for example, from 1: skill is extremely low, to 5: skill is extremely high). The voice storage unit 15' records (holds) these voice feature amounts and vocalization skills, and outputs them to the model learning unit 11' in response to a request from the model learning unit 11'.

図４は、第２の実施形態に係る音声変換装置１′が実行する音声変換方法の一例を示すフローチャートである。 FIG. 4 is a flowchart illustrating an example of a voice conversion method executed by the voice conversion device 1' according to the second embodiment.

ステップＳ２０１では、モデル学習部１１′が、変換元話者の音声特徴量２１を、他の複数の変換対象話者の音声特徴量２２へ変換する、複数の音声変換モデル１１ａ′を学習する。さらに、モデル学習部１１′は、変換元話者の音声特徴量２１を、目標発声スキル２５に合致した発声スキルを有する変換対象話者の音声特徴量２２へ変換する一つの音声変換モデル１１ａ′のみを保持する。 In step S201, the model learning unit 11' learns a plurality of speech conversion models 11a' that convert the speech feature amount 21 of the conversion source speaker into the speech feature amount 22 of a plurality of other conversion target speakers. Further, the model learning unit 11' creates a voice conversion model 11a' that converts the voice feature amount 21 of the conversion source speaker into the voice feature amount 22 of the conversion target speaker whose voice skill matches the target voice skill 25. only hold.

ステップＳ２０２では、音声変換部１２′が、学習済みの音声変換モデル１１ａ′へ変換元話者の音声特徴量２１と目標発声スキル２５とを入力して、変換元話者の音声特徴量２１を目標発声スキル２５に合致する発声スキルを有する変換対象話者の音声特徴量２２に変換する。 In step S202, the voice conversion unit 12' inputs the voice feature amount 21 of the conversion source speaker and the target vocalization skill 25 to the learned voice conversion model 11a', and converts the voice feature amount 21 of the conversion source speaker into the learned voice conversion model 11a'. It is converted into a voice feature amount 22 of a conversion target speaker whose voice skill matches the target voice skill 25.

In step S203, the dynamic feature amount conversion unit 13 converts the voice feature amount 21 of the conversion source speaker into the converted voice feature amount 23 by using the dynamic feature amount of the voice feature amount 21 of the conversion source speaker and the dynamic feature amount of the voice feature amount 22 of the conversion target speaker.
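One way to picture step S203 is the following sketch, offered under stated assumptions rather than as the method of the present disclosure. It assumes that the dynamic feature amount is a first-order difference between consecutive frames, that the source and target features are already frame-aligned, and that the converted static features are rebuilt by accumulating the blended differences from the first source frame. The function names `delta` and `convert_dynamic` are hypothetical. A weight of 1.0 corresponds to replacing the source dynamic feature with the target one, and intermediate weights correspond to a per-frame weighted sum.

```python
from typing import List, Sequence

Frames = List[List[float]]  # one inner list of feature values per voice frame


def delta(frames: Frames) -> Frames:
    """First-order difference between consecutive frames (an assumed dynamic feature)."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(frames, frames[1:])]


def convert_dynamic(source: Frames, target: Frames, weights: Sequence[float]) -> Frames:
    """Blend source and target dynamic features per frame and rebuild static features.

    weights[t] is the share of the target dynamic feature at transition t:
    0.0 keeps the source dynamics, 1.0 replaces them with the target dynamics.
    """
    if len(source) != len(target):
        raise ValueError("source and target must be frame-aligned (same length)")
    if not source:
        return []
    if len(weights) != len(source) - 1:
        raise ValueError("one weight is needed per frame transition")
    converted = [list(source[0])]  # start from the first source frame
    for w, d_src, d_tgt in zip(weights, delta(source), delta(target)):
        blended = [(1.0 - w) * s + w * t for s, t in zip(d_src, d_tgt)]
        converted.append([x + d for x, d in zip(converted[-1], blended)])
    return converted
```

Because the first frame is taken from the source and only the frame-to-frame changes are altered, the sketch changes how the features move over time while staying anchored to the source speaker's values.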

In step S204, the voice waveform generation unit 14 generates the voice waveform 24 from the converted voice feature amount 23.
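The data flow of steps S202 to S204 can be summarized in a short sketch. It assumes that learning (step S201) has already produced a trained model, and it treats the model, the dynamic feature conversion, and the waveform generator as caller-supplied functions. The name `run_conversion` and the three callable parameters are hypothetical and only show the order in which the feature amounts 21, 22, and 23 and the waveform 24 are produced.

```python
from typing import Callable, List

Frames = List[List[float]]  # one inner list of feature values per voice frame
Waveform = List[float]      # waveform samples


def run_conversion(
    source_features: Frames,                                # voice feature amount 21
    target_skill: float,                                    # target vocalization skill 25
    voice_model: Callable[[Frames, float], Frames],         # trained model 11a' (assumed given)
    dynamic_converter: Callable[[Frames, Frames], Frames],  # role of unit 13
    vocoder: Callable[[Frames], Waveform],                  # role of unit 14
) -> Waveform:
    # S202: obtain target-speaker features 22 for the requested skill.
    target_features = voice_model(source_features, target_skill)
    # S203: combine the dynamics of 21 and 22 into converted features 23.
    converted_features = dynamic_converter(source_features, target_features)
    # S204: generate the voice waveform 24 from the converted features.
    return vocoder(converted_features)
```

Passing the three stages as functions keeps the sketch independent of any particular model or waveform generation technique, neither of which is fixed by this passage.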

The voice conversion device 1 according to the first embodiment assumes that a speaker with high vocalization skill is a vocalization expert, and converts the vocalization skill of one speaker (a vocalization amateur) into that of the other speaker (a vocalization expert). In practice, however, some amateurs have high vocalization skill, and vocalization skill also differs from one expert to another. The voice conversion device 1' according to the present disclosure uses the voice feature amounts of a plurality of speakers and the vocalization skill assigned to each speaker, and can thereby convert the voice feature amount of an arbitrarily chosen conversion source speaker into the voice feature amount of a conversion target speaker whose vocalization skill matches an arbitrary target vocalization skill.

A computer capable of executing program instructions may also be used to function as the voice conversion devices 1 and 1' described above. FIG. 5 is a block diagram showing a schematic configuration of a computer that functions as a voice conversion device. The computer functioning as the voice conversion device 1 or 1' may be a general-purpose computer, a special-purpose computer, a workstation, a PC (Personal Computer), an electronic notepad, or the like. The program instructions may be program code, code segments, or the like for executing the necessary tasks.

As shown in FIG. 5, the computer 100 includes a processor 110; a ROM (Read Only Memory) 120, a RAM (Random Access Memory) 130, and a storage 140 as storage units; an input unit 150; an output unit 160; and a communication interface (I/F) 170. These components are communicably connected to one another via a bus 180.

The ROM 120 stores various programs and various data. The RAM 130 temporarily stores programs or data as a work area. The storage 140 is configured as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data. In the present disclosure, the program according to the present disclosure is stored in the ROM 120 or the storage 140.

Specifically, the processor 110 is a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), an SoC (System on a Chip), or the like, and may be composed of a plurality of processors of the same type or of different types. The processor 110 reads the program from the ROM 120 or the storage 140 and executes it using the RAM 130 as a work area, thereby controlling each of the above components and performing various kinds of arithmetic processing. At least part of this processing may be implemented in hardware.

The program may be recorded on a recording medium readable by the voice conversion devices 1 and 1'. Using such a recording medium, the program can be installed on the voice conversion devices 1 and 1'. The recording medium on which the program is recorded may be a non-transitory recording medium. The non-transitory recording medium is not particularly limited, and may be, for example, a CD-ROM, a DVD-ROM, or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.

Although the above embodiments have been described as representative examples, it will be apparent to those skilled in the art that many modifications and substitutions can be made within the spirit and scope of the present disclosure. Accordingly, the present invention should not be construed as being limited by the above embodiments, and various modifications and changes are possible without departing from the scope of the claims. For example, a plurality of the configuration blocks shown in the configuration diagrams of the embodiments may be combined into one, or a single configuration block may be divided.

1, 1' Voice conversion device
11, 11' Model learning unit
11a, 11a' Voice conversion model
12, 12' Voice conversion unit
13 Dynamic feature amount conversion unit
14 Voice waveform generation unit
15, 15' Voice storage unit
21 Voice feature amount of conversion source speaker
22 Voice feature amount of conversion target speaker
23 Converted voice feature amount
24 Voice waveform
25 Target vocalization skill
100 Computer
110 Processor
120 ROM
130 RAM
140 Storage
150 Input unit
160 Output unit
170 Communication interface (I/F)
180 Bus

Claims

A voice conversion device that converts a dynamic feature amount of a voice feature amount of a speaker, the voice conversion device comprising:
a model learning unit that learns a voice conversion model that converts a voice feature amount of a conversion source speaker into a voice feature amount of a conversion target speaker;
a voice conversion unit that inputs the voice feature amount of the conversion source speaker into the trained voice conversion model and converts it into the voice feature amount of the conversion target speaker;
a dynamic feature amount conversion unit that converts the voice feature amount of the conversion source speaker into a converted voice feature amount by using the dynamic feature amount of the voice feature amount of the conversion source speaker and the dynamic feature amount of the voice feature amount of the conversion target speaker; and
a voice waveform generation unit that generates a voice waveform from the converted voice feature amount.

The voice conversion device according to claim 1, wherein the dynamic feature amount conversion unit generates the converted voice feature amount by replacing the dynamic feature amount of the voice feature amount of the conversion source speaker with the dynamic feature amount of the voice feature amount of the conversion target speaker.

The voice conversion device according to claim 1, wherein the dynamic feature amount conversion unit generates a converted dynamic feature amount by calculating, for each voice frame, a weighted sum of the dynamic feature amount of the voice feature amount of the conversion source speaker and the dynamic feature amount of the voice feature amount of the conversion target speaker, and converts the voice feature amount of the conversion source speaker into the converted voice feature amount by using the converted dynamic feature amount.

The voice conversion device according to claim 1, wherein
the model learning unit receives as input the voice feature amounts of a plurality of speakers and the vocalization skill assigned to each speaker, and learns a plurality of voice conversion models, each of which converts the voice feature amount of one speaker arbitrarily determined as the conversion source speaker into the voice feature amount of one of a plurality of other speakers determined as conversion target speakers, and
the voice conversion unit inputs the voice feature amount of the conversion source speaker and a target vocalization skill into the trained voice conversion model, and converts the voice feature amount of the conversion source speaker into the voice feature amount of a conversion target speaker having a vocalization skill that matches the target vocalization skill.

A voice conversion method for converting a dynamic feature amount of a voice feature amount of a speaker, the method comprising, by a voice conversion device:
a step of learning a voice conversion model that converts a voice feature amount of a conversion source speaker into a voice feature amount of a conversion target speaker;
a step of inputting the voice feature amount of the conversion source speaker into the trained voice conversion model and converting it into the voice feature amount of the conversion target speaker;
a step of converting the voice feature amount of the conversion source speaker into a converted voice feature amount by using the dynamic feature amount of the voice feature amount of the conversion source speaker and the dynamic feature amount of the voice feature amount of the conversion target speaker; and
a step of generating a voice waveform from the converted voice feature amount.

A program for causing a computer to function as the voice conversion device according to any one of claims 1 to 4.