JP2023005191A

JP2023005191A - Voice processing learning program, voice processing learning device, voice processing learning method, voice processing program, voice processor and voice processing method

Info

Publication number: JP2023005191A
Application number: JP2021106955A
Authority: JP
Inventors: 圭阿久澤; Kei Akuzawa; 弘太郎大西; Kotaro Onishi; 啓介滝口; Keisuke Takiguchi; 浩輝豆谷; Koki Mametani; 紘一郎森; Koichiro Mori
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2023-01-18

Abstract

PROBLEM TO BE SOLVED: To appropriately convert speech uttered by an arbitrary speaker into the voice quality of speech uttered by a target speaker. SOLUTION: A device comprises: a speech analysis unit 20, which is an acoustic feature extractor that converts speech into an input acoustic feature; a speaker encoder 22 that converts a speaker label of the speech into a speaker feature; a speech encoder 24 including a variational autoencoder having two or more sampling layers that converts the acoustic feature and the speaker feature into a latent representation; and a speech decoder 26 including a variational autoencoder having two or more sampling layers that generates an acoustic feature using at least the latent representation and the speaker feature. The speech encoder 24, the speech decoder 26, and the speaker encoder 22 are trained so that the distance between the acoustic feature input to the speech encoder 24 and the acoustic feature generated by the speech decoder 26 becomes small. SELECTED DRAWING: Figure 2

Description

The present invention relates to a speech processing learning program, a speech processing learning device, a speech processing learning method, a speech processing program, a speech processing device, and a speech processing method.

Speech processing devices have been developed that convert speech uttered by an arbitrary speaker into speech having the voice quality of another speaker. For example, a technique has been disclosed that applies CycleGAN, an image conversion technique, to voice conversion (Non-Patent Document 1).

Non-Patent Document 1: Takuhiro Kaneko and Hirokazu Kameoka, Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks, arXiv:1711.11293, Nov. 2017 (EUSIPCO 2018), http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc/

A speech processing device that synthesizes and outputs the speech of another speaker from that of an original speaker is required to make the voice quality and phrasing of the synthesized speech as natural as possible. However, conventional learning methods for speech processing devices have in some cases been unable to make the synthesized speech sufficiently natural.

One aspect of the present invention is a speech processing learning program that causes a computer to function as a speech processing learning device comprising: an acoustic feature extractor that converts speech into an input acoustic feature; a speaker encoder that converts a speaker label of the speech into a speaker feature; a speech encoder including a variational autoencoder having two or more sampling layers that converts the input acoustic feature and the speaker feature into a latent representation; and a speech decoder including a variational autoencoder having two or more sampling layers that generates an acoustic feature using at least the latent representation and the speaker feature, wherein the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic feature input to the speech encoder and the output acoustic feature generated by the speech decoder.

Here, it is preferable that, in the speech decoder, the layers to which the speaker feature is input are limited among the two or more sampling layers.

It is also preferable that the speech decoder does not input the speaker feature to layers preceding a predetermined layer among the two or more sampling layers, and inputs the speaker feature to layers following the predetermined layer.

It is also preferable that the speech decoder samples from the posterior distribution in layers preceding the predetermined layer among the two or more sampling layers, and samples from the prior distribution in layers following the predetermined layer.

It is also preferable that the speech decoder inputs the speaker feature to a conditional instance normalization layer.

Another aspect of the present invention is a speech processing program that causes a computer to function as a speech processing device comprising: an acoustic feature extractor that converts speech into an acoustic feature; a speaker encoder that converts a speaker label of speech into a speaker feature; a speech encoder including a variational autoencoder having two or more sampling layers that converts, into a latent representation, a source acoustic feature obtained by converting the speech of a source speaker in the acoustic feature extractor and a source speaker feature obtained by converting the speaker label of the source speaker in the speaker encoder; a speech decoder including a variational autoencoder having two or more sampling layers that generates a target acoustic feature using at least the latent representation and a target speaker feature obtained by converting the speaker label of a target speaker in the speaker encoder; and a vocoder that converts the target acoustic feature generated by the speech decoder into speech.

Here, the speech processing program is characterized in that the speech encoder, the speech decoder, and the speaker encoder have been trained so as to reduce the distance between the acoustic feature input to the speech encoder and the acoustic feature generated by the speech decoder.

It is also preferable that, in the speech decoder, the layers to which the target speaker feature is input are limited among the two or more sampling layers.

It is also preferable that the speech decoder does not input the target speaker feature to layers preceding a predetermined layer among the two or more sampling layers, and inputs the target speaker feature to layers following the predetermined layer.

It is also preferable that the speech decoder inputs the target speaker feature to a conditional instance normalization layer.

Another aspect of the present invention is a speech processing learning device comprising: an acoustic feature extractor that converts speech into an input acoustic feature; a speaker encoder that converts a speaker label of the speech into a speaker feature; a speech encoder including a variational autoencoder having two or more sampling layers that converts the acoustic feature and the speaker feature into a latent representation; and a speech decoder including a variational autoencoder having two or more sampling layers that generates an acoustic feature using at least the latent representation and the speaker feature, wherein the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic feature input to the speech encoder and the output acoustic feature generated by the speech decoder.

Another aspect of the present invention is a speech processing device comprising: an acoustic feature extractor that converts speech into an acoustic feature; a speaker encoder that converts a speaker label of speech into a speaker feature; a speech encoder including a variational autoencoder having two or more sampling layers that converts, into a latent representation, a source acoustic feature obtained by converting the speech of a source speaker in the acoustic feature extractor and a source speaker feature obtained by converting the speaker label of the source speaker in the speaker encoder; a speech decoder including a variational autoencoder having two or more sampling layers that generates a target acoustic feature using at least the latent representation and a target speaker feature obtained by converting the speaker label of a target speaker in the speaker encoder; and a vocoder that converts the target acoustic feature generated by the speech decoder into speech.

Another aspect of the present invention is a speech processing learning method for a speech processing learning device comprising: an acoustic feature extractor that converts speech into an input acoustic feature; a speaker encoder that converts a speaker label of the speech into a speaker feature; a speech encoder including a variational autoencoder having two or more sampling layers that converts the acoustic feature and the speaker feature into a latent representation; and a speech decoder including a variational autoencoder having two or more sampling layers that generates an acoustic feature using at least the latent representation and the speaker feature, wherein the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic feature input to the speech encoder and the output acoustic feature generated by the speech decoder.

Another aspect of the present invention is a speech processing method that converts the speech of a source speaker into the speech of a target speaker by using a speech processing device comprising: an acoustic feature extractor that converts speech into an acoustic feature; a speaker encoder that converts a speaker label of speech into a speaker feature; a speech encoder including a variational autoencoder having two or more sampling layers that converts, into a latent representation, a source acoustic feature obtained by converting the speech of the source speaker in the acoustic feature extractor and a source speaker feature obtained by converting the speaker label of the source speaker in the speaker encoder; a speech decoder including a variational autoencoder having two or more sampling layers that generates a target acoustic feature using at least the latent representation and a target speaker feature obtained by converting the speaker label of the target speaker in the speaker encoder; and a vocoder that converts the target acoustic feature generated by the speech decoder into speech.

According to the present invention, it is possible to provide a speech processing learning program, a speech processing learning device, and a speech processing learning method that appropriately convert speech uttered by an arbitrary speaker into speech uttered by a target speaker. Other objects of the embodiments of the present invention will become apparent by reference to the specification as a whole.

FIG. 1 is a diagram showing the configuration of the speech processing device according to the embodiment of the present invention.
FIG. 2 is a functional block diagram showing the configuration of the speech processing learning device according to the embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of a Variational Auto-Encoder.
FIG. 4 is a diagram showing the configuration of a Nouveau Variational Auto-Encoder.
FIG. 5 is a diagram showing the configuration of the neural network of each layer of the Variational Auto-Encoder according to the embodiment of the present invention.
FIG. 6 is a diagram for explaining the speech learning process according to the embodiment of the present invention.
FIG. 7 is a functional block diagram showing the configuration of the speech learning device according to the embodiment of the present invention.
FIG. 8 is a diagram for explaining the speech processing according to the embodiment of the present invention.

As shown in FIG. 1, the speech processing device 100 according to the embodiment of the present invention includes a processing unit 10, a storage unit 12, an input unit 14, an output unit 16, and a communication unit 18. The processing unit 10 includes means for performing arithmetic processing, such as a CPU. By executing the speech processing learning program stored in the storage unit 12, the processing unit 10 performs the learning for speech processing in this embodiment. By executing the speech processing program stored in the storage unit 12, the processing unit 10 also implements the functions related to speech processing in this embodiment. The storage unit 12 includes storage means such as a semiconductor memory or a memory card. The storage unit 12 is connected to the processing unit 10 so as to be accessible from it, and stores the speech processing learning program, the speech processing program, and the information necessary for their processing. The input unit 14 includes means for inputting information, for example a keyboard, a touch panel, or buttons that receive information input from a user.

The input unit 14 also includes voice input means for receiving the speech of an arbitrary speaker and of a predetermined target speaker. The voice input means may include, for example, a microphone and an amplifier circuit. The output unit 16 includes means for outputting a user interface (UI) screen for accepting input information from an administrator, as well as processing results. The output unit 16 includes, for example, a display that presents images. The output unit 16 also includes voice output means for outputting the synthesized speech generated by the speech processing device 100. The voice output means may include, for example, a speaker and an amplifier. The communication unit 18 includes an interface for exchanging information with an external terminal (not shown) via the network 102. Communication by the communication unit 18 may be wired or wireless. The speech information used for speech processing may also be obtained from an external terminal via the communication unit 18.

The speech processing device 100 performs speech processing that converts speech uttered by an arbitrary speaker into the voice quality of a predetermined speaker (the target speaker). The speech processing device 100 also functions as a speech processing learning device that performs the learning for this speech processing.

[Speech learning process]
FIG. 2 is a functional block diagram showing the configuration of the speech processing device 100 during speech processing learning. The speech processing device 100 functions as a speech analysis unit 20, a speaker encoder 22, a speech encoder 24, a speech decoder 26, and a learner 28. Specifically, by executing the speech processing learning program, the speech processing device 100 functions as a speech processing learning device that implements the speech learning method described below.

The speech analysis unit 20 acquires speech data and functions as an acoustic feature extractor that extracts acoustic features from the speech data. That is, the processing unit 10 of the speech processing device 100 functions as the speech analysis unit 20. The speech data may be obtained by converting a speaker's voice into speech data using the microphone of the input unit 14. Alternatively, speech data recorded in advance on an external computer or the like may be received via the communication unit 18. The acquired speech data is stored in the storage unit 12.

Speech data is acquired for speech uttered by arbitrary speakers. In the speech learning process, the speech encoder 24 and the speech decoder 26 are trained using speech from many speakers. The speech obtained from each speaker need not have the same content.

The speech analysis unit 20 also performs the speech analysis required for speech processing. For example, the speech analysis unit 20 performs cepstrum analysis based on the frequency characteristics of the input speech and obtains acoustic features such as mel-frequency cepstral coefficients (MFCC), which contain information on the spectral envelope (information indicating, for example, the thickness of the voice) and on the fine structure, and the fundamental frequency and resonance frequencies of the speech (information indicating, for example, the pitch and hoarseness of the voice). The acoustic feature can be, for example, a point in an (80×T)-dimensional Euclidean space, where T is the length of the speech segment. Specifically, the speech analysis unit 20 generates and outputs an acoustic feature x_i from speech uttered by the speaker whose speaker ID (speaker label) is i. The acoustic features extracted by the speech analysis unit 20 are input to the speech encoder 24 and the learner 28.

The speaker encoder 22 converts the ID of the speaker of the speech input to the speech analysis unit 20 into a speaker feature that can be used for speech processing, and outputs it. The speaker encoder 22 can include an embedding module that converts a speaker ID into a speaker feature. For example, for the speaker whose speaker ID is i, the speaker encoder 22 generates and outputs the speaker feature y_i. The speaker feature generated by the speaker encoder 22 is input to the speech encoder 24 and the speech decoder 26.
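The embedding module mentioned above can be pictured as a lookup table with one vector per speaker ID. The following is a minimal sketch only; the function names, the table size, and the random initialization are assumptions for illustration and are not taken from the specification.

```python
import random

def make_speaker_embedding(num_speakers, dim, seed=0):
    """Build a lookup table with one vector per speaker ID (sizes are hypothetical)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_speakers)]

def speaker_encoder(table, speaker_id):
    """Map a speaker ID i to its speaker feature y_i by table lookup."""
    return table[speaker_id]

table = make_speaker_embedding(num_speakers=4, dim=8)
y_2 = speaker_encoder(table, 2)  # speaker feature for speaker ID 2
```

In a trained system the table entries would be parameters updated together with the rest of the network, rather than fixed random values.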

Training of the speech processing device 100 uses a set of pairs (x_i, y_i) of the acoustic feature x_i and the speaker feature y_i obtained from speech uttered by a plurality of speakers.

The speech encoder 24 receives the acoustic feature and the speaker feature and converts them into a latent representation. The speech decoder 26 receives the latent representation obtained by the speech encoder 24 together with the speaker feature and converts them into an acoustic feature. The latent representation represents the linguistic features of the input speech data.

As shown in FIG. 2, the speech encoder 24 and the speech decoder 26 receive the acoustic feature x_i from the speech analysis unit 20, convert it into a latent representation z in the speech encoder 24, and reconstruct an acoustic feature x̂_i from the latent representation z in the speech decoder 26. They are trained so that the output acoustic feature x̂_i restores the input acoustic feature x_i.

In this embodiment, the speech encoder 24 and the speech decoder 26 are implemented as a Variational Auto-Encoder (VAE). A VAE is a variational autoencoder that, as shown in FIG. 3, generates a latent representation by sampling based on a probability distribution. The probability distribution is assumed to be a normal distribution defined by a mean μ and a variance σ. A VAE consists of an encoder that generates a latent representation z from an input X by sampling based on the mean μ and the variance σ, and a decoder that generates an output X̂ from the latent representation z. In the VAE, the speaker encoder 22, the speech encoder 24, and the speech decoder 26 are trained so that the reconstruction error (reconstruction distance) E between the input X and the output X̂ becomes small.
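The sampling step and the reconstruction error can be illustrated with a short sketch. It assumes, following the text, that σ denotes a variance, and it uses mean squared error as one possible distance E; the specification does not fix a particular distance, so that choice and the function names are assumptions.

```python
import math
import random

def sample_latent(mu, var, rng):
    """Draw z ~ N(mu, var) elementwise as z = mu + sqrt(var) * eps, with eps ~ N(0, 1)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

def reconstruction_error(x, x_hat):
    """Mean squared error, used here as an assumed form of the distance E."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

rng = random.Random(0)
# The second component has zero variance, so it is returned exactly at its mean.
z = sample_latent(mu=[0.0, 1.0], var=[1.0, 0.0], rng=rng)
```

Writing the sample as the mean plus scaled noise is the usual way to keep the sampling step differentiable with respect to μ and σ during training.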

A general VAE is composed of a single-layer neural network, but in this embodiment it is preferable to use, as shown in FIG. 4, a Nouveau Variational Auto-Encoder (NVAE) composed of neural networks in two or more layers. That is, the NVAE includes a variational autoencoder having two or more sampling layers. For example, in the speech processing device 100, it is preferable to configure each of the speech encoder 24 and the speech decoder 26 as a neural network with n = 35 layers.

As shown in FIG. 5, each layer of the NVAE of the speech encoder 24 and the speech decoder 26 is configured by combining a Conditional Instance Normalization layer (CIN layer), a Convolution layer (CONV layer), and a Squeeze-and-Excitation layer (SE layer). The CIN layer is provided in place of the batch normalization layer (BN layer) of a general NVAE. The CIN layer is a kind of normalization layer: a conditional instance normalization layer that normalizes with parameters set differently for each style. In this embodiment, the CIN layer takes the speaker feature as one of its inputs and performs normalization conditioned on the input speaker feature. The Swish activation function is the activation function expressed as f(x) = x / (1 + e^(−βx)). The Convolution layer applies a convolution operation to its input and outputs the result to the next layer. The SE layer adaptively applies attention to its input based on the relationships between channels and outputs weighted features.
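A small sketch may help make the Swish function and conditional instance normalization concrete. Here each channel is normalized over time and then scaled and shifted with per-channel parameters that would depend on the speaker feature. How the scale and shift are derived from y_i is not described in the text, so they are passed in directly; the names `scale`, `shift`, and `eps` are assumptions.

```python
import math

def swish(x, beta=1.0):
    """Swish activation f(x) = x / (1 + exp(-beta * x))."""
    return x / (1.0 + math.exp(-beta * x))

def conditional_instance_norm(channels, scale, shift, eps=1e-5):
    """Normalize each channel over time, then apply speaker-dependent scale and shift.

    channels: list of channels, each a list of values over time.
    scale, shift: one value per channel, assumed to come from the speaker feature.
    """
    out = []
    for ch, g, b in zip(channels, scale, shift):
        mean = sum(ch) / len(ch)
        var = sum((v - mean) ** 2 for v in ch) / len(ch)
        out.append([g * (v - mean) / math.sqrt(var + eps) + b for v in ch])
    return out

normalized = conditional_instance_norm([[1.0, 3.0]], scale=[2.0], shift=[0.5])
```

Because the statistics are computed per utterance and per channel, changing only `scale` and `shift` changes the speaker-dependent style of the output while the normalized content stays the same.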

The speech learning process in the speech processing device 100 will be described with reference to FIG. 6. The example shown has the speech encoder 24 and the speech decoder 26 each configured as a neural network with n layers. The number of layers n can be, for example, 35. Each layer is configured by combining the Conditional Instance Normalization layer (CIN layer), the Convolution layer (CONV layer), and the Squeeze-and-Excitation layer (SE layer) shown in FIG. 5. The latent representation output from layer k of the speech encoder 24 (where k is a layer number from 1 to n) is denoted h_k, and the latent representation output from layer k of the speech decoder 26 is denoted z_k.

In the speech encoder 24, the acoustic feature x_i and the speaker feature y_i are input to layer n, and the latent representation h_n is output. In the next layer n−1, the latent representation h_n output from the preceding layer n and the speaker feature y_i are input, and the latent representation h_{n-1} is output. Similarly, in layer k, the latent representation h_{k+1} output from the preceding layer k+1 and the speaker feature y_i are input, and the latent representation h_k is output. In layer 1, the final stage, the latent representation h_2 output from the preceding layer 2 and the speaker feature y_i are input, and the latent representation h_1 is output. The latent representation z_1 of layer 1, the first stage of the speech decoder 26, is sampled from h_1. Thus, in the speech encoder 24, it is preferable to include the speaker feature y_i in the input of every layer 1 to n.
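The order in which the encoder layers are evaluated can be sketched as a simple loop from layer n down to layer 1. The per-layer functions below are toy stand-ins for the neural networks, and the names `run_encoder` and `toy_layers` are hypothetical.

```python
def run_encoder(x_i, y_i, layers):
    """Run encoder layers n, n-1, ..., 1 and return a dict {k: h_k}.

    layers: dict mapping k to a function (prev, y_i) -> h_k, standing in
    for the per-layer neural network.
    """
    n = max(layers)
    h = {}
    prev = x_i  # layer n receives the acoustic feature x_i
    for k in range(n, 0, -1):
        h[k] = layers[k](prev, y_i)  # every layer also receives y_i
        prev = h[k]
    return h

# Toy layers: each adds the speaker feature and its own layer number.
toy_layers = {k: (lambda prev, y, k=k: prev + y + k) for k in (1, 2, 3)}
h = run_encoder(x_i=0.0, y_i=0.5, layers=toy_layers)
```

The returned dictionary keeps every h_k, since the decoder side refers to the encoder output of the matching layer.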

In the speech decoder 26, the latent representation z_1 is input to layer 1, the first stage, and the latent representation z_2 is output. The latent representation z_k in layer k of the speech decoder 26 can be obtained by sampling from the prior distribution p(z_k | z_{k-1}, h_k, y_i), which is based on the latent representation z_{k-1} of the preceding layer k−1 of the speech decoder 26, the latent representation h_k of the k-th layer of the speech encoder 24, and the speaker feature y_i. Alternatively, z_k can be obtained by sampling from the posterior distribution p(z_k | z_{k-1}, z_{k-2}, ..., z_1, h_k), which is based on the latent representations z_{k-1}, z_{k-2}, ..., z_1 of the earlier layers k−1, k−2, ..., 1 of the speech decoder 26 and the latent representation h_k of the k-th layer of the speech encoder 24. Here, the distribution p(a | b) is a likelihood function indicating how plausible it is that a is output given b as a precondition.

In the speech learning process, sampling from the speech encoder 24 is performed across the layers of the speech decoder 26, from those close to its output to those far from it. That is, as shown in FIG. 6, it is preferable to sample from the latent representation h_k of the k-th layer of the speech encoder 24 in all layers 1 to n. It is also preferable not to include the speaker feature y_i in the input for sampling from the posterior distribution.

That is, in the speech decoder 26, it is preferable to include the speaker feature y_i in the input only at layers close to the output, and not at layers far from the output. In this case, it is preferable to include the speaker feature y_i in the input at layers that sample from the prior distribution without sampling from the speech encoder 24, and to exclude it at layers that sample from the speech encoder 24, that is, from the posterior distribution.
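The per-layer rule above can be written down as a small lookup. The sketch below is illustrative only: `plan_decoder_layers`, `LayerPlan`, and the `boundary` parameter are hypothetical names, and the sketch assumes a single boundary layer separating the posterior-sampled layers from the prior-sampled layers.

```python
from typing import List, NamedTuple


class LayerPlan(NamedTuple):
    layer: int          # decoder layer index k (1 = first stage, n = output side)
    source: str         # "posterior" or "prior"
    uses_speaker: bool  # whether the speaker feature is fed to this layer


def plan_decoder_layers(n: int, boundary: int) -> List[LayerPlan]:
    """Layers below `boundary` sample from the posterior without the speaker
    feature; layers from `boundary` to n sample from the prior with it."""
    if not 1 <= boundary <= n + 1:
        raise ValueError("boundary must be in 1..n+1")
    plan = []
    for k in range(1, n + 1):
        near_output = k >= boundary
        plan.append(LayerPlan(k, "prior" if near_output else "posterior", near_output))
    return plan


# Five layers, with the two layers nearest the output conditioned on the speaker.
plan = plan_decoder_layers(n=5, boundary=4)
```

Setting `boundary` to n+1 yields a decoder in which every layer samples from the posterior and none receives the speaker feature.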

In layers whose sampling does not include the speaker feature y_i, the speaker feature y_i is not input to the conditional instance normalization (CIN) layer.
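As an illustration of how a CIN layer can condition on the speaker, the sketch below normalizes each channel over time and then applies a per-channel scale and shift. It assumes that the scale `gamma` and shift `beta` are derived from the speaker feature y_i elsewhere, and that a layer receiving no speaker feature falls back to plain instance normalization. Both points, and the function name, are assumptions rather than details given in this description.

```python
import math
from typing import List, Optional, Sequence


def conditional_instance_norm(
    features: Sequence[Sequence[float]],     # layout: [channel][time]
    gamma: Optional[Sequence[float]] = None,  # per-channel scale from the speaker feature
    beta: Optional[Sequence[float]] = None,   # per-channel shift from the speaker feature
    eps: float = 1e-5,
) -> List[List[float]]:
    """Normalize each channel over time, then apply the speaker-dependent
    scale and shift if they are provided."""
    out = []
    for c, channel in enumerate(features):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        inv_std = 1.0 / math.sqrt(var + eps)
        g = 1.0 if gamma is None else gamma[c]
        b = 0.0 if beta is None else beta[c]
        out.append([g * (v - mean) * inv_std + b for v in channel])
    return out


frames = [[1.0, 3.0]]
conditioned = conditional_instance_norm(frames, gamma=[2.0], beta=[5.0])
unconditioned = conditional_instance_norm(frames)
```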

With this configuration, the learning device 28 adjusts the parameters (such as the weight coefficients and biases of each neuron) of the neural networks at each layer of the speaker encoder 22, the speech encoder 24, and the speech decoder 26 so that the error (distance) between the acoustic feature x_i input to the speech encoder 24 and the reconstructed acoustic feature x̂_i output from the speech decoder 26 becomes small.
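The idea of adjusting parameters to shrink the reconstruction error can be shown on a toy problem. The sketch below is not the training procedure of the learning device 28: it assumes mean squared error as the distance and a one-parameter "decoder" x̂ = w * z, and the names `reconstruction_distance` and `train_toy_decoder` are hypothetical.

```python
from typing import List, Sequence, Tuple


def reconstruction_distance(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between input and reconstructed features (assumed metric)."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def train_toy_decoder(
    z: Sequence[float], x: Sequence[float], w: float = 0.0, lr: float = 0.1, steps: int = 20
) -> Tuple[float, List[float]]:
    """Gradient descent on a single weight w so that w * z approaches x."""
    history = []
    for _ in range(steps):
        x_hat = [w * zj for zj in z]
        history.append(reconstruction_distance(x, x_hat))
        # Derivative of the mean squared error with respect to w.
        grad = 2.0 * sum((xh - xj) * zj for xh, xj, zj in zip(x_hat, x, z)) / len(x)
        w -= lr * grad
    return w, history


w, history = train_toy_decoder(z=[1.0, 2.0], x=[2.0, 4.0])
```

In this toy case the weight approaches 2, the value at which the reconstruction matches the input exactly, and the recorded distance falls at every step.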

Here, the boundary layer in the speech decoder 26, between the layers that sample from the prior distribution taking the speaker feature y_i into account and the layers that sample from the posterior distribution without taking it into account, may be set as appropriate so that the error (distance) between the acoustic feature x_i input to the speech encoder 24 and the reconstructed acoustic feature x̂_i output from the speech decoder 26 becomes small.

As described above, the speech encoder 24 and the speech decoder 26 are trained so that the speech represented by the acoustic feature x_i input to the speech encoder 24 and the speech represented by the acoustic feature x̂_i reconstructed in the speech decoder 26 approach each other.

[Speech processing]
FIG. 7 is a functional block diagram showing the configuration of the speech processing device 100 during speech processing, in which speech uttered by a source speaker is converted so that it sounds as if it were uttered by a target speaker. The speech processing device 100 functions as the speech analysis unit 20, the speaker encoder 22, the speech encoder 24, the speech decoder 26, and the vocoder 30. Specifically, the speech processing device 100 implements the following speech processing by executing a speech processing program.

The speech analysis unit 20 acquires speech data of speech uttered by the source speaker and performs the speech analysis required for speech processing. The acoustic features extracted by the speech analysis unit 20 are input to the speech encoder 24.

The speaker encoder 22 converts the IDs of the source speaker and the target speaker into speaker features usable for speech processing and outputs them. When the source speaker is the speaker with ID s, the speaker encoder 22 generates the source speaker feature y_s and outputs it to the speech encoder 24. When the target speaker is the speaker with ID t, the speaker encoder 22 generates the target speaker feature y_t and outputs it to the speech decoder 26.
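One simple way to map a speaker ID to a speaker feature is a lookup table with one vector per speaker. The sketch below assumes such a table and fills it with random values in place of learned ones. The class name, the vector dimension, and the seed are all hypothetical, and the description does not state that the speaker encoder 22 is implemented this way.

```python
import random
from typing import Dict, Hashable, Iterable, List


class SpeakerEncoder:
    """Hypothetical lookup-table speaker encoder: one feature vector per speaker ID."""

    def __init__(self, speaker_ids: Iterable[Hashable], dim: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Random initial values; in training these would be adjusted with the other networks.
        self.table: Dict[Hashable, List[float]] = {
            sid: [rng.gauss(0.0, 1.0) for _ in range(dim)] for sid in speaker_ids
        }

    def __call__(self, speaker_id: Hashable) -> List[float]:
        return list(self.table[speaker_id])


speaker_encoder = SpeakerEncoder(["s", "t"])
y_s = speaker_encoder("s")  # source speaker feature, passed to the speech encoder
y_t = speaker_encoder("t")  # target speaker feature, passed to the speech decoder
```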

The speech encoder 24 receives the acoustic features obtained from the source speaker's speech together with the source speaker feature, and converts them into latent representations. The speech decoder 26 receives the latent representations obtained by the speech encoder 24 together with the target speaker feature, and reconstructs acoustic features from them.

Speech processing in the speech processing device 100 will be described with reference to FIG. 8. The speech processing is performed using the speech encoder 24 and the speech decoder 26 trained in the speech learning process described above.

In the speech encoder 24, the acoustic feature x_s obtained from the source speaker's speech and the source speaker feature y_s are input to layer n, which outputs the latent representation h_n. Thereafter, as in training, layer k receives the latent representation h_{k+1} output from the preceding layer k+1 together with the source speaker feature y_s, and outputs the latent representation h_k. Layer 1, the final stage, receives the latent representation h_2 output from the preceding layer 2 together with the source speaker feature y_s, and outputs the latent representation h_1. The latent representation z_1 of layer 1, the first stage of the speech decoder 26, is sampled from h_1.

In the speech decoder 26, the latent representation z_1 is input to layer 1, the first stage, which outputs the latent representation z_2. At layers far from the output of the speech decoder 26, the target speaker feature y_t is not included in the input, and sampling is performed from the posterior distribution p(z_k | z_{k-1}, z_{k-2}, ..., z_1, h_k), which is based on the latent representations z_{k-1}, z_{k-2}, ..., z_1 of the earlier layers k-1, k-2, ..., 1 of the speech decoder 26 and the latent representation h_k of the k-th layer of the speech encoder 24. At layers close to the output of the speech decoder 26, sampling from the speech encoder 24 is not performed. Instead, sampling is performed from the prior distribution p(z_k | z_{k-1}, y_t), which is based on the latent representation z_{k-1} of the immediately preceding layer k-1 and the target speaker feature y_t. FIG. 8 shows an example in which sampling from the prior distribution is performed at layers n-1 and n of the speech decoder 26. In this case, it is preferable to include the target speaker feature y_t, rather than the source speaker feature y_s, in the input when sampling from the prior distribution.
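The top-down pass used for conversion can be sketched as follows. The sketch assumes Gaussian sampling with a fixed standard deviation and uses toy mean functions in place of the learned networks. `decode_for_conversion`, `toy_posterior_mean`, `toy_prior_mean`, and the `boundary` parameter are hypothetical names introduced only for illustration.

```python
import random
from typing import Dict, List, Sequence

Vector = List[float]


def sample_gaussian(mean: Sequence[float], std: float, rng: random.Random) -> Vector:
    return [rng.gauss(m, std) for m in mean]


def toy_posterior_mean(z_earlier: Sequence[Vector], h_k: Vector) -> Vector:
    """Stand-in for the posterior network: uses z_1..z_{k-1} and h_k, no speaker feature."""
    avg = [sum(col) / len(z_earlier) for col in zip(*z_earlier)]
    return [(a + b) / 2.0 for a, b in zip(avg, h_k)]


def toy_prior_mean(z_prev: Vector, y_t: Vector) -> Vector:
    """Stand-in for the prior network: uses z_{k-1} and the target speaker feature y_t."""
    return [a + b for a, b in zip(z_prev, y_t)]


def decode_for_conversion(
    h: Dict[int, Vector], y_t: Vector, boundary: int, rng: random.Random, std: float = 0.1
) -> Dict[int, Vector]:
    """Layers below `boundary` sample from the posterior; the rest sample from the prior."""
    n = max(h)
    z = {1: sample_gaussian(h[1], std, rng)}  # z_1 is sampled from h_1
    for k in range(2, n + 1):
        if k < boundary:
            mean = toy_posterior_mean([z[j] for j in range(1, k)], h[k])
        else:
            mean = toy_prior_mean(z[k - 1], y_t)  # encoder output h_k is not used here
        z[k] = sample_gaussian(mean, std, rng)
    return z


# Four layers; layers 3 and 4 (n-1 and n) sample from the prior, as in the FIG. 8 example.
h = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [2.0, 2.0], 4: [3.0, 3.0]}
z_a = decode_for_conversion(h, [1.0, 0.0], boundary=3, rng=random.Random(0))
z_b = decode_for_conversion(h, [0.0, 1.0], boundary=3, rng=random.Random(0))
```

With the same random seed, the two runs agree at layers 1 and 2 and differ at layers 3 and 4, since only the prior-sampled layers see the target speaker feature.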

Through the speech processing in the speech encoder 24 and the speech decoder 26, layer n, the final stage of the speech decoder 26, constructs and outputs the acoustic feature x_t, in which the acoustic feature x_s obtained from the source speaker's speech is adapted to the target speaker's voice.

The vocoder 30 converts the acoustic feature x_t output from the speech decoder 26 into speech data and outputs it. The vocoder 30 can convert the acoustic feature x_t into speech data by performing the inverse of the process by which the speech analysis unit 20 extracts acoustic features from speech data.

As described above, the present embodiment can provide a speech processing device, a speech processing program, and a speech processing method, as well as a speech learning processing device, a speech learning processing program, and a speech learning processing method, that appropriately convert speech uttered by an arbitrary speaker into the voice quality of speech uttered by a target speaker. That is, the speech processing device 100, which includes the trained speech encoder 24 and speech decoder 26, can implement speech processing that converts speech uttered by the source speaker into speech that sounds as if it were uttered by the target speaker.

In particular, applying a Nouveau Variational Auto-Encoder (NVAE) to the speech encoder 24 and the speech decoder 26 makes it possible to convert the source speaker's speech into speech that sounds more naturally like the target speaker's than was possible with conventional techniques.

10 processing unit, 12 storage unit, 14 input unit, 16 output unit, 18 communication unit, 20 speech analysis unit, 22 speaker encoder, 24 speech encoder, 26 speech decoder, 28 learning device, 30 vocoder, 100 speech processing device, 102 network.

Claims

A speech processing learning program that causes a computer to function as a speech processing learning device comprising:
an acoustic feature extractor that converts speech into input acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having two or more sampling hierarchies that converts the input acoustic features and the speaker features into a latent representation; and
a speech decoder including a variational autoencoder having two or more sampling hierarchies that generates acoustic features using at least the latent representation and the speaker features,
wherein the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic features input to the speech encoder and the output acoustic features generated by the speech decoder.

The speech processing learning program according to claim 1,
wherein, in the speech decoder, the hierarchies to which the speaker features are input are limited among the two or more sampling hierarchies.

The speech processing learning program according to claim 2,
wherein, among the two or more sampling hierarchies, the speech decoder does not input the speaker features to hierarchies preceding a predetermined hierarchy and inputs the speaker features to hierarchies following the predetermined hierarchy.

The speech processing learning program according to claim 3,
wherein, among the two or more sampling hierarchies, the speech decoder samples from the posterior distribution at hierarchies preceding the predetermined hierarchy and samples from the prior distribution at hierarchies following the predetermined hierarchy.

The speech processing learning program according to any one of claims 1 to 4,
wherein the speech decoder inputs the speaker features to a conditional instance normalization layer.

A speech processing program that causes a computer to function as a speech processing device comprising:
an acoustic feature extractor that converts speech into acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having two or more sampling hierarchies that converts, into a latent representation, source acoustic features obtained by converting speech of a source speaker in the acoustic feature extractor and source speaker features obtained by converting a speaker label of the source speaker in the speaker encoder;
a speech decoder including a variational autoencoder having two or more sampling hierarchies that generates target acoustic features using at least the latent representation and target speaker features obtained by converting a speaker label of a target speaker in the speaker encoder; and
a vocoder that converts the target acoustic features generated by the speech decoder into speech.

The speech processing program according to claim 6,
wherein the speech encoder, the speech decoder, and the speaker encoder have been trained so as to reduce the distance between the acoustic features input to the speech encoder and the acoustic features generated by the speech decoder.

The speech processing program according to claim 6 or 7,
wherein, in the speech decoder, the hierarchies to which the target speaker features are input are limited among the two or more sampling hierarchies.

The speech processing program according to claim 8,
wherein, among the two or more sampling hierarchies, the speech decoder does not input the target speaker features to hierarchies preceding a predetermined hierarchy and inputs the target speaker features to hierarchies following the predetermined hierarchy.

The speech processing program according to claim 9,
wherein, among the two or more sampling hierarchies, the speech decoder samples from the posterior distribution at hierarchies preceding the predetermined hierarchy and samples from the prior distribution at hierarchies following the predetermined hierarchy.

The speech processing program according to any one of claims 6 to 11,
wherein the speech decoder inputs the target speaker features to a conditional instance normalization layer.

A speech processing learning device comprising:
an acoustic feature extractor that converts speech into input acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having two or more sampling hierarchies that converts the input acoustic features and the speaker features into a latent representation; and
a speech decoder including a variational autoencoder having two or more sampling hierarchies that generates acoustic features using at least the latent representation and the speaker features,
wherein the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic features input to the speech encoder and the output acoustic features generated by the speech decoder.

A speech processing device comprising:
an acoustic feature extractor that converts speech into acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having two or more sampling hierarchies that converts, into a latent representation, source acoustic features obtained by converting speech of a source speaker in the acoustic feature extractor and source speaker features obtained by converting a speaker label of the source speaker in the speaker encoder;
a speech decoder including a variational autoencoder having two or more sampling hierarchies that generates target acoustic features using at least the latent representation and target speaker features obtained by converting a speaker label of a target speaker in the speaker encoder; and
a vocoder that converts the target acoustic features generated by the speech decoder into speech.

A speech processing learning method for a speech processing learning device comprising:
an acoustic feature extractor that converts speech into input acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having two or more sampling hierarchies that converts the input acoustic features and the speaker features into a latent representation; and
a speech decoder including a variational autoencoder having two or more sampling hierarchies that generates acoustic features using at least the latent representation and the speaker features,
the method comprising training the speech encoder, the speech decoder, and the speaker encoder so as to reduce the distance between the input acoustic features input to the speech encoder and the output acoustic features generated by the speech decoder.

A speech processing method comprising converting speech of a source speaker into speech of a target speaker using a speech processing device comprising:
an acoustic feature extractor that converts speech into acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having two or more sampling hierarchies that converts, into a latent representation, source acoustic features obtained by converting the speech of the source speaker in the acoustic feature extractor and source speaker features obtained by converting a speaker label of the source speaker in the speaker encoder;
a speech decoder including a variational autoencoder having two or more sampling hierarchies that generates target acoustic features using at least the latent representation and target speaker features obtained by converting a speaker label of the target speaker in the speaker encoder; and
a vocoder that converts the target acoustic features generated by the speech decoder into speech.