CN114299917A - StyleGAN emotion voice conversion method based on fundamental frequency difference compensation


Info

Publication number
CN114299917A
Authority
CN
China
Prior art keywords
emotion
style
fundamental frequency
generator
stylegan
Legal status
Pending
Application number
CN202210004168.4A
Other languages
Chinese (zh)
Inventor
李燕萍
于杰
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date: 2022-01-04
Filing date: 2022-01-04
Publication date: 2022-04-08
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202210004168.4A
Publication of CN114299917A

Abstract

The invention discloses a StyleGAN (style-based generative adversarial network) emotion voice conversion method based on fundamental frequency difference compensation. First, the emotion style features extracted by a style encoder are provided as label information, and adaptive instance normalization is adopted to fuse the emotion style features with the content features, so that the generator can fully learn the style characteristics of the target emotion and can convert between emotions that do not appear in the training set, i.e. conversion between arbitrary emotions under the open-set condition is accomplished. Furthermore, fundamental frequency difference compensation is proposed on the basis of traditional logarithmic Gaussian normalized fundamental frequency conversion, which enhances the amplitude differences between different emotions and solves the problems that, after traditional logarithmic Gaussian normalized fundamental frequency conversion, the fundamental frequency only rises as a whole and that the mean and mean square deviation cannot accurately describe the differences in fundamental frequency envelope amplitude between different emotions. The method effectively improves the emotion saturation of the converted speech and realizes high-quality emotion voice conversion under the open-set condition.

Description

StyleGAN emotion voice conversion method based on fundamental frequency difference compensation
Technical Field
The invention relates to the technical field of voice conversion, in particular to a StyleGAN emotion voice conversion method based on fundamental frequency difference compensation.
Background
Voice conversion is an important research branch in the field of speech signal processing; it aims to convert the individual characteristics of a source speaker into those of a target speaker while keeping the speech content unchanged. Speech conveys information through language and prosody, and prosody not only affects the syntactic and semantic interpretation of an utterance (linguistic prosody) but also conveys the speaker's emotional state (emotional prosody). Traditional voice conversion mainly focuses on converting the speaker's personality characteristics and pays less attention to converting the speaker's prosody. Emotion voice conversion converts the emotional prosody of the source speech into the emotional prosody of the target while keeping the original speech content and the speaker's individual characteristics unchanged.
Early emotion voice conversion obtained the conversion result by training a spectral mapping between source and target sentences. J. Tao et al. proposed decomposing the pitch contour of the source speech into a hierarchical structure with a classification and regression tree and then converting it with a Gaussian Mixture Model (GMM) and a regression-based clustering method; R. Aihara et al. then proposed an exemplar-based conversion method that encodes the source speech with parallel exemplars and synthesizes the target speech; Ming et al. extended this approach into a unified exemplar-based emotion voice conversion framework that jointly learns the mapping of spectral features and Continuous Wavelet Transform (CWT)-based fundamental frequency features. With the emergence of deep learning, the performance of voice conversion has improved remarkably, and methods based on Deep Neural Networks (DNN), Deep Belief Networks (DBN), Highway Networks and Deep Bidirectional Long Short-Term Memory networks (DBLSTM) have been proposed one after another, realizing better conversion of spectral and prosodic features.
The emotion voice conversion methods discussed above are basically conversions under parallel corpora, but collecting a large amount of parallel emotional corpus is often difficult, time-consuming and labor-intensive; moreover, emotion voice conversion methods under the parallel-corpus condition usually require an alignment operation in the training stage, which inevitably introduces additional distortion and affects the performance of the conversion model. Considering the universality and practicality of an emotion voice conversion system, research on emotion voice conversion under the non-parallel text condition therefore has greater application value and practical significance. Accordingly, methods have been proposed to relax the requirement for parallel data, such as frameworks based on the Cycle-Consistent Generative Adversarial Network (CycleGAN), the Variational Auto-Encoder (VAE) and the Star Generative Adversarial Network (StarGAN), but these methods focus only on spectral feature conversion, while the fundamental frequency feature (F0) is simply converted with traditional logarithmic Gaussian normalization. Later, Kun Zhou et al. proposed a CWT-based CycleGAN and a CWT-based conditional variational auto-encoder generative adversarial network (VAWGAN), in which F0 is expanded in dimension through the CWT and fed into the model for training, but the conversion function still adopts traditional logarithmic Gaussian normalization. Georgios Rizos et al. then proposed an improved logarithmic Gaussian normalized fundamental frequency transfer function that introduces the difference between the mean and mean square deviation of the source emotion and the target emotion on the basis of traditional logarithmic Gaussian normalization; however, when the amplitude of the emotion changes greatly, the mean and mean square deviation cannot accurately describe the specific fluctuation difference of the fundamental frequency envelope amplitude between the two emotions, and the amplitude of the source emotion can only be shifted as a whole into the amplitude range of the target emotion.
Furthermore, most of the methods mentioned above perform emotion voice conversion in the closed-set situation, i.e. the corpus of the target emotion participates in training and the training data are labeled; in this case the quality of the converted speech is relatively good. In practical application scenarios, however, the target emotion may have only a small amount of corpus, or even a single utterance, participating in training, or may not participate in training at all. Such problems can be classified as emotion voice conversion in the open-set situation, i.e. arbitrary emotion voice conversion. Under the open-set situation, how to improve the sound quality and emotion saturation of the converted emotional speech has become a research hotspot and difficulty in this field.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a StyleGAN emotion voice conversion method based on fundamental frequency difference compensation. A style encoder is used to extract emotion style features that participate in training as labels, which realizes emotion voice conversion under the open-set condition and solves the problems that existing methods require labeled training data and are only suitable for the closed-set condition. Further, a fundamental frequency difference compensation vector is proposed, which solves the problems that, after traditional logarithmic Gaussian normalized fundamental frequency conversion, the converted fundamental frequency only shows an overall rise and that the mean and mean square deviation cannot accurately describe the amplitude differences between different emotions; this improves the emotion saturation of the converted speech and realizes high-quality emotion voice conversion under the open-set condition.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention firstly provides a StyleGAN emotion voice conversion method based on fundamental frequency difference compensation, which comprises a training stage and a conversion stage, wherein the training stage comprises the following steps:
step 1, obtaining a training corpus, wherein the training corpus consists of corpuses of various emotions of a speaker, and the emotions comprise source emotions and target emotions;
step 2, extracting frequency spectrum characteristics of different emotion voices from the training corpus to serve as acoustic characteristic vectors;
step 3, inputting the obtained acoustic feature vector into the StyleGAN-EVC network for training, and continuously optimizing its objective function until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network; the StyleGAN-EVC network comprises a generator G, a discriminator D and a style encoder S;
the generator G is divided into an encoder and a decoder, the encoder is used for generating content characteristics, and the decoder is used for reconstructing the obtained content characteristics and the style characteristics extracted by the style encoder S to generate reconstructed voice;
step 4, constructing a fundamental frequency conversion function from the source emotion to the target emotion, introducing a fundamental frequency difference compensation vector on the basis of the traditional logarithmic Gaussian normalized fundamental frequency conversion, and constructing a final fundamental frequency conversion function;
The conversion stage comprises the following steps:
step 5, selecting speech of a speaker in different emotions as the corpus to be converted, and respectively extracting, from the speech to be converted, the source-emotion Mel spectrum feature x_s and the target-emotion Mel spectrum feature x_t, the corresponding logarithmic fundamental frequency features log f0_s and log f0_t, and the corresponding aperiodic features AP_s and AP_t as acoustic feature vectors;
step 6, inputting the source-emotion spectral feature x_s and the target-emotion spectral feature x_t into the StyleGAN-EVC network trained in step 3 to reconstruct the converted-emotion spectral feature x_st;
step 7, converting the source-emotion logarithmic fundamental frequency feature log f0_s extracted in step 5 into the fundamental frequency feature f0_st of the target emotion through the fundamental frequency conversion function obtained in step 4;
step 9, synthesizing the converted emotional speech through a WORLD vocoder from the converted-emotion spectral feature x_st generated in step 6, the converted-emotion fundamental frequency feature f0_st obtained in step 7, and the source-emotion aperiodic feature AP_s extracted in step 5.
Further, the style encoder S is composed of a 5-layer 1-dimensional pooling module and a 5-layer 1-dimensional convolution module, wherein each layer of 1-dimensional pooling module is composed of average pooling, each layer of 1-dimensional convolution module includes convolution and ReLU activation functions, and the output layer is composed of fully-connected layers.
Further, the training process of step 3 includes the following steps:
Step 3-1, inputting the spectral feature x_s of the source emotion into the encoding network of the generator G to obtain the emotion-independent semantic feature G(x_s);
Step 3-2, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion;
Step 3-3, inputting the generated semantic feature G(x_s) and the style feature s_t of the target emotion into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the converted-emotion spectral feature x_st;
Step 3-4, inputting the spectral feature x_s of the source emotion into the style encoder S to obtain the style feature s_s of the source emotion;
Step 3-5, inputting the generated converted-emotion spectral feature x_st into the encoding network of the generator G again to obtain the emotion-independent semantic feature G(x_st);
Step 3-6, inputting the generated semantic feature G(x_st) and the style feature s_s of the source emotion into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the reconstructed spectral feature x̂_s of the source emotion;
Step 3-7, inputting the converted-emotion spectral feature x_st generated in step 3-3 into the discriminator D for training, minimizing the loss function of the discriminator D;
Step 3-8, inputting the converted-emotion spectral feature x_st generated in step 3-3 into the style encoder S for training, minimizing the style loss function of the style encoder S;
Step 3-9, returning to step 3-1 and repeating the above steps until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network.
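For clarity, the data flow of steps 3-1 to 3-6 can be sketched as follows. This is a hedged, non-authoritative PyTorch-style sketch: the encode/decode split on G, the callable style encoder S and all names are assumptions read off the description above, not code taken from the patent.

```python
def forward_flow(G, S, x_s, x_t):
    """One pass of the training data flow; G and S are assumed neural modules."""
    s_t = S(x_t)                 # step 3-2: style feature of the target emotion
    c_s = G.encode(x_s)          # step 3-1: emotion-independent content/semantic feature G(x_s)
    x_st = G.decode(c_s, s_t)    # step 3-3: converted-emotion spectral feature x_st

    s_s = S(x_s)                 # step 3-4: style feature of the source emotion
    c_st = G.encode(x_st)        # step 3-5: re-encode the converted feature
    x_cyc = G.decode(c_st, s_s)  # step 3-6: reconstructed source-emotion spectrum
    return x_st, x_cyc, s_t
```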
Further, the conversion process of step 6 comprises the following steps:
Step 6-1, inputting the spectral feature x_s of the source emotion into the encoding network of the generator G trained in step 3 to obtain the emotion-independent semantic feature G(x_s);
Step 6-2, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion;
Step 6-3, inputting the generated emotion-independent semantic feature G(x_s) and the style feature s_t of the target emotion into the decoding network of the generator G trained in step 3 to obtain the converted-emotion spectral feature x_st.
Further, the style reconstruction loss function of the style encoder S is expressed as:
L_sty = E_{x_s,x_t}[ ‖S(x_t) − S(G(x_s, S(x_t)))‖_1 ]
where E_{x_s,x_t}[·] denotes the expectation of the difference between the style feature of the target emotion generated by the style encoder and the style feature of the converted emotion, ‖·‖_1 denotes the 1-norm, S(·) is the style encoder, S(x_t) is the style feature of the target emotion generated by the style encoder, G(·) is the generator, G(x_s, S(x_t)) is the spectral feature of the converted emotion generated by the generator, S(G(x_s, S(x_t))) is the style feature of the converted emotion generated by the style encoder, x_s is the spectral feature of the source emotion, and x_t is the spectral feature of the target emotion.
Further, the objective function of the StyleGAN-EVC network is expressed as:
L_StyleGAN = L_G + L_D
where L_G is the loss function of the generator and L_D is the loss function of the discriminator;
the loss function L_G of the generator is expressed as:
L_G = L_adv^G + λ_cyc · L_cyc + λ_sty · L_sty
where λ_cyc and λ_sty are a set of regularization hyper-parameters representing the weights of the cycle-consistency loss and the style reconstruction loss, respectively, and L_adv^G, L_cyc and L_sty denote the adversarial loss of the generator, the cycle-consistency loss and the style reconstruction loss of the style encoder, respectively;
the loss function L_D of the discriminator is:
L_D = L_adv^D
where L_adv^D is the adversarial loss of the discriminator.
Further, the fundamental frequency conversion function is:
log f0_st = (σ_t / σ_s) · (log f0_s − μ_s) + μ_t + θ
where μ_s and μ_t denote the means of the logarithmic fundamental frequency features of the source emotion and the target emotion, respectively, σ_s and σ_t denote the corresponding mean square deviations, and θ denotes the fundamental frequency difference compensation vector;
the fundamental frequency difference compensation vector θ is expressed as:
θ = log f0_t′ − μ_t′
where log f0_t′ denotes the fundamental frequency feature obtained from the fundamental frequency feature of the target emotion by linear interpolation or uniform sampling, and μ_t′ denotes the mean of that fundamental frequency feature of the target emotion.
The invention has the following beneficial effects: (1) the invention uses the emotion style features extracted by the style encoder as label information to participate in training; compared with a traditional one-hot vector, which only serves an indicating function and carries little concrete emotional information, this allows the decoding network to learn more emotional style information, improves the sound quality and emotion saturation of the converted speech, and yields better emotional converted speech;
(2) the invention proposes a fundamental frequency difference compensation vector, which corrects the amplitude variation of the emotion and improves the emotion saturation of the converted speech, overcoming the problems that, after conversion with the traditional logarithmic Gaussian normalized fundamental frequency function, the fundamental frequency envelope only rises as a whole and that the mean and mean square deviation cannot accurately describe the amplitude differences between different emotions;
(3) the training network is more efficient and stable and can accomplish emotion voice conversion under the open-set condition, i.e. emotion voice conversion is realized even when the target emotion does not participate in training, and only a small amount of target-emotion corpus is needed at conversion time, which solves the problem that the target emotion must participate in training in practical applications. The invention is therefore an emotion voice conversion method with high sound quality and emotion saturation under open-set conditions.
Drawings
FIG. 1 is a schematic diagram of a model according to an embodiment of the present invention;
FIG. 2 is a network architecture diagram of a stylistic encoder of a model in accordance with an embodiment of the present invention;
FIG. 3 is a network architecture diagram of a generator of a model according to an embodiment of the invention;
FIG. 4 is a network architecture diagram of the discriminator of the model according to an embodiment of the present invention;
fig. 5 is a schematic diagram of the fundamental frequency conversion principle of the model according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a StyleGAN emotion voice conversion method based on fundamental frequency difference compensation. The StyleGAN model is applied to emotion voice conversion: the emotion style features extracted by a style encoder are used as labels and input into the decoder part of the generator, where they are reconstructed together with the content features separated by the encoder of the generator to generate the converted emotional speech, which solves the problem that the training data in traditional models must carry labels. Secondly, in the fundamental frequency conversion part, a fundamental frequency difference compensation vector is introduced on the basis of traditional logarithmic Gaussian normalized fundamental frequency conversion, so that the amplitude variation of the emotion is enhanced and converted speech with emotion saturation is generated; this solves the problem that, after conversion with the traditional logarithmic Gaussian normalized fundamental frequency function, the converted emotional speech only shows an overall rise or fall in pitch while the variation trend of the amplitude remains similar and indistinguishable. Furthermore, the method can realize emotion voice conversion under the open-set condition, i.e. conversion when the target emotion does not appear in the training set; in the conversion stage, only a small amount of target-emotion corpus is needed, which makes the method better suited to practical applications. The proposed emotion voice conversion based on StyleGAN is called StyleGAN-EVC (Emotional Voice Conversion with StyleGAN).
As shown in fig. 1, the method of the present embodiment is divided into two parts: a training part and a conversion part, where the training part is used to obtain the model for conversion and the conversion part realizes the conversion from the source emotion to the target emotion.
The training phase comprises the following steps:
Step 1, obtaining a training corpus. The training corpus comes from the ESD emotion corpus, which contains 10 Chinese speakers and 10 English speakers, with 5 male and 5 female speakers for each language; each speaker has utterances in 5 emotions, namely neutral, angry, happy, sad and surprised, with 350 sentences per emotion per speaker, of which 281 sentences are selected as the training corpus and 69 sentences as the test corpus. In the experiment, the 5 emotion corpora of one English female speaker are selected, with the three emotions neutral, angry and sad used as the training set and the test corpora of all five emotions (neutral, angry, happy, sad and surprised) used as the test set.
Step 2, extracting the spectral envelope feature x, the aperiodic feature AP and the logarithmic fundamental frequency feature log f0 of each emotional utterance from the training corpus through the WORLD vocoder.
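For illustration, a minimal sketch of this extraction step using the pyworld and soundfile packages might look as follows; the file path, frame period and the voiced-frame handling are assumptions for the example, not values prescribed by the invention.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def extract_world_features(wav_path, frame_period_ms=5.0):
    """Extract spectral envelope, aperiodicity and log-F0 with the WORLD vocoder."""
    x, fs = sf.read(wav_path)                                # waveform, sample rate
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs, frame_period=frame_period_ms)  # fundamental frequency
    sp = pw.cheaptrick(x, f0, t, fs)                          # spectral envelope (feature x)
    ap = pw.d4c(x, f0, t, fs)                                 # aperiodic feature AP
    log_f0 = np.log(f0[f0 > 0])                               # log-F0 over voiced frames
    return sp, ap, f0, log_f0, fs
```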
Step 3, the style features extracted by the style encoder are input, as labels, into the decoding part of the generator and fully fused with the content features extracted by the encoding part of the generator through adaptive instance normalization; further, a fundamental frequency difference compensation vector is introduced on the basis of the traditional logarithmic Gaussian normalized fundamental frequency conversion function to enhance the degree of amplitude variation of the emotion. The StyleGAN-EVC network in this embodiment is composed of three parts: a generator G, a discriminator D and a style encoder S.
The objective function of the StyleGAN-EVC network in this embodiment is expressed as:
L_StyleGAN = L_G + L_D
where L_G is the loss function of the generator and L_D is the loss function of the discriminator;
the loss function L_G of the generator is expressed as:
L_G = L_adv^G + λ_cyc · L_cyc + λ_sty · L_sty
where λ_cyc and λ_sty are a set of regularization hyper-parameters representing the weights of the cycle-consistency loss and the style reconstruction loss, respectively, and L_adv^G, L_cyc and L_sty denote the adversarial loss of the generator, the cycle-consistency loss and the style reconstruction loss of the style encoder, respectively;
the loss function L_D of the discriminator is:
L_D = L_adv^D
where L_adv^D is the adversarial loss of the discriminator.
Step 4, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion.
As shown in fig. 2, the style encoder adopts a 1-dimensional convolutional neural network with ReLU activation functions. The style encoder is composed of 5 layers of 1-dimensional convolution modules and 5 layers of 1-dimensional pooling modules, where each 1-dimensional convolution module comprises a convolution and a ReLU activation function, each 1-dimensional pooling module consists of average pooling, and the output layer consists of a fully connected layer.
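A hedged PyTorch sketch of such a style encoder is given below; the channel widths, kernel sizes and style dimension are illustrative assumptions, since the embodiment does not specify them.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    def __init__(self, in_channels=80, hidden=256, style_dim=64):
        super().__init__()
        layers, ch = [], in_channels
        for _ in range(5):
            # one 1-D convolution module (convolution + ReLU) ...
            layers += [nn.Conv1d(ch, hidden, kernel_size=5, padding=2), nn.ReLU()]
            # ... followed by one 1-D average-pooling module
            layers += [nn.AvgPool1d(kernel_size=2)]
            ch = hidden
        self.body = nn.Sequential(*layers)
        self.out = nn.Linear(hidden, style_dim)   # fully connected output layer

    def forward(self, x):
        # x: (batch, feature_bins, frames) spectral feature of one emotion
        h = self.body(x)
        h = h.mean(dim=2)          # pool over the remaining frames
        return self.out(h)         # style feature s
```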
Step 5, inputting the extracted spectral feature x_s of the source emotion together with the style feature s_t of the target emotion obtained in step 4 into the generator for training, making the loss function L_G of the generator as small as possible, to obtain the generated converted-emotion spectral feature x_st.
As shown in fig. 3, the generator adopts a 2-dimensional dynamic convolution network with Mish activation functions and is composed of an encoder and a decoder. The encoding network consists of 7 2-dimensional convolution modules: the first 3 layers are 2-dimensional convolution modules, each comprising a 2-dimensional convolution, instance normalization and a Mish function, and the last 4 layers are dynamic convolution modules, each comprising a dynamic convolution, instance normalization and a Mish function. The decoding network consists of 6 2-dimensional convolution modules: the first 4 layers are dynamic convolution modules, each comprising a dynamic convolution, adaptive instance normalization and a Mish function, and the last 2 layers are 2-dimensional transposed convolution modules, each comprising a transposed convolution, adaptive instance normalization and a Mish function.
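The following simplified PyTorch sketch illustrates the encoder/decoder split and the injection of the style feature through adaptive instance normalization (AdaIN). Ordinary convolutions stand in for the dynamic convolutions, and the layer counts, strides and channel widths are reduced assumptions rather than the exact architecture of fig. 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """Scale/shift instance-normalized content features with the style vector."""
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.affine = nn.Linear(style_dim, num_channels * 2)

    def forward(self, x, s):
        gamma, beta = self.affine(s).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

class Generator(nn.Module):
    def __init__(self, style_dim=64, ch=64):
        super().__init__()
        # encoder: plain convolutions stand in for the conv + dynamic-conv stack
        self.enc = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=1, padding=1), nn.InstanceNorm2d(ch), nn.Mish(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.InstanceNorm2d(ch), nn.Mish(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.InstanceNorm2d(ch), nn.Mish(),
        )
        # decoder: AdaIN fuses the style feature with the content feature
        self.dec_conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.adain1 = AdaIN(style_dim, ch)
        self.dec_up1 = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.adain2 = AdaIN(style_dim, ch)
        self.dec_up2 = nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1)

    def encode(self, x):
        return self.enc(x)                         # emotion-independent content feature

    def decode(self, h, s):
        h = F.mish(self.adain1(self.dec_conv(h), s))
        h = F.mish(self.adain2(self.dec_up1(h), s))
        return self.dec_up2(h)                     # converted spectral feature

    def forward(self, x, s):
        return self.decode(self.encode(x), s)
```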
Step 6, inputting the converted-emotion spectral feature x_st obtained in step 5 and the target-emotion spectral feature x_t obtained in step 2 into the discriminator, and training the discriminator so that the adversarial loss L_adv^D of the discriminator is as small as possible.
As shown in fig. 4, the discriminator is composed of a 5-layer 2-dimensional convolution module and an output layer. The 2-dimensional convolution module of each layer comprises 2-dimensional convolution and a LeakyReLU function, and the number of convolution channels of the output layer of the discriminator is set to be 1.
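A minimal sketch of such a discriminator follows; the strides, kernel sizes and channel widths are assumptions, only the 5-layer 2-D convolution + LeakyReLU structure and the single-channel output follow the description above.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(5):
            layers += [nn.Conv2d(in_ch, ch, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            in_ch, ch = ch, ch * 2
        self.body = nn.Sequential(*layers)
        self.out = nn.Conv2d(in_ch, 1, kernel_size=1)   # one output channel

    def forward(self, x):
        return self.out(self.body(x))   # patch-level real/fake scores
```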
The loss function of the discriminator is:
L_D = L_adv^D
where L_adv^D is the adversarial loss of the discriminator:
L_adv^D = −E_{x_s}[ log D(x_s) ] − E_{x_s,s_t}[ log(1 − D(G(x_s, s_t))) ]
where D(x_s) denotes the discriminator D judging the real spectral feature, s_t denotes the style feature of the target emotion generated by the style encoder S, i.e. S(x_t) = s_t, G(x_s, s_t) denotes the spectral feature of the converted emotion generated by the generator G, D(G(x_s, s_t)) denotes the discriminator judging the generated spectral feature, E_{x_s,s_t}[·] denotes the expectation over the distribution generated by the generator G, and E_{x_s}[·] denotes the expectation over the real distribution;
the optimization target is:
min_D L_adv^D
step 7, the frequency spectrum characteristic x of the target emotion obtained in the step 5 is usedstInputting the data into the coding network of the generator G again to obtain the semantic features G (x) irrelevant to emotionst) Spectral feature x of source emotionsInputting the data into a style encoder S to obtain style characteristics S of the source emotionsThe semantic feature G (x) obtainedst) Style characteristics s of source emotionsThe signals are input into a decoding network of a generator G together for training, the loss function of the generator G is minimized in the training process, and the reconstructed frequency spectrum characteristic of the source emotion is obtained
Figure BDA0003454834370000094
And minimizing loss functions of the generator in the training process, wherein the loss functions comprise the countermeasure loss of the generator, the cycle consistency loss and the style reconstruction loss of the style encoder. Wherein the training cycle consistency loss is to make the source emotion frequency spectrum characteristic xsAfter passing through the generator G, the reconstructed source emotion frequency spectrum characteristics
Figure BDA0003454834370000095
Can be mixed with xsThe training style reconstruction loss is to restrict the style encoder to generate style features s more consistent with the target emotiont
The loss function of the generator is:
L_G = L_adv^G + λ_cyc · L_cyc + λ_sty · L_sty
The optimization target is:
min_G L_G
where λ_cyc and λ_sty are a set of regularization hyper-parameters representing the weights of the cycle-consistency loss and the style reconstruction loss, respectively.
L_adv^G represents the adversarial loss of the generator in the GAN:
L_adv^G = E_{x_s,s_t}[ log(1 − D(G(x_s, s_t))) ]
where E_{x_s,s_t}[·] denotes the expectation over the distribution generated by the generator, s_t denotes the style feature of the target emotion generated by the style encoder, S(x_t) = s_t, G(x_s, s_t) denotes the spectral feature of the converted emotion generated by the generator, and D(G(x_s, s_t)) denotes the discriminator judging the generated spectral feature against the real target spectral feature. Together with the adversarial loss L_adv^D of the discriminator, it forms the usual adversarial loss in a GAN, which is used to judge whether the spectrum input to the discriminator is a real spectrum or a generated one. During training, L_adv^G is made as small as possible, so that the generator is continuously optimized until it generates spectral features G(x_s, s_t) realistic enough that the discriminator can hardly distinguish real from fake.
L_cyc is the cycle-consistency loss of the generator G:
L_cyc = E_{x_s,s_t,s_s}[ ‖G(G(x_s, s_t), s_s) − x_s‖_1 ]
where s_s denotes the style feature of the source emotion, i.e. S(x_s) = s_s, G(G(x_s, s_t), s_s) denotes the spectral feature of the reconstructed source emotion generated by the generator, E[·] denotes the expected loss between the reconstructed source-emotion spectrum and the real source-emotion spectrum, and ‖·‖_1 denotes the 1-norm. When training the generator, L_cyc is made as small as possible, so that after the converted spectral feature G(x_s, s_t) and the style feature s_s of the source emotion are fed into the generator again, the resulting reconstructed source-emotion spectral feature is as similar as possible to x_s. Training with L_cyc effectively guarantees that the semantic features of the emotional speech are not lost after being encoded by the generator.
L_sty is the style reconstruction loss of the style encoder S, used to optimize the style feature s_t:
L_sty = E_{x_s,x_t}[ ‖s_t − S(G(x_s, s_t))‖_1 ]
where s_t denotes the style feature of the target emotion generated by the style encoder S, i.e. S(x_t) = s_t, G(x_s, s_t) denotes the spectral feature of the converted emotion generated by the generator, ‖·‖_1 denotes the 1-norm, and S(G(x_s, s_t)) denotes the style feature of the converted emotion generated by the style encoder S. The converted spectral feature G(x_s, s_t) is fed into the style encoder S, and the absolute difference between the resulting style feature and the style feature s_t of the target emotion generated by the style encoder is taken; during training, L_sty is made as small as possible, so that the style feature s_t of the target emotion generated by the style encoder S fully expresses the characteristics of the target emotion.
Step 8, repeating steps 4 to 7 until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network. The appropriate number of iterations differs with the specific configuration of the neural network and the performance of the experimental equipment; in this experiment the number of iterations was set to 200,000.
Step 9, using logarithmic Gaussian normalized fundamental frequency conversion: computing the mean and mean square deviation of the logarithmic fundamental frequency of each emotion, and converting the logarithmic fundamental frequency feature log f0_s of the source emotion into the logarithmic fundamental frequency feature of the target emotion; then obtaining the fundamental frequency difference compensation vector from the mean of the fundamental frequency features of the target emotion, and introducing it on the basis of the traditional logarithmic Gaussian normalized fundamental frequency conversion function to obtain the final fundamental frequency conversion function.
As shown in fig. 5, the fundamental frequency conversion function is:
log f0_st = (σ_t / σ_s) · (log f0_s − μ_s) + μ_t + θ
where (σ_t / σ_s) · (log f0_s − μ_s) + μ_t is the logarithmic fundamental frequency feature of the converted emotion obtained by logarithmic Gaussian normalized conversion, and θ is the fundamental frequency difference compensation vector:
θ = log f0_t′ − μ_t′
where μ_s and μ_t denote the means of the logarithmic fundamental frequency features of the source emotion and the target emotion, respectively, and σ_s and σ_t denote the corresponding mean square deviations. The sentences selected for the source emotion and the target emotion are non-parallel corpora, so the fundamental frequency feature of the target emotion is not aligned with that of the source emotion; if the dimension of the source-emotion fundamental frequency feature is larger than that of the target-emotion fundamental frequency feature, uniform sampling is adopted, otherwise linear interpolation is carried out, to obtain a target-emotion fundamental frequency feature log f0_t′ aligned with the source emotion, and μ_t′ denotes the mean of this aligned target-emotion fundamental frequency feature.
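A numpy sketch of this fundamental frequency conversion is given below. The explicit form of the compensation vector (deviation of the aligned target log-F0 envelope from its mean) is an interpretation of the description above rather than a formula quoted from the patent, and a single interpolation call stands in for the interpolation/sampling case split.

```python
import numpy as np

def convert_f0(log_f0_s, log_f0_t):
    """log_f0_s, log_f0_t: voiced-frame log-F0 sequences of source and target emotion."""
    mu_s, sigma_s = log_f0_s.mean(), log_f0_s.std()
    mu_t, sigma_t = log_f0_t.mean(), log_f0_t.std()

    # align the target log-F0 to the source length by interpolation / sampling
    idx = np.linspace(0, len(log_f0_t) - 1, num=len(log_f0_s))
    log_f0_t_aligned = np.interp(idx, np.arange(len(log_f0_t)), log_f0_t)

    theta = log_f0_t_aligned - log_f0_t_aligned.mean()        # compensation vector θ
    log_f0_lg = (log_f0_s - mu_s) * sigma_t / sigma_s + mu_t  # log-Gaussian normalization
    return np.exp(log_f0_lg + theta)                          # converted F0
```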
The conversion phase comprises the following steps:
and step 10, extracting frequency spectrum characteristics, aperiodic characteristics and logarithmic fundamental frequency characteristics of the source emotion and the target emotion by using a WORLD encoder.
Step 11, inputting the spectral feature of the target emotion extracted in step 10 into the style encoder S to obtain the style feature s_t of the target emotion.
Step 12, inputting the spectral feature x_s of the source emotion obtained in step 10 together with the style feature of the target emotion obtained in step 11 into the trained StyleGAN-EVC network to reconstruct the converted-emotion spectral feature x_st.
Step 13, converting the logarithmic fundamental frequency feature of the source emotion extracted in step 10 into the fundamental frequency feature of the target emotion through the fundamental frequency conversion function obtained in step 9.
Step 14, synthesizing the converted target-emotion speech with the WORLD vocoder from the converted-emotion spectral feature x_st obtained in step 12, the converted-emotion fundamental frequency feature obtained in step 13 and the aperiodic feature extracted in step 10.
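An end-to-end sketch of the conversion stage (steps 10 to 14) is shown below, under the same assumptions as the earlier snippets; feature reshaping/normalization for the network and voiced/unvoiced bookkeeping are simplified, and convert_f0 refers to the F0 sketch above.

```python
import numpy as np
import pyworld as pw
import torch

def convert_utterance(G, S, convert_f0, wav_s, wav_t, fs, frame_period=5.0):
    """wav_s / wav_t: float64 waveforms of the source- and target-emotion sentences."""
    # step 10: WORLD analysis of both utterances
    f0_s, t_s = pw.harvest(wav_s, fs, frame_period=frame_period)
    sp_s = pw.cheaptrick(wav_s, f0_s, t_s, fs)
    ap_s = pw.d4c(wav_s, f0_s, t_s, fs)
    f0_t, t_t = pw.harvest(wav_t, fs, frame_period=frame_period)
    sp_t = pw.cheaptrick(wav_t, f0_t, t_t, fs)

    # steps 11-12: style feature of the target emotion, spectrum conversion
    x_s = torch.from_numpy(sp_s.T).float()[None, None]   # (1, 1, bins, frames), assumed layout
    x_t = torch.from_numpy(sp_t.T).float()[None, None]
    with torch.no_grad():
        s_t = S(x_t.squeeze(1))                           # assumed (1, bins, frames) input for S
        sp_st = G.decode(G.encode(x_s), s_t)
    sp_st = sp_st.squeeze().T.double().numpy()

    # step 13: log-F0 conversion with difference compensation on voiced frames
    f0_st = np.zeros_like(f0_s)
    voiced = f0_s > 0
    f0_st[voiced] = convert_f0(np.log(f0_s[voiced]), np.log(f0_t[f0_t > 0]))

    # step 14: WORLD synthesis of the converted emotional speech
    return pw.synthesize(f0_st, np.ascontiguousarray(sp_st), ap_s, fs, frame_period)
```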
The above description is an exemplary embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (7)

1. A StyleGAN emotion voice conversion method based on fundamental frequency difference compensation is characterized by comprising a training phase and a conversion phase, wherein the training phase comprises the following steps:
step 1, obtaining a training corpus, wherein the training corpus consists of corpuses of various emotions of a speaker, and the emotions comprise source emotions and target emotions;
step 2, extracting frequency spectrum characteristics of different emotion voices from the training corpus to serve as acoustic characteristic vectors;
step 3, inputting the obtained acoustic feature vector into the StyleGAN-EVC network for training, and continuously optimizing its objective function until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network; the StyleGAN-EVC network comprises a generator G, a discriminator D and a style encoder S;
the generator G is divided into an encoder and a decoder, the encoder is used for generating content characteristics, and the decoder is used for reconstructing the obtained content characteristics and the style characteristics extracted by the style encoder S to generate reconstructed voice;
step 4, constructing a fundamental frequency conversion function from the source emotion to the target emotion, introducing a fundamental frequency difference compensation vector on the basis of the traditional logarithmic Gaussian normalized fundamental frequency conversion, and constructing a final fundamental frequency conversion function;
the conversion phase comprises the following steps:
step 5, selecting speech of a speaker in different emotions as the corpus to be converted, and respectively extracting, from the speech to be converted, the source-emotion Mel spectrum feature x_s and the target-emotion Mel spectrum feature x_t, the corresponding logarithmic fundamental frequency features log f0_s and log f0_t, and the corresponding aperiodic features AP_s and AP_t as acoustic feature vectors;
step 6, inputting the source-emotion spectral feature x_s and the target-emotion spectral feature x_t into the StyleGAN-EVC network trained in step 3 to reconstruct the converted-emotion spectral feature x_st;
step 7, converting the source-emotion logarithmic fundamental frequency feature log f0_s extracted in step 5 into the fundamental frequency feature f0_st of the target emotion through the fundamental frequency conversion function obtained in step 4;
step 9, synthesizing the converted emotional speech through a WORLD vocoder from the converted-emotion spectral feature x_st generated in step 6, the converted-emotion fundamental frequency feature f0_st obtained in step 7, and the source-emotion aperiodic feature AP_s extracted in step 5.
2. The method of StyleGAN emotion speech conversion based on fundamental frequency difference compensation as claimed in claim 1, wherein said style coder S is composed of 5 layers of 1-dimensional pooling module and 5 layers of 1-dimensional convolution module, wherein each layer of 1-dimensional pooling module is composed of average pooling, each layer of 1-dimensional convolution module comprises convolution and ReLU activation functions, and the output layer is composed of fully connected layers.
3. The StyleGAN emotion voice conversion method based on fundamental frequency difference compensation as claimed in claim 1, wherein the training process of step 3 comprises the following steps:
step 3-1, inputting the spectral feature x_s of the source emotion into the encoding network of the generator G to obtain the emotion-independent semantic feature G(x_s);
step 3-2, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion;
step 3-3, inputting the generated semantic feature G(x_s) and the style feature s_t of the target emotion into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the converted-emotion spectral feature x_st;
step 3-4, inputting the spectral feature x_s of the source emotion into the style encoder S to obtain the style feature s_s of the source emotion;
step 3-5, inputting the generated converted-emotion spectral feature x_st into the encoding network of the generator G again to obtain the emotion-independent semantic feature G(x_st);
step 3-6, inputting the generated semantic feature G(x_st) and the style feature s_s of the source emotion into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the reconstructed spectral feature x̂_s of the source emotion;
step 3-7, inputting the converted-emotion spectral feature x_st generated in step 3-3 into the discriminator D for training, minimizing the loss function of the discriminator D;
step 3-8, inputting the converted-emotion spectral feature x_st generated in step 3-3 into the style encoder S for training, minimizing the style loss function of the style encoder S;
step 3-9, returning to step 3-1 and repeating the above steps until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network.
4. The StyleGAN emotion voice conversion method based on fundamental frequency difference compensation as claimed in claim 1, wherein the conversion process of step 6 comprises the following steps:
step 6-1, inputting the spectral feature x_s of the source emotion into the encoding network of the generator G trained in step 3 to obtain the emotion-independent semantic feature G(x_s);
step 6-2, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion;
step 6-3, inputting the generated emotion-independent semantic feature G(x_s) and the style feature s_t of the target emotion into the decoding network of the generator G trained in step 3 to obtain the converted-emotion spectral feature x_st.
5. The StyleGAN emotion voice conversion method based on fundamental frequency difference compensation as claimed in claim 1, wherein the style reconstruction loss function of the style encoder S is expressed as:
L_sty = E_{x_s,x_t}[ ‖S(x_t) − S(G(x_s, S(x_t)))‖_1 ]
where E_{x_s,x_t}[·] denotes the expectation of the difference between the style feature of the target emotion generated by the style encoder and the style feature of the converted emotion, ‖·‖_1 denotes the 1-norm, S(·) is the style encoder, S(x_t) is the style feature of the target emotion generated by the style encoder, G(·) is the generator, G(x_s, S(x_t)) is the spectral feature of the converted emotion generated by the generator, S(G(x_s, S(x_t))) is the style feature of the converted emotion generated by the style encoder, x_s is the spectral feature of the source emotion, and x_t is the spectral feature of the target emotion.
6. The method for StyleGAN emotion speech conversion based on fundamental frequency difference compensation as claimed in claim 1, wherein the objective function of the StyleGAN-EVC network is expressed as:
L_StyleGAN = L_G + L_D
where L_G is the loss function of the generator and L_D is the loss function of the discriminator;
the loss function L_G of the generator is expressed as:
L_G = L_adv^G + λ_cyc · L_cyc + λ_sty · L_sty
where λ_cyc and λ_sty are a set of regularization hyper-parameters representing the weights of the cycle-consistency loss and the style reconstruction loss, respectively, and L_adv^G, L_cyc and L_sty denote the adversarial loss of the generator, the cycle-consistency loss and the style reconstruction loss of the style encoder, respectively;
the loss function L_D of the discriminator is:
L_D = L_adv^D
where L_adv^D is the adversarial loss of the discriminator.
7. The method for StyleGAN emotion speech conversion based on fundamental frequency difference compensation as claimed in claim 1, wherein the fundamental frequency conversion function is:
log f0_st = (σ_t / σ_s) · (log f0_s − μ_s) + μ_t + θ
where μ_s and μ_t denote the means of the logarithmic fundamental frequency features of the source emotion and the target emotion, respectively, σ_s and σ_t denote the corresponding mean square deviations, and θ denotes the fundamental frequency difference compensation vector;
the fundamental frequency difference compensation vector θ is expressed as:
θ = log f0_t′ − μ_t′
where log f0_t′ denotes the fundamental frequency feature obtained from the fundamental frequency feature of the target emotion by linear interpolation or uniform sampling, and μ_t′ denotes the mean of that fundamental frequency feature of the target emotion.
Application CN202210004168.4A, priority date 2022-01-04, filing date 2022-01-04: StyleGAN emotion voice conversion method based on fundamental frequency difference compensation (pending).
Publication CN114299917A, publication date 2022-04-08, family ID 80975084, China.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294970A (en) * 2022-10-09 2022-11-04 苏州大学 Voice conversion method, device and storage medium for pathological voice
CN116072154A (en) * 2023-03-07 2023-05-05 华南师范大学 Speech emotion recognition method, device and equipment based on data enhancement



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination