CN114299917A - StyleGAN emotion voice conversion method based on fundamental frequency difference compensation


Info

Publication number
CN114299917A
Authority
CN
China
Prior art keywords
emotion
style
fundamental frequency
generator
stylegan
Legal status
Pending
Application number
CN202210004168.4A
Other languages
Chinese (zh)
Inventor
李燕萍
于杰
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date: 2022-01-04
Filing date: 2022-01-04
Publication date: 2022-04-08
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202210004168.4A
Publication of CN114299917A

Abstract

The invention discloses a StyleGAN (style-based generative adversarial network) emotion voice conversion method based on fundamental frequency difference compensation. First, the emotion style features extracted by a style encoder are provided as label information, and adaptive instance normalization is adopted to fuse the emotion style features with the content features, so that the generator can fully learn the style characteristics of the target emotion and can convert between emotions that do not appear in the training set, i.e. conversion between arbitrary emotions under the open-set condition is accomplished. Furthermore, fundamental frequency difference compensation is proposed on the basis of traditional logarithmic Gaussian normalized fundamental frequency conversion, which enhances the amplitude differences between different emotions and solves the problems that, after traditional logarithmic Gaussian normalized fundamental frequency conversion, the fundamental frequency only rises as a whole and that the mean and mean square deviation cannot accurately describe the differences in fundamental frequency envelope amplitude between different emotions. The method effectively improves the emotion saturation of the converted speech and realizes high-quality emotion voice conversion under the open-set condition.

Description

StyleGAN emotion voice conversion method based on fundamental frequency difference compensation
Technical Field
The invention relates to the technical field of voice conversion, in particular to a StyleGAN emotion voice conversion method based on fundamental frequency difference compensation.
Background
Voice conversion is an important research branch in the field of speech signal processing; it aims to convert the individual characteristics of a source speaker into those of a target speaker while keeping the speech content unchanged. Speech conveys information through language and prosody, and prosody not only affects the syntactic and semantic interpretation of an utterance (linguistic prosody) but also conveys the speaker's emotional state (emotional prosody). Traditional voice conversion mainly focuses on converting the speaker's personality characteristics and pays less attention to converting the speaker's prosody. Emotion voice conversion converts the emotional prosody of the source speech into the emotional prosody of the target while keeping the original speech content and the speaker's individual characteristics unchanged.
Early emotion voice conversion obtained the conversion result by training a spectral mapping between source and target sentences. J. Tao et al. proposed decomposing the pitch contour of the source speech into a hierarchical structure with a classification and regression tree and then converting it with a Gaussian Mixture Model (GMM) and a regression-based clustering method; R. Aihara et al. then proposed an exemplar-based conversion method that encodes the source speech with parallel exemplars and synthesizes the target speech; Ming et al. extended this approach into a unified exemplar-based emotion voice conversion framework that jointly learns the mapping of spectral features and Continuous Wavelet Transform (CWT)-based fundamental frequency features. With the emergence of deep learning, the performance of voice conversion has improved remarkably, and methods based on Deep Neural Networks (DNN), Deep Belief Networks (DBN), Highway Networks and Deep Bidirectional Long Short-Term Memory networks (DBLSTM) have been proposed one after another, realizing better conversion of spectral and prosodic features.
The emotion voice conversion methods discussed above are basically conversions under parallel corpora, but collecting a large amount of parallel emotional corpus is often difficult, time-consuming and labor-intensive; moreover, emotion voice conversion methods under the parallel-corpus condition usually require an alignment operation in the training stage, which inevitably introduces additional distortion and affects the performance of the conversion model. Considering the universality and practicality of an emotion voice conversion system, research on emotion voice conversion under the non-parallel text condition therefore has greater application value and practical significance. Accordingly, methods have been proposed to relax the requirement for parallel data, such as frameworks based on the Cycle-Consistent Generative Adversarial Network (CycleGAN), the Variational Auto-Encoder (VAE) and the Star Generative Adversarial Network (StarGAN), but these methods focus only on spectral feature conversion, while the fundamental frequency feature (F0) is simply converted with traditional logarithmic Gaussian normalization. Later, Kun Zhou et al. proposed a CWT-based CycleGAN and a CWT-based conditional variational auto-encoder generative adversarial network (VAWGAN), in which F0 is expanded in dimension through the CWT and fed into the model for training, but the conversion function still adopts traditional logarithmic Gaussian normalization. Georgios Rizos et al. then proposed an improved logarithmic Gaussian normalized fundamental frequency transfer function that introduces the difference between the mean and mean square deviation of the source emotion and the target emotion on the basis of traditional logarithmic Gaussian normalization; however, when the amplitude of the emotion changes greatly, the mean and mean square deviation cannot accurately describe the specific fluctuation difference of the fundamental frequency envelope amplitude between the two emotions, and the amplitude of the source emotion can only be shifted as a whole into the amplitude range of the target emotion.
Furthermore, most of the methods mentioned above perform emotion voice conversion in the closed-set situation, i.e. the corpus of the target emotion participates in training and the training data are labeled; in this case the quality of the converted speech is relatively good. In practical application scenarios, however, the target emotion may have only a small amount of corpus, or even a single utterance, participating in training, or may not participate in training at all. Such problems can be classified as emotion voice conversion in the open-set situation, i.e. arbitrary emotion voice conversion. Under the open-set situation, how to improve the sound quality and emotion saturation of the converted emotional speech has become a research hotspot and difficulty in this field.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a StyleGAN emotion voice conversion method based on fundamental frequency difference compensation. A style encoder is used to extract emotion style features that participate in training as labels, which realizes emotion voice conversion under the open-set condition and solves the problems that existing methods require labeled training data and are only suitable for the closed-set condition. Further, a fundamental frequency difference compensation vector is proposed, which solves the problems that, after traditional logarithmic Gaussian normalized fundamental frequency conversion, the converted fundamental frequency only shows an overall rise and that the mean and mean square deviation cannot accurately describe the amplitude differences between different emotions; this improves the emotion saturation of the converted speech and realizes high-quality emotion voice conversion under the open-set condition.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention firstly provides a StyleGAN emotion voice conversion method based on fundamental frequency difference compensation, which comprises a training stage and a conversion stage, wherein the training stage comprises the following steps:
step 1, obtaining a training corpus, wherein the training corpus consists of corpuses of various emotions of a speaker, and the emotions comprise source emotions and target emotions;
step 2, extracting frequency spectrum characteristics of different emotion voices from the training corpus to serve as acoustic characteristic vectors;
step 3, inputting the obtained acoustic feature vector into the StyleGAN-EVC network for training, and continuously optimizing its objective function until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network; the StyleGAN-EVC network comprises a generator G, a discriminator D and a style encoder S;
the generator G is divided into an encoder and a decoder, the encoder is used for generating content characteristics, and the decoder is used for reconstructing the obtained content characteristics and the style characteristics extracted by the style encoder S to generate reconstructed voice;
step 4, constructing a fundamental frequency conversion function from the source emotion to the target emotion, introducing a fundamental frequency difference compensation vector on the basis of the traditional logarithmic Gaussian normalized fundamental frequency conversion, and constructing a final fundamental frequency conversion function;
The conversion stage comprises the following steps:
step 5, selecting speech of a speaker in different emotions as the corpus to be converted, and respectively extracting, from the speech to be converted, the source-emotion Mel spectrum feature x_s and the target-emotion Mel spectrum feature x_t, the corresponding logarithmic fundamental frequency features log f0_s and log f0_t, and the corresponding aperiodic features AP_s and AP_t as acoustic feature vectors;
step 6, inputting the source-emotion spectral feature x_s and the target-emotion spectral feature x_t into the StyleGAN-EVC network trained in step 3 to reconstruct the converted-emotion spectral feature x_st;
step 7, converting the source-emotion logarithmic fundamental frequency feature log f0_s extracted in step 5 into the fundamental frequency feature f0_st of the target emotion through the fundamental frequency conversion function obtained in step 4;
step 9, synthesizing the converted emotional speech through a WORLD vocoder from the converted-emotion spectral feature x_st generated in step 6, the converted-emotion fundamental frequency feature f0_st obtained in step 7, and the source-emotion aperiodic feature AP_s extracted in step 5.
Further, the style encoder S is composed of a 5-layer 1-dimensional pooling module and a 5-layer 1-dimensional convolution module, wherein each layer of 1-dimensional pooling module is composed of average pooling, each layer of 1-dimensional convolution module includes convolution and ReLU activation functions, and the output layer is composed of fully-connected layers.
Further, the training process of step 3 includes the following steps:
Step 3-1, inputting the spectral feature x_s of the source emotion into the encoding network of the generator G to obtain the emotion-independent semantic feature G(x_s);
Step 3-2, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion;
Step 3-3, inputting the generated semantic feature G(x_s) and the style feature s_t of the target emotion into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the converted-emotion spectral feature x_st;
Step 3-4, inputting the spectral feature x_s of the source emotion into the style encoder S to obtain the style feature s_s of the source emotion;
Step 3-5, inputting the generated converted-emotion spectral feature x_st into the encoding network of the generator G again to obtain the emotion-independent semantic feature G(x_st);
Step 3-6, inputting the generated semantic feature G(x_st) and the style feature s_s of the source emotion into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the reconstructed spectral feature x̂_s of the source emotion;
Step 3-7, inputting the converted-emotion spectral feature x_st generated in step 3-3 into the discriminator D for training, minimizing the loss function of the discriminator D;
Step 3-8, inputting the converted-emotion spectral feature x_st generated in step 3-3 into the style encoder S for training, minimizing the style loss function of the style encoder S;
Step 3-9, returning to step 3-1 and repeating the above steps until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network.
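For clarity, the data flow of steps 3-1 to 3-6 can be sketched as follows. This is a hedged, non-authoritative PyTorch-style sketch: the encode/decode split on G, the callable style encoder S and all names are assumptions read off the description above, not code taken from the patent.

```python
def forward_flow(G, S, x_s, x_t):
    """One pass of the training data flow; G and S are assumed neural modules."""
    s_t = S(x_t)                 # step 3-2: style feature of the target emotion
    c_s = G.encode(x_s)          # step 3-1: emotion-independent content/semantic feature G(x_s)
    x_st = G.decode(c_s, s_t)    # step 3-3: converted-emotion spectral feature x_st

    s_s = S(x_s)                 # step 3-4: style feature of the source emotion
    c_st = G.encode(x_st)        # step 3-5: re-encode the converted feature
    x_cyc = G.decode(c_st, s_s)  # step 3-6: reconstructed source-emotion spectrum
    return x_st, x_cyc, s_t
```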
Further, the conversion process of step 6 comprises the following steps:
Step 6-1, inputting the spectral feature x_s of the source emotion into the encoding network of the generator G trained in step 3 to obtain the emotion-independent semantic feature G(x_s);
Step 6-2, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion;
Step 6-3, inputting the generated emotion-independent semantic feature G(x_s) and the style feature s_t of the target emotion into the decoding network of the generator G trained in step 3 to obtain the converted-emotion spectral feature x_st.
Further, the style reconstruction loss function of the style encoder S is expressed as:
L_sty = E_{x_s,x_t}[ ‖S(x_t) − S(G(x_s, S(x_t)))‖_1 ]
where E_{x_s,x_t}[·] denotes the expectation of the difference between the style feature of the target emotion generated by the style encoder and the style feature of the converted emotion, ‖·‖_1 denotes the 1-norm, S(·) is the style encoder, S(x_t) is the style feature of the target emotion generated by the style encoder, G(·) is the generator, G(x_s, S(x_t)) is the spectral feature of the converted emotion generated by the generator, S(G(x_s, S(x_t))) is the style feature of the converted emotion generated by the style encoder, x_s is the spectral feature of the source emotion, and x_t is the spectral feature of the target emotion.
Further, the objective function of the StyleGAN-EVC network is expressed as:
L_StyleGAN = L_G + L_D
where L_G is the loss function of the generator and L_D is the loss function of the discriminator;
the loss function L_G of the generator is expressed as:
L_G = L_adv^G + λ_cyc · L_cyc + λ_sty · L_sty
where λ_cyc and λ_sty are a set of regularization hyper-parameters representing the weights of the cycle-consistency loss and the style reconstruction loss, respectively, and L_adv^G, L_cyc and L_sty denote the adversarial loss of the generator, the cycle-consistency loss and the style reconstruction loss of the style encoder, respectively;
the loss function L_D of the discriminator is:
L_D = L_adv^D
where L_adv^D is the adversarial loss of the discriminator.
Further, the fundamental frequency conversion function is:
log f0_st = (σ_t / σ_s) · (log f0_s − μ_s) + μ_t + θ
where μ_s and μ_t denote the means of the logarithmic fundamental frequency features of the source emotion and the target emotion, respectively, σ_s and σ_t denote the corresponding mean square deviations, and θ denotes the fundamental frequency difference compensation vector;
the fundamental frequency difference compensation vector θ is expressed as:
θ = log f0_t′ − μ_t′
where log f0_t′ denotes the fundamental frequency feature obtained from the fundamental frequency feature of the target emotion by linear interpolation or uniform sampling, and μ_t′ denotes the mean of that fundamental frequency feature of the target emotion.
The invention has the following beneficial effects: (1) the invention uses the emotion style features extracted by the style encoder as label information to participate in training; compared with a traditional one-hot vector, which only serves an indicating function and carries little concrete emotional information, this allows the decoding network to learn more emotional style information, improves the sound quality and emotion saturation of the converted speech, and yields better emotional converted speech;
(2) the invention proposes a fundamental frequency difference compensation vector, which corrects the amplitude variation of the emotion and improves the emotion saturation of the converted speech, overcoming the problems that, after conversion with the traditional logarithmic Gaussian normalized fundamental frequency function, the fundamental frequency envelope only rises as a whole and that the mean and mean square deviation cannot accurately describe the amplitude differences between different emotions;
(3) the training network is more efficient and stable and can accomplish emotion voice conversion under the open-set condition, i.e. emotion voice conversion is realized even when the target emotion does not participate in training, and only a small amount of target-emotion corpus is needed at conversion time, which solves the problem that the target emotion must participate in training in practical applications. The invention is therefore an emotion voice conversion method with high sound quality and emotion saturation under open-set conditions.
Drawings
FIG. 1 is a schematic diagram of a model according to an embodiment of the present invention;
FIG. 2 is a network architecture diagram of a stylistic encoder of a model in accordance with an embodiment of the present invention;
FIG. 3 is a network architecture diagram of a generator of a model according to an embodiment of the invention;
FIG. 4 is a network architecture diagram of the discriminator of the model according to an embodiment of the present invention;
fig. 5 is a schematic diagram of the fundamental frequency conversion principle of the model according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a StyleGAN emotion voice conversion method based on fundamental frequency difference compensation. The StyleGAN model is applied to emotion voice conversion: the emotion style features extracted by a style encoder are used as labels and input into the decoder part of the generator, where they are reconstructed together with the content features separated by the encoder of the generator to generate the converted emotional speech, which solves the problem that the training data in traditional models must carry labels. Secondly, in the fundamental frequency conversion part, a fundamental frequency difference compensation vector is introduced on the basis of traditional logarithmic Gaussian normalized fundamental frequency conversion, so that the amplitude variation of the emotion is enhanced and converted speech with emotion saturation is generated; this solves the problem that, after conversion with the traditional logarithmic Gaussian normalized fundamental frequency function, the converted emotional speech only shows an overall rise or fall in pitch while the variation trend of the amplitude remains similar and indistinguishable. Furthermore, the method can realize emotion voice conversion under the open-set condition, i.e. conversion when the target emotion does not appear in the training set; in the conversion stage, only a small amount of target-emotion corpus is needed, which makes the method better suited to practical applications. The proposed emotion voice conversion based on StyleGAN is called StyleGAN-EVC (Emotional Voice Conversion with StyleGAN).
As shown in fig. 1, the method of the present embodiment is divided into two parts: a training part and a conversion part, where the training part is used to obtain the model for conversion and the conversion part realizes the conversion from the source emotion to the target emotion.
The training phase comprises the following steps:
Step 1, obtaining a training corpus. The training corpus comes from the ESD emotion corpus, which contains 10 Chinese speakers and 10 English speakers, with 5 male and 5 female speakers for each language; each speaker has utterances in 5 emotions, namely neutral, angry, happy, sad and surprised, with 350 sentences per emotion per speaker, of which 281 sentences are selected as the training corpus and 69 sentences as the test corpus. In the experiment, the 5 emotion corpora of one English female speaker are selected, with the three emotions neutral, angry and sad used as the training set and the test corpora of all five emotions (neutral, angry, happy, sad and surprised) used as the test set.
Step 2, extracting the spectral envelope feature x, the aperiodic feature AP and the logarithmic fundamental frequency feature log f0 of each emotional utterance from the training corpus through the WORLD vocoder.
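For illustration, a minimal sketch of this extraction step using the pyworld and soundfile packages might look as follows; the file path, frame period and the voiced-frame handling are assumptions for the example, not values prescribed by the invention.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def extract_world_features(wav_path, frame_period_ms=5.0):
    """Extract spectral envelope, aperiodicity and log-F0 with the WORLD vocoder."""
    x, fs = sf.read(wav_path)                                # waveform, sample rate
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs, frame_period=frame_period_ms)  # fundamental frequency
    sp = pw.cheaptrick(x, f0, t, fs)                          # spectral envelope (feature x)
    ap = pw.d4c(x, f0, t, fs)                                 # aperiodic feature AP
    log_f0 = np.log(f0[f0 > 0])                               # log-F0 over voiced frames
    return sp, ap, f0, log_f0, fs
```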
Step 3, the style features extracted by the style encoder are input, as labels, into the decoding part of the generator and fully fused with the content features extracted by the encoding part of the generator through adaptive instance normalization; further, a fundamental frequency difference compensation vector is introduced on the basis of the traditional logarithmic Gaussian normalized fundamental frequency conversion function to enhance the degree of amplitude variation of the emotion. The StyleGAN-EVC network in this embodiment is composed of three parts: a generator G, a discriminator D and a style encoder S.
The objective function of the StyleGAN-EVC network in this embodiment is expressed as:
L_StyleGAN = L_G + L_D
where L_G is the loss function of the generator and L_D is the loss function of the discriminator;
the loss function L_G of the generator is expressed as:
L_G = L_adv^G + λ_cyc · L_cyc + λ_sty · L_sty
where λ_cyc and λ_sty are a set of regularization hyper-parameters representing the weights of the cycle-consistency loss and the style reconstruction loss, respectively, and L_adv^G, L_cyc and L_sty denote the adversarial loss of the generator, the cycle-consistency loss and the style reconstruction loss of the style encoder, respectively;
the loss function L_D of the discriminator is:
L_D = L_adv^D
where L_adv^D is the adversarial loss of the discriminator.
Step 4, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion.
As shown in fig. 2, the style encoder adopts a 1-dimensional convolutional neural network with ReLU activation functions. The style encoder is composed of 5 layers of 1-dimensional convolution modules and 5 layers of 1-dimensional pooling modules, where each 1-dimensional convolution module comprises a convolution and a ReLU activation function, each 1-dimensional pooling module consists of average pooling, and the output layer consists of a fully connected layer.
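A hedged PyTorch sketch of such a style encoder is given below; the channel widths, kernel sizes and style dimension are illustrative assumptions, since the embodiment does not specify them.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    def __init__(self, in_channels=80, hidden=256, style_dim=64):
        super().__init__()
        layers, ch = [], in_channels
        for _ in range(5):
            # one 1-D convolution module (convolution + ReLU) ...
            layers += [nn.Conv1d(ch, hidden, kernel_size=5, padding=2), nn.ReLU()]
            # ... followed by one 1-D average-pooling module
            layers += [nn.AvgPool1d(kernel_size=2)]
            ch = hidden
        self.body = nn.Sequential(*layers)
        self.out = nn.Linear(hidden, style_dim)   # fully connected output layer

    def forward(self, x):
        # x: (batch, feature_bins, frames) spectral feature of one emotion
        h = self.body(x)
        h = h.mean(dim=2)          # pool over the remaining frames
        return self.out(h)         # style feature s
```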
Step 5, inputting the extracted spectral feature x_s of the source emotion together with the style feature s_t of the target emotion obtained in step 4 into the generator for training, making the loss function L_G of the generator as small as possible, to obtain the generated converted-emotion spectral feature x_st.
As shown in fig. 3, the generator adopts a 2-dimensional dynamic convolution network with Mish activation functions and is composed of an encoder and a decoder. The encoding network consists of 7 2-dimensional convolution modules: the first 3 layers are 2-dimensional convolution modules, each comprising a 2-dimensional convolution, instance normalization and a Mish function, and the last 4 layers are dynamic convolution modules, each comprising a dynamic convolution, instance normalization and a Mish function. The decoding network consists of 6 2-dimensional convolution modules: the first 4 layers are dynamic convolution modules, each comprising a dynamic convolution, adaptive instance normalization and a Mish function, and the last 2 layers are 2-dimensional transposed convolution modules, each comprising a transposed convolution, adaptive instance normalization and a Mish function.
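The following simplified PyTorch sketch illustrates the encoder/decoder split and the injection of the style feature through adaptive instance normalization (AdaIN). Ordinary convolutions stand in for the dynamic convolutions, and the layer counts, strides and channel widths are reduced assumptions rather than the exact architecture of fig. 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """Scale/shift instance-normalized content features with the style vector."""
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.affine = nn.Linear(style_dim, num_channels * 2)

    def forward(self, x, s):
        gamma, beta = self.affine(s).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

class Generator(nn.Module):
    def __init__(self, style_dim=64, ch=64):
        super().__init__()
        # encoder: plain convolutions stand in for the conv + dynamic-conv stack
        self.enc = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=1, padding=1), nn.InstanceNorm2d(ch), nn.Mish(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.InstanceNorm2d(ch), nn.Mish(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.InstanceNorm2d(ch), nn.Mish(),
        )
        # decoder: AdaIN fuses the style feature with the content feature
        self.dec_conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.adain1 = AdaIN(style_dim, ch)
        self.dec_up1 = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.adain2 = AdaIN(style_dim, ch)
        self.dec_up2 = nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1)

    def encode(self, x):
        return self.enc(x)                         # emotion-independent content feature

    def decode(self, h, s):
        h = F.mish(self.adain1(self.dec_conv(h), s))
        h = F.mish(self.adain2(self.dec_up1(h), s))
        return self.dec_up2(h)                     # converted spectral feature

    def forward(self, x, s):
        return self.decode(self.encode(x), s)
```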
Step 6, inputting the converted-emotion spectral feature x_st obtained in step 5 and the target-emotion spectral feature x_t obtained in step 2 into the discriminator, and training the discriminator so that the adversarial loss L_adv^D of the discriminator is as small as possible.
As shown in fig. 4, the discriminator is composed of a 5-layer 2-dimensional convolution module and an output layer. The 2-dimensional convolution module of each layer comprises 2-dimensional convolution and a LeakyReLU function, and the number of convolution channels of the output layer of the discriminator is set to be 1.
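A minimal sketch of such a discriminator follows; the strides, kernel sizes and channel widths are assumptions, only the 5-layer 2-D convolution + LeakyReLU structure and the single-channel output follow the description above.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(5):
            layers += [nn.Conv2d(in_ch, ch, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            in_ch, ch = ch, ch * 2
        self.body = nn.Sequential(*layers)
        self.out = nn.Conv2d(in_ch, 1, kernel_size=1)   # one output channel

    def forward(self, x):
        return self.out(self.body(x))   # patch-level real/fake scores
```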
The loss function of the discriminator is:
L_D = L_adv^D
where L_adv^D is the adversarial loss of the discriminator:
L_adv^D = −E_{x_s}[ log D(x_s) ] − E_{x_s,s_t}[ log(1 − D(G(x_s, s_t))) ]
where D(x_s) denotes the discriminator D judging the real spectral feature, s_t denotes the style feature of the target emotion generated by the style encoder S, i.e. S(x_t) = s_t, G(x_s, s_t) denotes the spectral feature of the converted emotion generated by the generator G, D(G(x_s, s_t)) denotes the discriminator judging the generated spectral feature, E_{x_s,s_t}[·] denotes the expectation over the distribution generated by the generator G, and E_{x_s}[·] denotes the expectation over the real distribution;
the optimization target is:
min_D L_adv^D
step 7, the frequency spectrum characteristic x of the target emotion obtained in the step 5 is usedstInputting the data into the coding network of the generator G again to obtain the semantic features G (x) irrelevant to emotionst) Spectral feature x of source emotionsInputting the data into a style encoder S to obtain style characteristics S of the source emotionsThe semantic feature G (x) obtainedst) Style characteristics s of source emotionsThe signals are input into a decoding network of a generator G together for training, the loss function of the generator G is minimized in the training process, and the reconstructed frequency spectrum characteristic of the source emotion is obtained
Figure BDA0003454834370000094
And minimizing loss functions of the generator in the training process, wherein the loss functions comprise the countermeasure loss of the generator, the cycle consistency loss and the style reconstruction loss of the style encoder. Wherein the training cycle consistency loss is to make the source emotion frequency spectrum characteristic xsAfter passing through the generator G, the reconstructed source emotion frequency spectrum characteristics
Figure BDA0003454834370000095
Can be mixed with xsThe training style reconstruction loss is to restrict the style encoder to generate style features s more consistent with the target emotiont
The loss function of the generator is:
L_G = L_adv^G + λ_cyc · L_cyc + λ_sty · L_sty
The optimization target is:
min_G L_G
where λ_cyc and λ_sty are a set of regularization hyper-parameters representing the weights of the cycle-consistency loss and the style reconstruction loss, respectively.
L_adv^G represents the adversarial loss of the generator in the GAN:
L_adv^G = E_{x_s,s_t}[ log(1 − D(G(x_s, s_t))) ]
where E_{x_s,s_t}[·] denotes the expectation over the distribution generated by the generator, s_t denotes the style feature of the target emotion generated by the style encoder, S(x_t) = s_t, G(x_s, s_t) denotes the spectral feature of the converted emotion generated by the generator, and D(G(x_s, s_t)) denotes the discriminator judging the generated spectral feature against the real target spectral feature. Together with the adversarial loss L_adv^D of the discriminator, it forms the usual adversarial loss in a GAN, which is used to judge whether the spectrum input to the discriminator is a real spectrum or a generated one. During training, L_adv^G is made as small as possible, so that the generator is continuously optimized until it generates spectral features G(x_s, s_t) realistic enough that the discriminator can hardly distinguish real from fake.
L_cyc is the cycle-consistency loss of the generator G:
L_cyc = E_{x_s,s_t,s_s}[ ‖G(G(x_s, s_t), s_s) − x_s‖_1 ]
where s_s denotes the style feature of the source emotion, i.e. S(x_s) = s_s, G(G(x_s, s_t), s_s) denotes the spectral feature of the reconstructed source emotion generated by the generator, E[·] denotes the expected loss between the reconstructed source-emotion spectrum and the real source-emotion spectrum, and ‖·‖_1 denotes the 1-norm. When training the generator, L_cyc is made as small as possible, so that after the converted spectral feature G(x_s, s_t) and the style feature s_s of the source emotion are fed into the generator again, the resulting reconstructed source-emotion spectral feature is as similar as possible to x_s. Training with L_cyc effectively guarantees that the semantic features of the emotional speech are not lost after being encoded by the generator.
L_sty is the style reconstruction loss of the style encoder S, used to optimize the style feature s_t:
L_sty = E_{x_s,x_t}[ ‖s_t − S(G(x_s, s_t))‖_1 ]
where s_t denotes the style feature of the target emotion generated by the style encoder S, i.e. S(x_t) = s_t, G(x_s, s_t) denotes the spectral feature of the converted emotion generated by the generator, ‖·‖_1 denotes the 1-norm, and S(G(x_s, s_t)) denotes the style feature of the converted emotion generated by the style encoder S. The converted spectral feature G(x_s, s_t) is fed into the style encoder S, and the absolute difference between the resulting style feature and the style feature s_t of the target emotion generated by the style encoder is taken; during training, L_sty is made as small as possible, so that the style feature s_t of the target emotion generated by the style encoder S fully expresses the characteristics of the target emotion.
Step 8, repeating steps 4 to 7 until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network. The appropriate number of iterations differs with the specific configuration of the neural network and the performance of the experimental equipment; in this experiment the number of iterations was set to 200,000.
Step 9, using logarithmic Gaussian normalized fundamental frequency conversion: computing the mean and mean square deviation of the logarithmic fundamental frequency of each emotion, and converting the logarithmic fundamental frequency feature log f0_s of the source emotion into the logarithmic fundamental frequency feature of the target emotion; then obtaining the fundamental frequency difference compensation vector from the mean of the fundamental frequency features of the target emotion, and introducing it on the basis of the traditional logarithmic Gaussian normalized fundamental frequency conversion function to obtain the final fundamental frequency conversion function.
As shown in fig. 5, the fundamental frequency conversion function is:
log f0_st = (σ_t / σ_s) · (log f0_s − μ_s) + μ_t + θ
where (σ_t / σ_s) · (log f0_s − μ_s) + μ_t is the logarithmic fundamental frequency feature of the converted emotion obtained by logarithmic Gaussian normalized conversion, and θ is the fundamental frequency difference compensation vector:
θ = log f0_t′ − μ_t′
where μ_s and μ_t denote the means of the logarithmic fundamental frequency features of the source emotion and the target emotion, respectively, and σ_s and σ_t denote the corresponding mean square deviations. The sentences selected for the source emotion and the target emotion are non-parallel corpora, so the fundamental frequency feature of the target emotion is not aligned with that of the source emotion; if the dimension of the source-emotion fundamental frequency feature is larger than that of the target-emotion fundamental frequency feature, uniform sampling is adopted, otherwise linear interpolation is carried out, to obtain a target-emotion fundamental frequency feature log f0_t′ aligned with the source emotion, and μ_t′ denotes the mean of this aligned target-emotion fundamental frequency feature.
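A numpy sketch of this fundamental frequency conversion is given below. The explicit form of the compensation vector (deviation of the aligned target log-F0 envelope from its mean) is an interpretation of the description above rather than a formula quoted from the patent, and a single interpolation call stands in for the interpolation/sampling case split.

```python
import numpy as np

def convert_f0(log_f0_s, log_f0_t):
    """log_f0_s, log_f0_t: voiced-frame log-F0 sequences of source and target emotion."""
    mu_s, sigma_s = log_f0_s.mean(), log_f0_s.std()
    mu_t, sigma_t = log_f0_t.mean(), log_f0_t.std()

    # align the target log-F0 to the source length by interpolation / sampling
    idx = np.linspace(0, len(log_f0_t) - 1, num=len(log_f0_s))
    log_f0_t_aligned = np.interp(idx, np.arange(len(log_f0_t)), log_f0_t)

    theta = log_f0_t_aligned - log_f0_t_aligned.mean()        # compensation vector θ
    log_f0_lg = (log_f0_s - mu_s) * sigma_t / sigma_s + mu_t  # log-Gaussian normalization
    return np.exp(log_f0_lg + theta)                          # converted F0
```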
The conversion phase comprises the following steps:
and step 10, extracting frequency spectrum characteristics, aperiodic characteristics and logarithmic fundamental frequency characteristics of the source emotion and the target emotion by using a WORLD encoder.
Step 11, inputting the spectral feature of the target emotion extracted in step 10 into the style encoder S to obtain the style feature s_t of the target emotion.
Step 12, inputting the spectral feature x_s of the source emotion obtained in step 10 together with the style feature of the target emotion obtained in step 11 into the trained StyleGAN-EVC network to reconstruct the converted-emotion spectral feature x_st.
Step 13, converting the logarithmic fundamental frequency feature of the source emotion extracted in step 10 into the fundamental frequency feature of the target emotion through the fundamental frequency conversion function obtained in step 9.
Step 14, synthesizing the converted target-emotion speech with the WORLD vocoder from the converted-emotion spectral feature x_st obtained in step 12, the converted-emotion fundamental frequency feature obtained in step 13 and the aperiodic feature extracted in step 10.
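An end-to-end sketch of the conversion stage (steps 10 to 14) is shown below, under the same assumptions as the earlier snippets; feature reshaping/normalization for the network and voiced/unvoiced bookkeeping are simplified, and convert_f0 refers to the F0 sketch above.

```python
import numpy as np
import pyworld as pw
import torch

def convert_utterance(G, S, convert_f0, wav_s, wav_t, fs, frame_period=5.0):
    """wav_s / wav_t: float64 waveforms of the source- and target-emotion sentences."""
    # step 10: WORLD analysis of both utterances
    f0_s, t_s = pw.harvest(wav_s, fs, frame_period=frame_period)
    sp_s = pw.cheaptrick(wav_s, f0_s, t_s, fs)
    ap_s = pw.d4c(wav_s, f0_s, t_s, fs)
    f0_t, t_t = pw.harvest(wav_t, fs, frame_period=frame_period)
    sp_t = pw.cheaptrick(wav_t, f0_t, t_t, fs)

    # steps 11-12: style feature of the target emotion, spectrum conversion
    x_s = torch.from_numpy(sp_s.T).float()[None, None]   # (1, 1, bins, frames), assumed layout
    x_t = torch.from_numpy(sp_t.T).float()[None, None]
    with torch.no_grad():
        s_t = S(x_t.squeeze(1))                           # assumed (1, bins, frames) input for S
        sp_st = G.decode(G.encode(x_s), s_t)
    sp_st = sp_st.squeeze().T.double().numpy()

    # step 13: log-F0 conversion with difference compensation on voiced frames
    f0_st = np.zeros_like(f0_s)
    voiced = f0_s > 0
    f0_st[voiced] = convert_f0(np.log(f0_s[voiced]), np.log(f0_t[f0_t > 0]))

    # step 14: WORLD synthesis of the converted emotional speech
    return pw.synthesize(f0_st, np.ascontiguousarray(sp_st), ap_s, fs, frame_period)
```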
The above description is an exemplary embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (7)

1. A StyleGAN emotion voice conversion method based on fundamental frequency difference compensation is characterized by comprising a training phase and a conversion phase, wherein the training phase comprises the following steps:
step 1, obtaining a training corpus, wherein the training corpus consists of corpuses of various emotions of a speaker, and the emotions comprise source emotions and target emotions;
step 2, extracting frequency spectrum characteristics of different emotion voices from the training corpus to serve as acoustic characteristic vectors;
step 3, inputting the obtained acoustic feature vector into the StyleGAN-EVC network for training, and continuously optimizing its objective function until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network; the StyleGAN-EVC network comprises a generator G, a discriminator D and a style encoder S;
the generator G is divided into an encoder and a decoder, the encoder is used for generating content characteristics, and the decoder is used for reconstructing the obtained content characteristics and the style characteristics extracted by the style encoder S to generate reconstructed voice;
step 4, constructing a fundamental frequency conversion function from the source emotion to the target emotion, introducing a fundamental frequency difference compensation vector on the basis of the traditional logarithmic Gaussian normalized fundamental frequency conversion, and constructing a final fundamental frequency conversion function;
the conversion phase comprises the following steps:
step 5, selecting speech of a speaker in different emotions as the corpus to be converted, and respectively extracting, from the speech to be converted, the source-emotion Mel spectrum feature x_s and the target-emotion Mel spectrum feature x_t, the corresponding logarithmic fundamental frequency features log f0_s and log f0_t, and the corresponding aperiodic features AP_s and AP_t as acoustic feature vectors;
step 6, inputting the source-emotion spectral feature x_s and the target-emotion spectral feature x_t into the StyleGAN-EVC network trained in step 3 to reconstruct the converted-emotion spectral feature x_st;
step 7, converting the source-emotion logarithmic fundamental frequency feature log f0_s extracted in step 5 into the fundamental frequency feature f0_st of the target emotion through the fundamental frequency conversion function obtained in step 4;
step 9, synthesizing the converted emotional speech through a WORLD vocoder from the converted-emotion spectral feature x_st generated in step 6, the converted-emotion fundamental frequency feature f0_st obtained in step 7, and the source-emotion aperiodic feature AP_s extracted in step 5.
2. The method of StyleGAN emotion speech conversion based on fundamental frequency difference compensation as claimed in claim 1, wherein said style coder S is composed of 5 layers of 1-dimensional pooling module and 5 layers of 1-dimensional convolution module, wherein each layer of 1-dimensional pooling module is composed of average pooling, each layer of 1-dimensional convolution module comprises convolution and ReLU activation functions, and the output layer is composed of fully connected layers.
3. The StyleGAN emotion voice conversion method based on fundamental frequency difference compensation as claimed in claim 1, wherein the training process of step 3 comprises the following steps:
step 3-1, inputting the spectral feature x_s of the source emotion into the encoding network of the generator G to obtain the emotion-independent semantic feature G(x_s);
step 3-2, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion;
step 3-3, inputting the generated semantic feature G(x_s) and the style feature s_t of the target emotion into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the converted-emotion spectral feature x_st;
step 3-4, inputting the spectral feature x_s of the source emotion into the style encoder S to obtain the style feature s_s of the source emotion;
step 3-5, inputting the generated converted-emotion spectral feature x_st into the encoding network of the generator G again to obtain the emotion-independent semantic feature G(x_st);
step 3-6, inputting the generated semantic feature G(x_st) and the style feature s_s of the source emotion into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the reconstructed spectral feature x̂_s of the source emotion;
step 3-7, inputting the converted-emotion spectral feature x_st generated in step 3-3 into the discriminator D for training, minimizing the loss function of the discriminator D;
step 3-8, inputting the converted-emotion spectral feature x_st generated in step 3-3 into the style encoder S for training, minimizing the style loss function of the style encoder S;
step 3-9, returning to step 3-1 and repeating the above steps until the set number of iterations is reached, so as to obtain the trained StyleGAN-EVC network.
4. The StyleGAN emotion voice conversion method based on fundamental frequency difference compensation as claimed in claim 1, wherein the conversion process of step 6 comprises the following steps:
step 6-1, inputting the spectral feature x_s of the source emotion into the encoding network of the generator G trained in step 3 to obtain the emotion-independent semantic feature G(x_s);
step 6-2, inputting the spectral feature x_t of the target emotion into the style encoder S to obtain the style feature s_t of the target emotion;
step 6-3, inputting the generated emotion-independent semantic feature G(x_s) and the style feature s_t of the target emotion into the decoding network of the generator G trained in step 3 to obtain the converted-emotion spectral feature x_st.
5. The StyleGAN emotion voice conversion method based on fundamental frequency difference compensation as claimed in claim 1, wherein the style reconstruction loss function of the style encoder S is expressed as:
L_sty = E_{x_s,x_t}[ ‖S(x_t) − S(G(x_s, S(x_t)))‖_1 ]
where E_{x_s,x_t}[·] denotes the expectation of the difference between the style feature of the target emotion generated by the style encoder and the style feature of the converted emotion, ‖·‖_1 denotes the 1-norm, S(·) is the style encoder, S(x_t) is the style feature of the target emotion generated by the style encoder, G(·) is the generator, G(x_s, S(x_t)) is the spectral feature of the converted emotion generated by the generator, S(G(x_s, S(x_t))) is the style feature of the converted emotion generated by the style encoder, x_s is the spectral feature of the source emotion, and x_t is the spectral feature of the target emotion.
6. The method for StyleGAN emotion speech conversion based on fundamental frequency difference compensation as claimed in claim 1, wherein the objective function of the StyleGAN-EVC network is expressed as:
L_StyleGAN = L_G + L_D
where L_G is the loss function of the generator and L_D is the loss function of the discriminator;
the loss function L_G of the generator is expressed as:
L_G = L_adv^G + λ_cyc · L_cyc + λ_sty · L_sty
where λ_cyc and λ_sty are a set of regularization hyper-parameters representing the weights of the cycle-consistency loss and the style reconstruction loss, respectively, and L_adv^G, L_cyc and L_sty denote the adversarial loss of the generator, the cycle-consistency loss and the style reconstruction loss of the style encoder, respectively;
the loss function L_D of the discriminator is:
L_D = L_adv^D
where L_adv^D is the adversarial loss of the discriminator.
7. The method for StyleGAN emotion speech conversion based on fundamental frequency difference compensation as claimed in claim 1, wherein the fundamental frequency conversion function is:
log f0_st = (σ_t / σ_s) · (log f0_s − μ_s) + μ_t + θ
where μ_s and μ_t denote the means of the logarithmic fundamental frequency features of the source emotion and the target emotion, respectively, σ_s and σ_t denote the corresponding mean square deviations, and θ denotes the fundamental frequency difference compensation vector;
the fundamental frequency difference compensation vector θ is expressed as:
θ = log f0_t′ − μ_t′
where log f0_t′ denotes the fundamental frequency feature obtained from the fundamental frequency feature of the target emotion by linear interpolation or uniform sampling, and μ_t′ denotes the mean of that fundamental frequency feature of the target emotion.
Application CN202210004168.4A, priority date 2022-01-04, filing date 2022-01-04: StyleGAN emotion voice conversion method based on fundamental frequency difference compensation (pending).
Publication CN114299917A, publication date 2022-04-08, family ID 80975084, China.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294970A (en) * 2022-10-09 2022-11-04 苏州大学 Voice conversion method, device and storage medium for pathological voice
CN116072154A (en) * 2023-03-07 2023-05-05 华南师范大学 Speech emotion recognition method, device and equipment based on data enhancement



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination