CN114582363A - High-quality voice conversion method for non-parallel corpus - Google Patents

High-quality voice conversion method for non-parallel corpus

Info

Publication number
CN114582363A
CN114582363A
Authority
CN
China
Prior art keywords
voice
loss
conversion
speaker
mel spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210156203.4A
Other languages
Chinese (zh)
Inventor
简志华
韦凤瑜
徐嘉
金宏辉
章子旭
吴迎笑
游林
吴超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210156203.4A priority Critical patent/CN114582363A/en
Publication of CN114582363A publication Critical patent/CN114582363A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Abstract

The invention relates to a voice conversion method for non-parallel corpora, which comprises the following steps: (1) acquiring a voice database of a source speaker and extracting the source speaker's Mel spectrogram x as the voice feature for conversion; (2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain x'; (3) extracting the pitch frequency F0 of the source speaker and converting F0 into the fundamental frequency F0' of the target speaker through a logarithmic Gaussian normalized transformation; (4) training a CycleGAN model, adding a gradient penalty to the adversarial loss; (5) modifying the overall objective function accordingly; (6) inputting x' and F0' obtained in (2) and (3), together with the created time mask m, into the generator G_{X→Y}, where F0' is used as an auxiliary feature to adjust the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice; (7) feeding the obtained converted Mel spectrogram y' into a vocoder to synthesize the voice waveform, obtaining speech similar to that of the target speaker.

Description

High-quality voice conversion method for non-parallel corpus
Technical Field
The invention belongs to the technical field of voice conversion, and particularly relates to a high-quality voice conversion method for non-parallel corpora.
Background
Voice conversion refers to converting the personal characteristics of a source speaker's voice into those of a target speaker, so that the converted speech sounds like the target speaker while the linguistic content of the source speech is preserved during conversion. With the growing demand for personalized voices, voice conversion has been applied in fields such as psychology, biomedicine, and information security. Existing voice conversion methods can be divided into parallel and non-parallel conversion according to whether parallel voice data are available. Parallel voice conversion has many mature implementation techniques; however, in practice it is not easy to collect parallel voice data, and parallel speech requires time-alignment preprocessing, so inaccurate alignment degrades the conversion result. Non-parallel voice conversion imposes no requirement on parallel data or time alignment, making data collection simple and inexpensive, and most current research therefore focuses on non-parallel voice conversion. However, existing non-parallel voice conversion schemes still need improvement in terms of similarity.
Disclosure of Invention
Aiming at the defects of the prior art, the invention discloses a high-quality non-parallel voice conversion method. It uses a cycle-consistent generative adversarial network (CycleGAN) to realize voice conversion, proposes filling missing frames with a time mask during the training stage, and adds an R1 zero-centered gradient penalty (GP) to the adversarial loss of the CycleGAN, so as to improve the naturalness and similarity of the converted speech and to solve the problem of unstable CycleGAN training.
The invention adopts the following technical scheme:
the voice conversion method for the non-parallel linguistic data is carried out according to the following steps:
(1) acquiring a voice database of a source speaker, and extracting a Mel spectrogram x of the source speaker as a voice feature for conversion;
(2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain the frame-filled source-speech Mel spectrogram x';
(3) extracting fundamental tone frequency F0 of the source speaker, and converting F0 into fundamental frequency F0' of the target speaker through logarithmic Gaussian normalization transformation:
F0' = exp((log F0 - μ_x)·σ_y/σ_x + μ_y) (14)
wherein μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of log F0 for the source speaker and the target speaker, respectively;
(4) training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (15)
wherein E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter;
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id (16)
(6) inputting x' obtained in step (2), the fundamental frequency F0' obtained in step (3), and the created time mask m together into the generator G_{X→Y}; F0' serves as an auxiliary feature that adjusts the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice;
(7) feeding the converted Mel spectrogram y' obtained in the previous step into a vocoder to synthesize the voice waveform, obtaining speech similar to that of the target speaker.
Preferably, the step (2) is specifically as follows: given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, with the zero region determined randomly by a preset rule; the time mask m is applied to the source Mel spectrogram x, i.e.:
x'=x·m (1)
the generator G_{X→Y} of the CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', namely:
y' = G_{X→Y}(x', m, F0') (2)
using m as conditioning information, G_{X→Y} fills in the missing frames, while the auxiliary feature F0' adjusts the conversion of the Mel spectrogram; for the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature;
the inverse generator G_{Y→X} is then used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0') (3)
since the missing frames are assumed to have been filled in the previous operation, m' is an all-ones matrix; a second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
Preferably, in step (4), a gradient penalty is applied to the discriminator on real samples using the R1 zero-centered gradient penalty technique;
the regularization term for the R1 zero-center gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (4)
wherein E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter.
Preferably, in the CycleGAN model in the step (4), the generator G is trained by using four losses, and the mapping between X and Y is learned;
The adversarial loss for discriminator D_Y is:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))] (5)
The adversarial loss for discriminator D_X is:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))] (6)
wherein P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively;
The total adversarial loss of the CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) (7)
The cycle-consistency loss is used to preserve the speech content during the conversion process, and its expression is:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(G_{X→Y}(x)) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(G_{Y→X}(y)) - y||_1] (8)
The identity-mapping loss is introduced to better preserve the input, and its expression is:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(x) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(y) - y||_1] (9)
An additional discriminator D'_X is added, and one more adversarial loss, called the second adversarial loss, is imposed on the cyclically converted feature, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))] (10)
Similarly, an additional discriminator D'_Y is added for the reverse conversion, and the second adversarial loss for D'_Y is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))] (11)
The total second adversarial loss of the CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) (12)
the overall objective function of the resulting model is thus:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X}) (13)
wherein λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
Preferably, the generator in the CycleGAN model is a fully convolutional network comprising convolutional layers, gated linear units, 1-dimensional convolutions and 2-dimensional convolutions, wherein the 2-dimensional convolutions are applied in the down-sampling and up-sampling modules; the 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process; before the feature input and after the output, 1×1 convolutional layers are applied to adjust the channel size, and the number of input channels is 2, used to receive m and x'; the gated linear units are used to adaptively learn the order and hierarchical structure of the acoustic features.
Preferably, the discriminator in the CycleGAN model is a 2-dimensional convolutional neural network used to discriminate data based on 2-dimensional spectral textures; it comprises a down-sampling module, a gated linear unit and two convolutional layers, the data being input through the first convolutional layer, passing sequentially through the gated linear unit and the down-sampling module, and then through the last convolutional layer.
The technical scheme of the invention has the following advantages:
(1) The method uses a time mask to fill the missing frames in the Mel spectrogram, effectively protecting the harmonic structure of the speech during voice conversion; combined with the CycleGAN model for converting the voice features, the generated converted speech has higher quality and requires less time.
(2) The invention adds the zero-centered gradient penalty technique to the training process of the CycleGAN model, and solves the problem of unstable training of the generative adversarial network by penalizing a discriminator that deviates from the Nash equilibrium.
(3) The voice conversion method provided by the invention does not require parallel voice data, which lowers the data collection cost and effectively saves resources and time.
Drawings
Fig. 1 is a flow chart of a voice conversion method according to a preferred embodiment of the present invention.
FIG. 2 is a diagram of the training process of CycleGAN.
Fig. 3 is a diagram of the generator structure.
Fig. 4 is a structural diagram of the discriminator.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings.
In this embodiment, a generative adversarial network is first trained to perform the Mel spectrogram conversion; to improve the convergence of the CycleGAN, a zero-centered gradient penalty is added to the adversarial loss during training. The Mel spectrogram x of the source speaker's voice is then extracted, the created time mask m is applied to x, and the missing frames are filled to obtain the frame-filled Mel spectrogram x'. The fundamental frequency F0', obtained by the logarithmic Gaussian normalized transformation, is then fed as an auxiliary feature into the generator G together with the time mask m and x' to generate the converted target speech feature y'. Finally, y' and F0' are used as the input of a MelGAN vocoder to synthesize the speech waveform and obtain the converted target speech.
The preferred embodiment of the present invention ensures that the generated speech features are approximately consistent with the target speech features by computing the adversarial loss L_adv, introduces the cycle-consistency loss L_cyc to regularize the mapping between source and target and the identity-mapping loss L_id to further preserve the linguistic content, and then adds a zero-centered gradient penalty to the adversarial loss to stabilize model training so that the generative adversarial network approaches the Nash equilibrium. The loss function of the whole system consists of the adversarial loss L_adv, the second adversarial loss L_adv2, the cycle-consistency loss L_cyc, and the identity-mapping loss L_id. The flow of the voice conversion method of this embodiment is shown in Fig. 1, and the main contents of each part are described in detail below.
One, filling in missing frames with a time mask
In the preferred embodiment of the invention, the generator uses the preceding and following frames to obtain useful information for filling in the missing frames through the cyclic conversion process. First, given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, with the zero region determined randomly by a predetermined rule. The time mask m is applied to the source Mel spectrogram x, i.e.:
x'=x·m (1)
Next, the generator G_{X→Y} of the CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', namely:
y' = G_{X→Y}(x', m, F0') (2)
Using m as conditioning information, G_{X→Y} can fill in the missing frames, while the auxiliary feature F0' adjusts the conversion of the Mel spectrogram. For the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature.
Then, the inverse generator G_{Y→X} is used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0') (3)
Since the missing frames are assumed to have been filled in the previous operation, m' is an all-ones matrix. A second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
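As a non-limiting illustration of the masking step, the following Python sketch creates a time mask m and applies it to a Mel spectrogram as in Eq. (1). The mask-width range, the random rule, and the NumPy interface are assumptions made for illustration only, not part of the claimed method.

```python
import numpy as np

def make_time_mask(mel, max_masked_frames=32, rng=None):
    """Create a {0, 1} mask m with the same shape as the Mel spectrogram mel.

    A contiguous block of frames is zeroed at a random position; the block width
    and position stand in for the "predetermined rule" mentioned in the text.
    """
    rng = rng or np.random.default_rng()
    n_mels, n_frames = mel.shape
    width = int(rng.integers(1, max_masked_frames + 1))
    start = int(rng.integers(0, max(1, n_frames - width)))
    m = np.ones_like(mel)
    m[:, start:start + width] = 0.0  # frames the generator must fill in
    return m

# Eq. (1): x' = x * m, the masked spectrogram fed to the generator together with m
# Eq. (2): y' = G_{X->Y}(x', m, F0'), the generator fills the zeroed frames while converting
```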
Two, R1 zero-centered gradient penalty
In the conventional GAN training process, when the samples generated by the generator differ greatly from the real samples, the discriminator guides the generator through gradient descent. However, as the discriminator becomes increasingly able to distinguish real from fake samples, even generated samples that are close to the real ones are judged as fake, which pushes the GAN away from the Nash equilibrium, causes unstable training, and degrades the quality of the generated samples.
The preferred embodiment of the present invention proposes to apply a gradient penalty to the discriminator on real samples using the zero-centered gradient penalty technique; when the samples generated by the generator are similar to the real samples, the discriminator produces gradients close to zero, preventing the training from drifting away from the Nash equilibrium.
The regularization term for the R1 zero-center gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (4)
where E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter.
The invention adds this zero-centered gradient penalty to the adversarial loss of the CycleGAN to stabilize the training process of the model.
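A minimal PyTorch sketch of the R1 zero-centered gradient penalty of Eq. (4) is given below, assuming the discriminator is a torch.nn.Module that maps a batch of real Mel spectrograms to scalar scores; the value of γ and the training loop into which the penalty is added are assumptions for illustration.

```python
import torch

def r1_gradient_penalty(discriminator, real, gamma=10.0):
    """R1 zero-centered gradient penalty, Eq. (4): (gamma / 2) * E[ ||grad_x D(x)||^2 ],
    evaluated on real samples only."""
    real = real.detach().requires_grad_(True)
    scores = discriminator(real)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=real, create_graph=True)
    return 0.5 * gamma * grads.pow(2).flatten(1).sum(dim=1).mean()

# The penalty is added to the discriminator's loss term (Eq. (15)), so that the gradient
# at real samples is driven toward zero and training stays close to the Nash equilibrium.
```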
Three, CycleGAN model
The method adopts a CycleGAN model to convert the Mel spectrogram of the source speech into the Mel spectrogram of the target speech. In CycleGAN, the generator G is trained with 4 losses, learning the mapping between X and Y. The training process of CycleGAN is shown in figure 2.
The adversarial loss measures how similar the converted features are to the target features; the smaller the adversarial loss, the more similar the converted acoustic features are to the target acoustic features. The adversarial loss for discriminator D_Y is:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))] (5)
Likewise, the adversarial loss for discriminator D_X is:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))] (6)
wherein P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively.
The total adversarial loss of the CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) (7)
The cycle-consistency loss is used to preserve the speech content during the conversion process and is expressed as:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(G_{X→Y}(x)) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(G_{Y→X}(y)) - y||_1] (8)
The identity-mapping loss is introduced to better preserve the input and is expressed as:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(x) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(y) - y||_1] (9)
The second adversarial loss is introduced to alleviate the over-smoothing effect caused by the cycle-consistency loss. An additional discriminator D'_X is added, and one more adversarial loss, called the second adversarial loss, is imposed on the cyclically converted feature, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))] (10)
Similarly, an additional discriminator D'_Y is added for the reverse conversion, and the second adversarial loss for D'_Y is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))] (11)
The total second adversarial loss of the CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) (12)
Thus, the overall objective function of the model is:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X}) (13)
wherein λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
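The loss terms of Eqs. (5) to (13) can be sketched as follows. This is a simplified illustration: the generators and discriminators are assumed to be callables returning tensors (the discriminators returning probabilities), the time mask and F0' inputs are omitted for brevity, and in an actual training loop the generator and discriminator terms are optimized with opposite signs.

```python
import torch
import torch.nn.functional as F

def cyclegan_objective(G_xy, G_yx, D_x, D_y, D2_x, D2_y, x, y,
                       lambda_cyc=10.0, lambda_id=5.0, eps=1e-7):
    """Illustrative computation of the terms in Eqs. (5)-(13)."""
    fake_y, fake_x = G_xy(x), G_yx(y)

    # Adversarial losses, Eqs. (5)-(7)
    l_adv = (torch.log(D_y(y) + eps).mean() + torch.log(1 - D_y(fake_y) + eps).mean()
             + torch.log(D_x(x) + eps).mean() + torch.log(1 - D_x(fake_x) + eps).mean())

    # Cycle-consistency loss, Eq. (8)
    l_cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)

    # Identity-mapping loss, Eq. (9)
    l_id = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)

    # Second adversarial losses on the cyclically converted features, Eqs. (10)-(12)
    l_adv2 = (torch.log(D2_x(x) + eps).mean() + torch.log(1 - D2_x(G_yx(fake_y)) + eps).mean()
              + torch.log(D2_y(y) + eps).mean() + torch.log(1 - D2_y(G_xy(fake_x)) + eps).mean())

    # Overall objective, Eq. (13)
    return l_adv + l_adv2 + lambda_cyc * l_cyc + lambda_id * l_id
```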
The generator in the CycleGAN model is a fully convolutional network composed of 1-dimensional and 2-dimensional CNNs. The 2-dimensional convolutions are applied in the down-sampling and up-sampling modules, preserving the temporal structure while effectively capturing the overall relations and directions of the input features. The 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process. Before the feature input and after the output, 1×1 convolutions are applied to adjust the channel size, and the number of input channels is 2, used to receive m and x'. A gated linear unit (GLU) is used to adaptively learn the order and hierarchy of the acoustic features. The structure of the generator is shown in Fig. 3.
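A structural skeleton of such a generator is sketched below. The channel counts, kernel sizes, number of residual blocks, and the omission of the F0' auxiliary input are illustrative assumptions; only the overall layout (1×1 input convolution with 2 input channels, GLU-gated 2-D down/up-sampling, 1-D residual conversion blocks) follows the description above.

```python
import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    """2-D convolution followed by a gated linear unit (GLU) over the channel axis."""
    def __init__(self, in_ch, out_ch, kernel, stride, padding):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel, stride, padding)
    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)

class Generator(nn.Module):
    """Fully convolutional generator skeleton: 2-D convolutions for down/up-sampling,
    1-D convolutions in the residual blocks, 1x1 convolutions to adjust channels;
    the input has 2 channels (masked Mel spectrogram x' and time mask m)."""
    def __init__(self, n_mels=80, base=64, n_res=6):
        super().__init__()
        self.inp = nn.Conv2d(2, base, kernel_size=1)                 # 1x1, 2 input channels (x', m)
        self.down = nn.Sequential(GLUConv2d(base, 2 * base, 4, 2, 1),
                                  GLUConv2d(2 * base, 4 * base, 4, 2, 1))
        hid = 4 * base * (n_mels // 4)
        self.to_1d = nn.Conv1d(hid, 256, kernel_size=1)              # 2-D feature map -> 1-D sequence
        self.res = nn.Sequential(*[nn.Sequential(nn.Conv1d(256, 512, 3, padding=1),
                                                 nn.GLU(dim=1),
                                                 nn.Conv1d(256, 256, 3, padding=1))
                                   for _ in range(n_res)])           # main 1-D conversion blocks
        self.to_2d = nn.Conv1d(256, hid, kernel_size=1)
        self.up = nn.Sequential(GLUConv2d(4 * base, 2 * base, 3, 1, 1), nn.Upsample(scale_factor=2),
                                GLUConv2d(2 * base, base, 3, 1, 1), nn.Upsample(scale_factor=2))
        self.out = nn.Conv2d(base, 1, kernel_size=1)                 # 1x1 back to a single Mel channel

    def forward(self, x_masked, mask):                               # both (B, n_mels, T), T divisible by 4
        h = self.inp(torch.stack([x_masked, mask], dim=1))
        h = self.down(h)
        b, c, f, t = h.shape
        h = self.to_1d(h.reshape(b, c * f, t))
        h = self.res(h) + h                                          # residual connection around the 1-D blocks
        h = self.to_2d(h).reshape(b, c, f, t)
        h = self.up(h)
        return self.out(h).squeeze(1)
```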
The discriminator is a 2-dimensional convolutional neural network used to discriminate the data based on 2-dimensional spectral textures. It mainly consists of a down-sampling module, a gated linear unit and convolutional layers, and the last convolutional layer is used to reduce the number of parameters and stabilize the training of the GAN model. Its structure is shown in Fig. 4.
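A corresponding discriminator skeleton, reusing the GLUConv2d helper from the generator sketch above, is shown below; channel counts and kernel sizes are again illustrative assumptions rather than the claimed configuration.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """2-D convolutional discriminator skeleton: an input convolution, GLU-gated
    down-sampling blocks, and a final convolution producing a patch-wise score map."""
    def __init__(self, base=64):
        super().__init__()
        self.inp = GLUConv2d(1, base, kernel=3, stride=1, padding=1)            # first convolutional layer
        self.down = nn.Sequential(GLUConv2d(base, 2 * base, 3, 2, 1),
                                  GLUConv2d(2 * base, 4 * base, 3, 2, 1),
                                  GLUConv2d(4 * base, 8 * base, 3, 2, 1))
        self.out = nn.Conv2d(8 * base, 1, kernel_size=(1, 3), padding=(0, 1))   # last convolutional layer

    def forward(self, mel):                                                     # mel: (B, n_mels, T)
        h = self.down(self.inp(mel.unsqueeze(1)))
        return torch.sigmoid(self.out(h))                                       # patch-wise real/fake probabilities
```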
Four, voice conversion process
The method proposed by the preferred embodiment of the present invention mainly consists of two parts. The first part adds a zero-centered gradient penalty to the training process of the CycleGAN, alleviating the vanishing-gradient problem of GAN training; the trained CycleGAN model is used to synthesize the target Mel spectrogram from the source speaker's Mel spectrogram, the time mask, and the fundamental frequency. The second part fills the places where the extracted Mel spectrogram lacks frames: the created time mask is multiplied element-wise with the Mel spectrogram to obtain the voice features used for conversion. Finally, the converted target Mel spectrogram is sent to a vocoder to synthesize the speech waveform, obtaining the converted voice, i.e., speech that carries the identity information of the target speaker while retaining the content of the source speech.
The specific voice conversion process of this embodiment is as follows (an illustrative end-to-end code sketch is given after the steps):
(1) Acquiring a voice database of the source speaker, and extracting the source speaker's Mel spectrogram x as the voice feature for conversion.
(2) Creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain x'.
(3) Extracting fundamental tone frequency F0 of the source speaker, and converting F0 into fundamental frequency F0' of the target speaker through logarithmic Gaussian normalization transformation:
F0' = exp((log F0 - μ_x)·σ_y/σ_x + μ_y) (14)
wherein μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of log F0 for the source speaker and the target speaker, respectively.
(4) Training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (15)
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id (16)
(6) Inputting x' and F0' obtained in steps (2) and (3), together with the created time mask m, into the generator G_{X→Y}; F0' is used as an auxiliary feature to adjust the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice.
(7) Feeding the converted Mel spectrogram y' obtained in the last step into a vocoder to synthesize the voice waveform, obtaining high-quality speech similar to that of the target speaker.
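An end-to-end sketch of steps (1) to (7) is given below for illustration. The librosa feature extraction, the pyin F0 estimator, the make_time_mask helper from the sketch in part one, and the G_xy / vocoder call signatures are assumptions standing in for the trained components; the frame rates of the Mel spectrogram and the F0 contour are assumed to match.

```python
import numpy as np
import librosa

def convert_f0(f0_src, mu_x, sigma_x, mu_y, sigma_y):
    """Log-Gaussian normalized F0 transformation, Eq. (14):
    log F0' = (log F0 - mu_x) * sigma_y / sigma_x + mu_y (voiced frames only)."""
    voiced = f0_src > 0
    f0_conv = np.zeros_like(f0_src)
    f0_conv[voiced] = np.exp((np.log(f0_src[voiced]) - mu_x) * sigma_y / sigma_x + mu_y)
    return f0_conv

def convert_utterance(wav, sr, G_xy, vocoder, stats):
    """Steps (1)-(7): extract features, mask frames, run the trained generator, vocode."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)    # step (1)
    m = make_time_mask(mel)                                          # step (2)
    x_masked = mel * m                                               # Eq. (1)
    f0, _, _ = librosa.pyin(wav, fmin=50, fmax=500, sr=sr)           # step (3)
    f0_conv = convert_f0(np.nan_to_num(f0), *stats)                  # stats = (mu_x, sigma_x, mu_y, sigma_y)
    y_conv = G_xy(x_masked, m, f0_conv)                              # step (6), trained generator G_{X->Y}
    return vocoder(y_conv, f0_conv)                                  # step (7), e.g. a MelGAN-type vocoder
```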
The foregoing is merely an illustration of the preferred embodiments of the invention and of the technical principles applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described here, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; the scope of the invention is determined by the appended claims.

Claims (6)

1. A speech conversion method for non-parallel corpora, characterized by comprising the following steps:
(1) acquiring a voice database of a source speaker, and extracting a Mel spectrogram x of the source speaker as a voice feature for conversion;
(2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain the frame-filled source-speech Mel spectrogram x';
(3) extracting fundamental tone frequency F0 of the source speaker, and converting F0 into fundamental frequency F0' of the target speaker through logarithmic Gaussian normalization transformation:
F0' = exp((log F0 - μ_x)·σ_y/σ_x + μ_y) (14)
wherein μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of log F0 for the source speaker and the target speaker, respectively;
(4) training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (15)
wherein E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter;
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id (16)
(6) inputting x' obtained in step (2), the fundamental frequency F0' obtained in step (3), and the created time mask m into the generator G_{X→Y}; F0' is used as an auxiliary feature to adjust the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice;
(7) feeding the converted Mel spectrogram y' obtained in the previous step into a vocoder to synthesize the voice waveform, obtaining speech similar to that of the target speaker.
2. The method as claimed in claim 1, wherein the step (2) is specifically as follows: given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, with the zero region determined randomly by a preset rule; the time mask m is applied to the source Mel spectrogram x, i.e.:
x'=x·m (1)
the generator G_{X→Y} of the CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', namely:
y' = G_{X→Y}(x', m, F0') (2)
using m as conditioning information, G_{X→Y} fills in the missing frames, while the auxiliary feature F0' adjusts the conversion of the Mel spectrogram; for the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature;
the inverse generator G_{Y→X} is used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0') (3)
since the missing frames are assumed to have been filled in the previous operation, m' is an all-ones matrix; a second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
3. The speech conversion method for non-parallel corpora according to claim 1 or 2, wherein in step (4) a gradient penalty is applied to the discriminator on real samples using the R1 zero-centered gradient penalty technique;
the regularization term for the R1 zero-center gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (4)
wherein E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter.
4. The speech conversion method for non-parallel corpora according to claim 3, wherein in the CycleGAN model of step (4), the generator G is trained using four losses to learn the mapping between X and Y;
The adversarial loss for discriminator D_Y is:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))] (5)
The adversarial loss for discriminator D_X is:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))] (6)
wherein P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively;
The total adversarial loss of the CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) (7)
The cycle-consistency loss is used to preserve the speech content during the conversion process, and its expression is:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(G_{X→Y}(x)) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(G_{Y→X}(y)) - y||_1] (8)
The identity-mapping loss is introduced to better preserve the input, and its expression is:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(x) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(y) - y||_1] (9)
An additional discriminator D'_X is added, and one more adversarial loss, called the second adversarial loss, is imposed on the cyclically converted feature, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))] (10)
similarly, an additional discriminator D'_Y is added for the reverse conversion, and the second adversarial loss for D'_Y is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))] (11)
the total second adversarial loss of the CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) (12)
the overall objective function of the resulting model is thus:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X}) (13)
wherein λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
5. The speech conversion method for non-parallel corpora according to claim 4, wherein the generator in the CycleGAN model is a fully convolutional network comprising convolutional layers, gated linear units, 1-dimensional convolutions and 2-dimensional convolutions, the 2-dimensional convolutions being applied in the down-sampling and up-sampling modules; the 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process; before the feature input and after the output, 1×1 convolutional layers are applied to adjust the channel size, and the number of input channels is 2, used to receive m and x'; the gated linear units are used to adaptively learn the order and hierarchical structure of the acoustic features.
6. The method as claimed in claim 4, wherein the discriminator in the CycleGAN model is a 2-dimensional convolutional neural network for discriminating data based on 2-dimensional spectral textures, comprising a down-sampling module, a gated linear unit and two convolutional layers; the data are input through the first convolutional layer, pass sequentially through the gated linear unit and the down-sampling module, and then through the last convolutional layer.
CN202210156203.4A 2022-02-21 2022-02-21 High-quality voice conversion method for non-parallel corpus Pending CN114582363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210156203.4A CN114582363A (en) 2022-02-21 2022-02-21 High-quality voice conversion method for non-parallel corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210156203.4A CN114582363A (en) 2022-02-21 2022-02-21 High-quality voice conversion method for non-parallel corpus

Publications (1)

Publication Number Publication Date
CN114582363A true CN114582363A (en) 2022-06-03

Family

ID=81771061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210156203.4A Pending CN114582363A (en) 2022-02-21 2022-02-21 High-quality voice conversion method for non-parallel corpus

Country Status (1)

Country Link
CN (1) CN114582363A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294970A (en) * 2022-10-09 2022-11-04 苏州大学 Voice conversion method, device and storage medium for pathological voice


Similar Documents

Publication Publication Date Title
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN110060701A (en) Multi-to-multi phonetics transfer method based on VAWGAN-AC
CN101064104A (en) Emotion voice creating method based on voice conversion
CN112331183B (en) Non-parallel corpus voice conversion method and system based on autoregressive network
CN111429894A (en) Many-to-many speaker conversion method based on SE-ResNet STARGAN
CN110189766B (en) Voice style transfer method based on neural network
CN110047501A (en) Multi-to-multi phonetics transfer method based on beta-VAE
CN114582363A (en) High-quality voice conversion method for non-parallel corpus
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
CN113593588B (en) Multi-singer singing voice synthesis method and system based on generation of countermeasure network
Fu et al. Cycletransgan-evc: A cyclegan-based emotional voice conversion model with transformer
CN113066475B (en) Speech synthesis method based on generating type countermeasure network
Moritani et al. Stargan-based emotional voice conversion for japanese phrases
CN102930863A (en) Voice conversion and reconstruction method based on simplified self-adaptive interpolation weighting spectrum model
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
Yook et al. Voice conversion using conditional CycleGAN
Gao et al. Personalized Singing Voice Generation Using WaveRNN.
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
Tobing et al. Voice conversion with CycleRNN-based spectral mapping and finely tuned WaveNet vocoder
CN108417198A (en) A kind of men and women's phonetics transfer method based on spectrum envelope and pitch period
Zhao et al. Research on voice cloning with a few samples
Tobing et al. Low-latency real-time non-parallel voice conversion based on cyclic variational autoencoder and multiband WaveRNN with data-driven linear prediction
Zhao et al. Singing voice conversion based on wd-gan algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination