CN114582363A - High-quality voice conversion method for non-parallel corpus - Google Patents

High-quality voice conversion method for non-parallel corpus

Info

Publication number
CN114582363A
CN114582363A
Authority
CN
China
Prior art keywords
voice
loss
conversion
speaker
mel spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210156203.4A
Other languages
Chinese (zh)
Inventor
简志华
韦凤瑜
徐嘉
金宏辉
章子旭
吴迎笑
游林
吴超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210156203.4A priority Critical patent/CN114582363A/en
Publication of CN114582363A publication Critical patent/CN114582363A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Abstract

The invention relates to a voice conversion method for non-parallel corpora, which comprises the following steps: (1) acquiring a voice database of a source speaker and extracting the source speaker's Mel spectrogram x as the voice feature for conversion; (2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain x'; (3) extracting the pitch frequency F0 of the source speaker and converting F0 into the fundamental frequency F0' of the target speaker through a logarithmic Gaussian normalized transformation; (4) training a CycleGAN model, adding a gradient penalty to the adversarial loss; (5) modifying the overall objective function accordingly; (6) inputting x' and F0' obtained in (2) and (3), together with the created time mask m, into the generator G_{X→Y}, where F0' is used as an auxiliary feature to adjust the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice; (7) feeding the obtained converted Mel spectrogram y' into a vocoder to synthesize the voice waveform, obtaining speech similar to that of the target speaker.

Description

High-quality voice conversion method for non-parallel corpus
Technical Field
The invention belongs to the technical field of voice conversion, and particularly relates to a high-quality voice conversion method for non-parallel corpora.
Background
Voice conversion refers to converting the personal characteristics of a source speaker's voice into those of a target speaker, so that the converted speech sounds like the target speaker while the linguistic content of the source speech is preserved during conversion. With the growing demand for personalized voices, voice conversion has been applied in fields such as psychology, biomedicine, and information security. Existing voice conversion methods can be divided into parallel and non-parallel conversion according to whether parallel voice data are available. Parallel voice conversion has many mature implementation techniques; however, in practice it is not easy to collect parallel voice data, and parallel speech requires time-alignment preprocessing, so inaccurate alignment degrades the conversion result. Non-parallel voice conversion imposes no requirement on parallel data or time alignment, making data collection simple and inexpensive, and most current research therefore focuses on non-parallel voice conversion. However, existing non-parallel voice conversion schemes still need improvement in terms of similarity.
Disclosure of Invention
Aiming at the defects of the prior art, the invention discloses a high-quality non-parallel voice conversion method. It uses a cycle-consistent generative adversarial network (CycleGAN) to realize voice conversion, proposes filling missing frames with a time mask during the training stage, and adds an R1 zero-centered gradient penalty (GP) to the adversarial loss of the CycleGAN, so as to improve the naturalness and similarity of the converted speech and to solve the problem of unstable CycleGAN training.
The invention adopts the following technical scheme:
the voice conversion method for the non-parallel linguistic data is carried out according to the following steps:
(1) acquiring a voice database of a source speaker, and extracting a Mel spectrogram x of the source speaker as a voice feature for conversion;
(2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain the frame-filled source-speech Mel spectrogram x';
(3) extracting fundamental tone frequency F0 of the source speaker, and converting F0 into fundamental frequency F0' of the target speaker through logarithmic Gaussian normalization transformation:
F0' = exp((log F0 - μ_x)·σ_y/σ_x + μ_y) (14)
wherein μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of log F0 for the source speaker and the target speaker, respectively;
(4) training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (15)
wherein E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter;
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id (16)
(6) inputting x' obtained in step (2), the fundamental frequency F0' obtained in step (3), and the created time mask m together into the generator G_{X→Y}; F0' serves as an auxiliary feature that adjusts the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice;
(7) feeding the converted Mel spectrogram y' obtained in the previous step into a vocoder to synthesize the voice waveform, obtaining speech similar to that of the target speaker.
Preferably, the step (2) is specifically as follows: given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, with the zero region determined randomly by a preset rule; the time mask m is applied to the source Mel spectrogram x, i.e.:
x'=x·m (1)
the generator G_{X→Y} of the CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', namely:
y' = G_{X→Y}(x', m, F0') (2)
using m as conditioning information, G_{X→Y} fills in the missing frames, while the auxiliary feature F0' adjusts the conversion of the Mel spectrogram; for the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature;
the inverse generator G_{Y→X} is then used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0') (3)
since the missing frames are assumed to have been filled in the previous operation, m' is an all-ones matrix; a second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
Preferably, in step (4), a gradient penalty is applied to the discriminator on real samples using the R1 zero-centered gradient penalty technique;
the regularization term for the R1 zero-center gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (4)
wherein E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter.
Preferably, in the CycleGAN model in the step (4), the generator G is trained by using four losses, and the mapping between X and Y is learned;
The adversarial loss for discriminator D_Y is:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))] (5)
The adversarial loss for discriminator D_X is:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))] (6)
wherein P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively;
The total adversarial loss of the CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) (7)
The cycle-consistency loss is used to preserve the speech content during the conversion process, and its expression is:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(G_{X→Y}(x)) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(G_{Y→X}(y)) - y||_1] (8)
The identity-mapping loss is introduced to better preserve the input, and its expression is:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(x) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(y) - y||_1] (9)
An additional discriminator D'_X is added, and one more adversarial loss, called the second adversarial loss, is imposed on the cyclically converted feature, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))] (10)
Similarly, an additional discriminator D'_Y is added for the reverse conversion, and the second adversarial loss for D'_Y is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))] (11)
The total second adversarial loss of the CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) (12)
the overall objective function of the resulting model is thus:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X}) (13)
wherein λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
Preferably, the generator in the CycleGAN model is a fully convolutional network comprising convolutional layers, gated linear units, 1-dimensional convolutions and 2-dimensional convolutions, wherein the 2-dimensional convolutions are applied in the down-sampling and up-sampling modules; the 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process; before the feature input and after the output, 1×1 convolutional layers are applied to adjust the channel size, and the number of input channels is 2, used to receive m and x'; the gated linear units are used to adaptively learn the order and hierarchical structure of the acoustic features.
Preferably, the discriminator in the CycleGAN model is a 2-dimensional convolutional neural network used to discriminate data based on 2-dimensional spectral textures; it comprises a down-sampling module, a gated linear unit and two convolutional layers, the data being input through the first convolutional layer, passing sequentially through the gated linear unit and the down-sampling module, and then through the last convolutional layer.
The technical scheme of the invention has the following advantages:
(1) The method uses a time mask to fill the missing frames in the Mel spectrogram, effectively protecting the harmonic structure of the speech during voice conversion; combined with the CycleGAN model for converting the voice features, the generated converted speech has higher quality and requires less time.
(2) The invention adds the zero-centered gradient penalty technique to the training process of the CycleGAN model, and solves the problem of unstable training of the generative adversarial network by penalizing a discriminator that deviates from the Nash equilibrium.
(3) The voice conversion method provided by the invention does not require parallel voice data, which lowers the data collection cost and effectively saves resources and time.
Drawings
Fig. 1 is a flow chart of a voice conversion method according to a preferred embodiment of the present invention.
FIG. 2 is a diagram of the training process of CycleGAN.
Fig. 3 is a diagram of the generator structure.
Fig. 4 is a structural diagram of the discriminator.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings.
In this embodiment, a generative adversarial network is first trained to perform the Mel spectrogram conversion; to improve the convergence of the CycleGAN, a zero-centered gradient penalty is added to the adversarial loss during training. The Mel spectrogram x of the source speaker's voice is then extracted, the created time mask m is applied to x, and the missing frames are filled to obtain the frame-filled Mel spectrogram x'. The fundamental frequency F0', obtained by the logarithmic Gaussian normalized transformation, is then fed as an auxiliary feature into the generator G together with the time mask m and x' to generate the converted target speech feature y'. Finally, y' and F0' are used as the input of a MelGAN vocoder to synthesize the speech waveform and obtain the converted target speech.
The preferred embodiment of the present invention ensures that the generated speech features are approximately consistent with the target speech features by computing the adversarial loss L_adv, introduces the cycle-consistency loss L_cyc to regularize the mapping between source and target and the identity-mapping loss L_id to further preserve the linguistic content, and then adds a zero-centered gradient penalty to the adversarial loss to stabilize model training so that the generative adversarial network approaches the Nash equilibrium. The loss function of the whole system consists of the adversarial loss L_adv, the second adversarial loss L_adv2, the cycle-consistency loss L_cyc, and the identity-mapping loss L_id. The flow of the voice conversion method of this embodiment is shown in Fig. 1, and the main contents of each part are described in detail below.
One, filling in missing frames with a time mask
In the preferred embodiment of the invention, the generator uses the preceding and following frames to obtain useful information for filling in the missing frames through the cyclic conversion process. First, given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, with the zero region determined randomly by a predetermined rule. The time mask m is applied to the source Mel spectrogram x, i.e.:
x'=x·m (1)
Next, the generator G_{X→Y} of the CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', namely:
y' = G_{X→Y}(x', m, F0') (2)
Using m as conditioning information, G_{X→Y} can fill in the missing frames, while the auxiliary feature F0' adjusts the conversion of the Mel spectrogram. For the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature.
Then, the inverse generator G_{Y→X} is used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0') (3)
Since the missing frames are assumed to have been filled in the previous operation, m' is an all-ones matrix. A second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
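As a non-limiting illustration of the masking step, the following Python sketch creates a time mask m and applies it to a Mel spectrogram as in Eq. (1). The mask-width range, the random rule, and the NumPy interface are assumptions made for illustration only, not part of the claimed method.

```python
import numpy as np

def make_time_mask(mel, max_masked_frames=32, rng=None):
    """Create a {0, 1} mask m with the same shape as the Mel spectrogram mel.

    A contiguous block of frames is zeroed at a random position; the block width
    and position stand in for the "predetermined rule" mentioned in the text.
    """
    rng = rng or np.random.default_rng()
    n_mels, n_frames = mel.shape
    width = int(rng.integers(1, max_masked_frames + 1))
    start = int(rng.integers(0, max(1, n_frames - width)))
    m = np.ones_like(mel)
    m[:, start:start + width] = 0.0  # frames the generator must fill in
    return m

# Eq. (1): x' = x * m, the masked spectrogram fed to the generator together with m
# Eq. (2): y' = G_{X->Y}(x', m, F0'), the generator fills the zeroed frames while converting
```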
Two, R1 zero-centered gradient penalty
In the conventional GAN training process, when the samples generated by the generator differ greatly from the real samples, the discriminator guides the generator through gradient descent. However, as the discriminator becomes increasingly able to distinguish real from fake samples, even generated samples that are close to the real ones are judged as fake, which pushes the GAN away from the Nash equilibrium, causes unstable training, and degrades the quality of the generated samples.
The preferred embodiment of the present invention proposes to apply a gradient penalty to the discriminator on real samples using the zero-centered gradient penalty technique; when the samples generated by the generator are similar to the real samples, the discriminator produces gradients close to zero, preventing the training from drifting away from the Nash equilibrium.
The regularization term for the R1 zero-center gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (4)
where E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter.
The invention adds this zero-centered gradient penalty to the adversarial loss of the CycleGAN to stabilize the training process of the model.
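A minimal PyTorch sketch of the R1 zero-centered gradient penalty of Eq. (4) is given below, assuming the discriminator is a torch.nn.Module that maps a batch of real Mel spectrograms to scalar scores; the value of γ and the training loop into which the penalty is added are assumptions for illustration.

```python
import torch

def r1_gradient_penalty(discriminator, real, gamma=10.0):
    """R1 zero-centered gradient penalty, Eq. (4): (gamma / 2) * E[ ||grad_x D(x)||^2 ],
    evaluated on real samples only."""
    real = real.detach().requires_grad_(True)
    scores = discriminator(real)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=real, create_graph=True)
    return 0.5 * gamma * grads.pow(2).flatten(1).sum(dim=1).mean()

# The penalty is added to the discriminator's loss term (Eq. (15)), so that the gradient
# at real samples is driven toward zero and training stays close to the Nash equilibrium.
```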
Three, CycleGAN model
The method adopts a CycleGAN model to convert the Mel spectrogram of the source speech into the Mel spectrogram of the target speech. In CycleGAN, the generator G is trained with 4 losses, learning the mapping between X and Y. The training process of CycleGAN is shown in figure 2.
The adversarial loss measures how similar the converted features are to the target features; the smaller the adversarial loss, the more similar the converted acoustic features are to the target acoustic features. The adversarial loss for discriminator D_Y is:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))] (5)
Likewise, the adversarial loss for discriminator D_X is:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))] (6)
wherein P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively.
The total adversarial loss of the CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) (7)
The cycle-consistency loss is used to preserve the speech content during the conversion process and is expressed as:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(G_{X→Y}(x)) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(G_{Y→X}(y)) - y||_1] (8)
The identity-mapping loss is introduced to better preserve the input and is expressed as:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(x) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(y) - y||_1] (9)
The second adversarial loss is introduced to alleviate the over-smoothing effect caused by the cycle-consistency loss. An additional discriminator D'_X is added, and one more adversarial loss, called the second adversarial loss, is imposed on the cyclically converted feature, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))] (10)
Similarly, an additional discriminator D'_Y is added for the reverse conversion, and the second adversarial loss for D'_Y is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))] (11)
The total second adversarial loss of the CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) (12)
Thus, the overall objective function of the model is:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X}) (13)
wherein λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
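The loss terms of Eqs. (5) to (13) can be sketched as follows. This is a simplified illustration: the generators and discriminators are assumed to be callables returning tensors (the discriminators returning probabilities), the time mask and F0' inputs are omitted for brevity, and in an actual training loop the generator and discriminator terms are optimized with opposite signs.

```python
import torch
import torch.nn.functional as F

def cyclegan_objective(G_xy, G_yx, D_x, D_y, D2_x, D2_y, x, y,
                       lambda_cyc=10.0, lambda_id=5.0, eps=1e-7):
    """Illustrative computation of the terms in Eqs. (5)-(13)."""
    fake_y, fake_x = G_xy(x), G_yx(y)

    # Adversarial losses, Eqs. (5)-(7)
    l_adv = (torch.log(D_y(y) + eps).mean() + torch.log(1 - D_y(fake_y) + eps).mean()
             + torch.log(D_x(x) + eps).mean() + torch.log(1 - D_x(fake_x) + eps).mean())

    # Cycle-consistency loss, Eq. (8)
    l_cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)

    # Identity-mapping loss, Eq. (9)
    l_id = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)

    # Second adversarial losses on the cyclically converted features, Eqs. (10)-(12)
    l_adv2 = (torch.log(D2_x(x) + eps).mean() + torch.log(1 - D2_x(G_yx(fake_y)) + eps).mean()
              + torch.log(D2_y(y) + eps).mean() + torch.log(1 - D2_y(G_xy(fake_x)) + eps).mean())

    # Overall objective, Eq. (13)
    return l_adv + l_adv2 + lambda_cyc * l_cyc + lambda_id * l_id
```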
The generator in the CycleGAN model is a fully convolutional network composed of 1-dimensional and 2-dimensional CNNs. The 2-dimensional convolutions are applied in the down-sampling and up-sampling modules, preserving the temporal structure while effectively capturing the overall relations and directions of the input features. The 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process. Before the feature input and after the output, 1×1 convolutions are applied to adjust the channel size, and the number of input channels is 2, used to receive m and x'. A gated linear unit (GLU) is used to adaptively learn the order and hierarchy of the acoustic features. The structure of the generator is shown in Fig. 3.
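A structural skeleton of such a generator is sketched below. The channel counts, kernel sizes, number of residual blocks, and the omission of the F0' auxiliary input are illustrative assumptions; only the overall layout (1×1 input convolution with 2 input channels, GLU-gated 2-D down/up-sampling, 1-D residual conversion blocks) follows the description above.

```python
import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    """2-D convolution followed by a gated linear unit (GLU) over the channel axis."""
    def __init__(self, in_ch, out_ch, kernel, stride, padding):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel, stride, padding)
    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)

class Generator(nn.Module):
    """Fully convolutional generator skeleton: 2-D convolutions for down/up-sampling,
    1-D convolutions in the residual blocks, 1x1 convolutions to adjust channels;
    the input has 2 channels (masked Mel spectrogram x' and time mask m)."""
    def __init__(self, n_mels=80, base=64, n_res=6):
        super().__init__()
        self.inp = nn.Conv2d(2, base, kernel_size=1)                 # 1x1, 2 input channels (x', m)
        self.down = nn.Sequential(GLUConv2d(base, 2 * base, 4, 2, 1),
                                  GLUConv2d(2 * base, 4 * base, 4, 2, 1))
        hid = 4 * base * (n_mels // 4)
        self.to_1d = nn.Conv1d(hid, 256, kernel_size=1)              # 2-D feature map -> 1-D sequence
        self.res = nn.Sequential(*[nn.Sequential(nn.Conv1d(256, 512, 3, padding=1),
                                                 nn.GLU(dim=1),
                                                 nn.Conv1d(256, 256, 3, padding=1))
                                   for _ in range(n_res)])           # main 1-D conversion blocks
        self.to_2d = nn.Conv1d(256, hid, kernel_size=1)
        self.up = nn.Sequential(GLUConv2d(4 * base, 2 * base, 3, 1, 1), nn.Upsample(scale_factor=2),
                                GLUConv2d(2 * base, base, 3, 1, 1), nn.Upsample(scale_factor=2))
        self.out = nn.Conv2d(base, 1, kernel_size=1)                 # 1x1 back to a single Mel channel

    def forward(self, x_masked, mask):                               # both (B, n_mels, T), T divisible by 4
        h = self.inp(torch.stack([x_masked, mask], dim=1))
        h = self.down(h)
        b, c, f, t = h.shape
        h = self.to_1d(h.reshape(b, c * f, t))
        h = self.res(h) + h                                          # residual connection around the 1-D blocks
        h = self.to_2d(h).reshape(b, c, f, t)
        h = self.up(h)
        return self.out(h).squeeze(1)
```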
The discriminator is a 2-dimensional convolutional neural network used to discriminate the data based on 2-dimensional spectral textures. It mainly consists of a down-sampling module, a gated linear unit and convolutional layers, and the last convolutional layer is used to reduce the number of parameters and stabilize the training of the GAN model. Its structure is shown in Fig. 4.
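A corresponding discriminator skeleton, reusing the GLUConv2d helper from the generator sketch above, is shown below; channel counts and kernel sizes are again illustrative assumptions rather than the claimed configuration.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """2-D convolutional discriminator skeleton: an input convolution, GLU-gated
    down-sampling blocks, and a final convolution producing a patch-wise score map."""
    def __init__(self, base=64):
        super().__init__()
        self.inp = GLUConv2d(1, base, kernel=3, stride=1, padding=1)            # first convolutional layer
        self.down = nn.Sequential(GLUConv2d(base, 2 * base, 3, 2, 1),
                                  GLUConv2d(2 * base, 4 * base, 3, 2, 1),
                                  GLUConv2d(4 * base, 8 * base, 3, 2, 1))
        self.out = nn.Conv2d(8 * base, 1, kernel_size=(1, 3), padding=(0, 1))   # last convolutional layer

    def forward(self, mel):                                                     # mel: (B, n_mels, T)
        h = self.down(self.inp(mel.unsqueeze(1)))
        return torch.sigmoid(self.out(h))                                       # patch-wise real/fake probabilities
```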
Four, voice conversion process
The method proposed by the preferred embodiment of the present invention mainly consists of two parts. The first part adds a zero-centered gradient penalty to the training process of the CycleGAN, alleviating the vanishing-gradient problem of GAN training; the trained CycleGAN model is used to synthesize the target Mel spectrogram from the source speaker's Mel spectrogram, the time mask, and the fundamental frequency. The second part fills the places where the extracted Mel spectrogram lacks frames: the created time mask is multiplied element-wise with the Mel spectrogram to obtain the voice features used for conversion. Finally, the converted target Mel spectrogram is sent to a vocoder to synthesize the speech waveform, obtaining the converted voice, i.e., speech that carries the identity information of the target speaker while retaining the content of the source speech.
The specific voice conversion process of this embodiment is as follows (an illustrative end-to-end code sketch is given after the steps):
(1) Acquiring a voice database of the source speaker, and extracting the source speaker's Mel spectrogram x as the voice feature for conversion.
(2) Creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain x'.
(3) Extracting fundamental tone frequency F0 of the source speaker, and converting F0 into fundamental frequency F0' of the target speaker through logarithmic Gaussian normalization transformation:
F0' = exp((log F0 - μ_x)·σ_y/σ_x + μ_y) (14)
wherein μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of log F0 for the source speaker and the target speaker, respectively.
(4) Training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (15)
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id (16)
(6) Inputting x' and F0' obtained in steps (2) and (3), together with the created time mask m, into the generator G_{X→Y}; F0' is used as an auxiliary feature to adjust the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice.
(7) Feeding the converted Mel spectrogram y' obtained in the last step into a vocoder to synthesize the voice waveform, obtaining high-quality speech similar to that of the target speaker.
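An end-to-end sketch of steps (1) to (7) is given below for illustration. The librosa feature extraction, the pyin F0 estimator, the make_time_mask helper from the sketch in part one, and the G_xy / vocoder call signatures are assumptions standing in for the trained components; the frame rates of the Mel spectrogram and the F0 contour are assumed to match.

```python
import numpy as np
import librosa

def convert_f0(f0_src, mu_x, sigma_x, mu_y, sigma_y):
    """Log-Gaussian normalized F0 transformation, Eq. (14):
    log F0' = (log F0 - mu_x) * sigma_y / sigma_x + mu_y (voiced frames only)."""
    voiced = f0_src > 0
    f0_conv = np.zeros_like(f0_src)
    f0_conv[voiced] = np.exp((np.log(f0_src[voiced]) - mu_x) * sigma_y / sigma_x + mu_y)
    return f0_conv

def convert_utterance(wav, sr, G_xy, vocoder, stats):
    """Steps (1)-(7): extract features, mask frames, run the trained generator, vocode."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)    # step (1)
    m = make_time_mask(mel)                                          # step (2)
    x_masked = mel * m                                               # Eq. (1)
    f0, _, _ = librosa.pyin(wav, fmin=50, fmax=500, sr=sr)           # step (3)
    f0_conv = convert_f0(np.nan_to_num(f0), *stats)                  # stats = (mu_x, sigma_x, mu_y, sigma_y)
    y_conv = G_xy(x_masked, m, f0_conv)                              # step (6), trained generator G_{X->Y}
    return vocoder(y_conv, f0_conv)                                  # step (7), e.g. a MelGAN-type vocoder
```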
The foregoing is merely an illustration of the preferred embodiments of the invention and of the technical principles applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described here, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; the scope of the invention is determined by the appended claims.

Claims (6)

1. A speech conversion method for non-parallel corpora, characterized by comprising the following steps:
(1) acquiring a voice database of a source speaker, and extracting a Mel spectrogram x of the source speaker as a voice feature for conversion;
(2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain the frame-filled source-speech Mel spectrogram x';
(3) extracting fundamental tone frequency F0 of the source speaker, and converting F0 into fundamental frequency F0' of the target speaker through logarithmic Gaussian normalization transformation:
F0' = exp((log F0 - μ_x)·σ_y/σ_x + μ_y) (14)
wherein μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of log F0 for the source speaker and the target speaker, respectively;
(4) training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (15)
wherein E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter;
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id (16)
(6) inputting x' obtained in step (2), the fundamental frequency F0' obtained in step (3), and the created time mask m into the generator G_{X→Y}; F0' is used as an auxiliary feature to adjust the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice;
(7) feeding the converted Mel spectrogram y' obtained in the previous step into a vocoder to synthesize the voice waveform, obtaining speech similar to that of the target speaker.
2. The method as claimed in claim 1, wherein the step (2) is specifically as follows: given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, with the zero region determined randomly by a preset rule; the time mask m is applied to the source Mel spectrogram x, i.e.:
x'=x·m (1)
the generator G_{X→Y} of the CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', namely:
y' = G_{X→Y}(x', m, F0') (2)
using m as conditioning information, G_{X→Y} fills in the missing frames, while the auxiliary feature F0' adjusts the conversion of the Mel spectrogram; for the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature;
the inverse generator G_{Y→X} is used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0') (3)
since the missing frames are assumed to have been filled in the previous operation, m' is an all-ones matrix; a second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
3. The speech conversion method for non-parallel corpora according to claim 1 or 2, wherein in step (4) a gradient penalty is applied to the discriminator on real samples using the R1 zero-centered gradient penalty technique;
the regularization term for the R1 zero-center gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[||∇_x D(x)||^2] (4)
wherein E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter.
4. The speech conversion method for non-parallel corpora according to claim 3, wherein in the CycleGAN model of step (4), the generator G is trained using four losses to learn the mapping between X and Y;
The adversarial loss for discriminator D_Y is:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))] (5)
The adversarial loss for discriminator D_X is:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))] (6)
wherein P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively;
The total adversarial loss of the CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) (7)
The cycle-consistency loss is used to preserve the speech content during the conversion process, and its expression is:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(G_{X→Y}(x)) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(G_{Y→X}(y)) - y||_1] (8)
The identity-mapping loss is introduced to better preserve the input, and its expression is:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[||G_{Y→X}(x) - x||_1] + E_{y~P(Y)}[||G_{X→Y}(y) - y||_1] (9)
An additional discriminator D'_X is added, and one more adversarial loss, called the second adversarial loss, is imposed on the cyclically converted feature, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))] (10)
similarly, an additional discriminator D'_Y is added for the reverse conversion, and the second adversarial loss for D'_Y is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))] (11)
the total second adversarial loss of the CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) (12)
the overall objective function of the resulting model is thus:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X}) (13)
wherein λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
5. The speech conversion method for non-parallel corpora according to claim 4, wherein the generator in the CycleGAN model is a fully convolutional network comprising convolutional layers, gated linear units, 1-dimensional convolutions and 2-dimensional convolutions, the 2-dimensional convolutions being applied in the down-sampling and up-sampling modules; the 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process; before the feature input and after the output, 1×1 convolutional layers are applied to adjust the channel size, and the number of input channels is 2, used to receive m and x'; the gated linear units are used to adaptively learn the order and hierarchical structure of the acoustic features.
6. The method as claimed in claim 4, wherein the discriminator in the CycleGAN model is a 2-dimensional convolutional neural network for discriminating data based on 2-dimensional spectral textures, comprising a down-sampling module, a gated linear unit and two convolutional layers; the data are input through the first convolutional layer, pass sequentially through the gated linear unit and the down-sampling module, and then through the last convolutional layer.
CN202210156203.4A 2022-02-21 2022-02-21 High-quality voice conversion method for non-parallel corpus Pending CN114582363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210156203.4A CN114582363A (en) 2022-02-21 2022-02-21 High-quality voice conversion method for non-parallel corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210156203.4A CN114582363A (en) 2022-02-21 2022-02-21 High-quality voice conversion method for non-parallel corpus

Publications (1)

Publication Number Publication Date
CN114582363A true CN114582363A (en) 2022-06-03

Family

ID=81771061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210156203.4A Pending CN114582363A (en) 2022-02-21 2022-02-21 High-quality voice conversion method for non-parallel corpus

Country Status (1)

Country Link
CN (1) CN114582363A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294970A (en) * 2022-10-09 2022-11-04 苏州大学 Voice conversion method, device and storage medium for pathological voice


Similar Documents

Publication Publication Date Title
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN110060701A (en) Multi-to-multi phonetics transfer method based on VAWGAN-AC
CN101064104A (en) Emotion voice creating method based on voice conversion
CN112331183B (en) Non-parallel corpus voice conversion method and system based on autoregressive network
CN111429894A (en) Many-to-many speaker conversion method based on SE-ResNet STARGAN
CN110189766B (en) Voice style transfer method based on neural network
CN110047501A (en) Multi-to-multi phonetics transfer method based on beta-VAE
CN114582363A (en) High-quality voice conversion method for non-parallel corpus
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
CN113593588B (en) Multi-singer singing voice synthesis method and system based on generation of countermeasure network
Fu et al. Cycletransgan-evc: A cyclegan-based emotional voice conversion model with transformer
CN113066475B (en) Speech synthesis method based on generating type countermeasure network
Moritani et al. Stargan-based emotional voice conversion for japanese phrases
CN102930863A (en) Voice conversion and reconstruction method based on simplified self-adaptive interpolation weighting spectrum model
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
Yook et al. Voice conversion using conditional CycleGAN
Gao et al. Personalized Singing Voice Generation Using WaveRNN.
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
Tobing et al. Voice conversion with CycleRNN-based spectral mapping and finely tuned WaveNet vocoder
CN108417198A (en) A kind of men and women's phonetics transfer method based on spectrum envelope and pitch period
Zhao et al. Research on voice cloning with a few samples
Tobing et al. Low-latency real-time non-parallel voice conversion based on cyclic variational autoencoder and multiband WaveRNN with data-driven linear prediction
Zhao et al. Singing voice conversion based on wd-gan algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination