CN114882897A - Training of voice conversion model, voice conversion method, device and related equipment - Google Patents

Training of voice conversion model, voice conversion method, device and related equipment

Info

Publication number
CN114882897A
CN114882897A (application CN202210517643.8A)
Authority
CN
China
Prior art keywords
voice data
gradient
generator
timbre domain
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210517643.8A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210517643.8A
Publication of CN114882897A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 - Adapting to target pitch
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 - Adapting to target pitch
    • G10L 2021/0135 - Voice conversion or morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to artificial intelligence technology and provides a method, an apparatus, and related equipment for training a voice conversion model. The method comprises the following steps: inputting first timbre-domain voice data and second timbre-domain voice data into a cycle generative adversarial network (CycleGAN) to train the network, and obtaining the discrimination result of a discriminator in the network; if the generator in the network needs to be optimized and gradient reversal is to be performed according to the discrimination result, reversing the gradient calculated from the loss function corresponding to the generator and updating the model parameters of the generator according to the reversed gradient; if gradient reversal is not to be performed according to the discrimination result, calculating the gradient from the loss function corresponding to the generator, updating the model parameters of the generator according to that gradient, and training iteratively until the model converges. The method and apparatus yield a more robust voice conversion model while improving the accuracy of voice conversion.

Description

Training of voice conversion model, voice conversion method, device and related equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a method and an apparatus for training a voice conversion model and performing voice conversion, and to related equipment.
Background
Voice conversion (VC) converts the voice of a source speaker into the timbre of a target speaker while keeping the speech content unchanged, through timbre extraction and content decoupling. Application scenarios include dubbing for film and television, and timbre conversion in audiobook narration so that different story roles can be matched automatically.
Current voice conversion methods are mainly based either on generative adversarial networks (GANs) or on conditional variational autoencoders (VAEs). GAN-based methods can synthesize target speech with high similarity, but GAN-based model training is unstable; conditional-VAE-based methods are simpler to train but struggle to learn hidden variables whose distribution fully matches that of the target speech.
Disclosure of Invention
The application aims to solve the technical problems in the prior art that unstable model training degrades voice conversion, or that the speech distribution is difficult to learn comprehensively. The application provides a method, an apparatus, and related equipment for training a voice conversion model and for voice conversion, with the main aim of obtaining a more robust voice conversion model and improving the accuracy of voice conversion.
In order to achieve the above object, the present application provides a training method for a voice conversion model based on a cycle generative adversarial network (CycleGAN), the training method comprising:
acquiring a first timbre-domain voice data set and a second timbre-domain voice data set as training samples;
inputting first timbre-domain voice data selected from the first timbre-domain voice data set and second timbre-domain voice data selected from the second timbre-domain voice data set into the constructed cycle generative adversarial network for adversarial training, and obtaining the discrimination result of a discriminator in the network;
if the generator in the cycle generative adversarial network needs to be optimized, determining whether to perform gradient reversal according to the discrimination result;
if gradient reversal is to be performed according to the discrimination result, calculating a first loss function corresponding to the generator, reversing the first gradient calculated from the first loss function,
and updating the model parameters of the generator according to the reversed first gradient;
if gradient reversal is not to be performed according to the discrimination result, calculating the first loss function corresponding to the generator, calculating the first gradient from the first loss function, and updating the model parameters of the generator according to the first gradient;
and if the cycle generative adversarial network has not converged, returning to the step of inputting the selected first and second timbre-domain voice data into the constructed network for adversarial training, until the cycle generative adversarial network converges.
In order to achieve the above object, the present application further provides a voice conversion method based on the cycle generative adversarial network, the voice conversion method comprising:
performing timbre conversion on input first timbre-domain voice to be converted by using a trained CycleGAN-based voice conversion model, to obtain corresponding target second timbre-domain voice data, wherein the trained model is obtained by any of the training methods for the CycleGAN-based voice conversion model described above.
In order to achieve the above object, the present application further provides a training apparatus for a voice conversion model based on the cycle generative adversarial network, the training apparatus comprising:
a sample acquisition module, configured to acquire a first timbre-domain voice data set and a second timbre-domain voice data set as training samples;
a training module, configured to input first timbre-domain voice data selected from the first timbre-domain voice data set and second timbre-domain voice data selected from the second timbre-domain voice data set into the constructed cycle generative adversarial network for adversarial training, and to obtain the discrimination result of a discriminator in the network;
a judging module, configured to determine, if the generator in the network needs to be optimized, whether to perform gradient reversal according to the discrimination result;
a reversal module, configured to calculate a first loss function corresponding to the generator if gradient reversal is to be performed according to the discrimination result, and to reverse the first gradient calculated from the first loss function;
a first parameter updating module, configured to update the model parameters of the generator according to the reversed first gradient;
a second parameter updating module, configured to calculate the first loss function corresponding to the generator if gradient reversal is not to be performed according to the discrimination result, calculate the first gradient from the first loss function, and update the model parameters of the generator according to the first gradient;
and an iteration module, configured to jump back to the training module if the cycle generative adversarial network has not converged, until the network converges.
To achieve the above object, the present application further provides a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, performs the steps of any training method for the CycleGAN-based voice conversion model described above, or performs the steps of any CycleGAN-based voice conversion method described above.
To achieve the above object, the present application further provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to perform the steps of any training method for the CycleGAN-based voice conversion model described above, or to perform the steps of any CycleGAN-based voice conversion method described above.
According to the above training and voice conversion methods, apparatus, and related equipment, the strength of the cycle generative adversarial network in fitting real data distributions is exploited to build a CycleGAN-based voice conversion model that supports unsupervised training: a large amount of unpaired data can be used to convert the source voice to the target timbre, with no need for large quantities of well-paired first and second timbre-domain voice data. Through timbre extraction and content decoupling, the sound of the source speaker is converted to the timbre of the target speaker while the speech content is kept unchanged. In addition, the application applies gradient reversal to the CycleGAN to optimize the generation capability of the generator: the discrimination between the target timbre and the source timbre is strengthened, which in reverse improves the generator's ability to produce the target timbre and intensifies the adversarial training between generator and discriminator. The cycle consistency of the CycleGAN makes the reconstruction in voice conversion more stable and the model training more accurate, so the resulting voice conversion model is more robust and the accuracy of voice conversion is further improved.
Drawings
FIG. 1 is a schematic flowchart of a training method for a CycleGAN-based voice conversion model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of a cycle generative adversarial network according to an embodiment of the present application;
FIG. 3 is a block diagram of a training apparatus for a CycleGAN-based voice conversion model according to an embodiment of the present application;
FIG. 4 is a block diagram of the internal structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present application. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
Fig. 1 is a flowchart of a training method for a voice conversion model based on a cycle generative adversarial network according to an embodiment of the present application. Referring to Fig. 1, the training method comprises the following steps S100-S700.
S100: acquire a first timbre-domain voice data set and a second timbre-domain voice data set as training samples.
Specifically, the first timbre-domain voice data set contains multiple pieces of first timbre-domain voice data, and the second timbre-domain voice data set contains multiple pieces of second timbre-domain voice data. Each piece of voice data is a spectrogram, which may specifically be obtained by applying a short-time Fourier transform (STFT) to the audio data. This embodiment does not require large quantities of paired first and second timbre-domain voice data. A piece of first timbre-domain voice data and a piece of second timbre-domain voice data, selected from their respective sets, form one set of training samples.
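As an illustration only, a minimal sketch of the spectrogram extraction described above, assuming librosa is used and assuming a 16 kHz sample rate, 1024-point FFT, and 256-sample hop (none of which the patent specifies):

```python
# Turn raw audio into the STFT magnitude spectrogram used as voice data.
import librosa
import numpy as np

def audio_to_spectrogram(path: str, n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Load audio and return its magnitude spectrogram via a short-time Fourier transform."""
    waveform, sr = librosa.load(path, sr=16000)  # resample to a fixed rate (assumed)
    stft = librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft)  # magnitude spectrum, shape (freq_bins, frames)
```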
S200: input the first timbre-domain voice data selected from the first timbre-domain voice data set and the second timbre-domain voice data selected from the second timbre-domain voice data set into the constructed cycle generative adversarial network for adversarial training, and obtain the discrimination result of a discriminator in the network.
Specifically, a cycle generative adversarial network (CycleGAN) comprises generators (generation models) and discriminators (discrimination models). The purpose of this application is to convert from a source timbre domain to a target timbre domain; accordingly, the CycleGAN learns the timbre distribution of the voice data in the data sets through adversarial training and accomplishes timbre migration from the source timbre domain to the target timbre domain. The CycleGAN must not only fit the timbre distribution of the target-domain voice data but also preserve the content characteristics of the source-domain voice data.
The generator captures or fits the distribution of the training data and generates data whose distribution resembles that of the real training data; the closer to the real training data, the better.
After voice data is input into the CycleGAN, feature extraction converts it into a hidden-vector representation of the original voice data.
In this embodiment, the generators of the CycleGAN fit the distribution of the voice data: they generate voice data imitating the second timbre domain from first timbre-domain voice data, or voice data imitating the first timbre domain from second timbre-domain voice data.
The discriminator learns the mapping between input voice data and output class labels, i.e., it estimates the probability that the generated voice data produced by the generator comes from the training data, judges whether that data is real or fake, and feeds the discrimination result back to the generator. If the generated voice data is estimated to come from the training data, the discriminator outputs a high probability; otherwise it outputs a low probability.
The goal of the generator is to fool the discriminator, while the goal of the discriminator is not to be fooled; the generator learns the mapping through adversarial training against the discriminator. The two networks are trained alternately until the data produced by the generator can pass for real and reaches a certain balance with the discrimination capability of the discriminator.
The discriminators of this embodiment judge whether generated voice data imitating the second timbre domain is second timbre-domain voice data, or conforms to its distribution, and output the corresponding probability; or they judge whether generated voice data imitating the first timbre domain is first timbre-domain voice data, or conforms to its distribution, and output the corresponding probability. The discrimination result is the probability that is output.
S300: if the generator in the cycle generative adversarial network needs to be optimized, determine whether to perform gradient reversal according to the discrimination result.
Specifically, training the CycleGAN means optimizing the model parameters (network parameters) of the generators and the discriminators. The generator's network parameters may be fixed while the discriminator's are optimized with a cross-entropy loss function, and the discriminator's fixed while the generator's are optimized with a cross-entropy loss function. It is of course also possible to optimize the network parameters of both the discriminator and the generator.
Whichever optimization scheme is used, if the generator currently needs to be optimized, i.e., the current training node is a generator-optimization node, whether to apply gradient reversal to the gradient corresponding to the generator must be decided from the discrimination result of the discriminator. That is, reversal is performed or not depending on whether the discriminator judges the generated voice data to come from the real voice data set.
If the model parameters of the generator do not currently need optimizing, there is no need to decide on, or to perform, gradient reversal.
After step S300, the process proceeds to step S400 or step S600.
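As a sketch only (the 0.5 threshold and the reversal coefficient below are assumptions, not values from the patent), this decision can be realized in PyTorch with a custom autograd function that leaves the forward pass untouched and negates the gradient in the backward pass:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.lambd, None

def maybe_reverse(features: torch.Tensor, disc_prob: float, threshold: float = 0.5) -> torch.Tensor:
    # Reverse the generator gradient only when the discriminator judged the
    # generated voice data as coming from the real data set (probability above threshold);
    # otherwise the gradient passes back unchanged.
    if disc_prob > threshold:
        return GradReverse.apply(features, 1.0)
    return features
```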
S400: if gradient reversal is to be performed according to the discrimination result, calculate the first loss function corresponding to the generator, and reverse the first gradient calculated from the first loss function.
Specifically, if the discrimination result indicates that the gradient of the generator should be reversed, the first loss function corresponding to the generator at the current training node is calculated, the corresponding first gradient is calculated from the first loss function, and the first gradient is reversed.
The first loss function corresponding to the generator is obtained by calculating the difference between the distribution of the generated voice data and the distribution of the real voice data.
S500: update the model parameters of the generator according to the reversed first gradient.
Specifically, the model parameters of the generator are updated according to the reversed first gradient, yielding a new pre-trained cycle generative adversarial network. After step S500, the process proceeds to step S700, and the new pre-trained network is trained again iteratively on the data sets.
S600: if gradient reversal is not to be performed according to the discrimination result, calculate the first loss function corresponding to the generator, calculate the first gradient from the first loss function, and update the model parameters of the generator according to the first gradient.
Specifically, if the discrimination result indicates that no reversal is to be performed, the first loss function corresponding to the generator at the current training node is calculated, the corresponding first gradient is calculated from it, and the model parameters of the generator are updated according to the unreversed first gradient, yielding a new pre-trained cycle generative adversarial network. After step S600, the process proceeds to step S700, and the new pre-trained network is trained again iteratively on the data sets.
This embodiment decides whether to reverse according to the discrimination result of the discriminator: the gradient is reversed when the discriminator judges the generated voice data to come from the training samples, and is passed back normally, unchanged, when the discriminator judges that it does not.
In this embodiment, gradients can be computed by the back-propagation algorithm (the difference between the predicted value and the true value is passed back layer by layer as the loss, and each layer computes its gradient from the loss passed to it), and the model parameters of the cycle generative adversarial network are updated by stochastic gradient descent using the computed gradients.
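A toy illustration of this back-propagation plus stochastic-gradient-descent update (the stand-in generator, learning rate, and placeholder loss below are assumptions for the sketch, not details from the patent):

```python
import torch
import torch.nn as nn

generator = nn.Linear(8, 8)                                   # stand-in for a CycleGAN generator
optimizer = torch.optim.SGD(generator.parameters(), lr=1e-4)  # stochastic gradient descent

fake = generator(torch.randn(4, 8))
loss = fake.abs().mean()     # placeholder for the generator's first loss function
optimizer.zero_grad()
loss.backward()              # gradients flow back layer by layer from the loss
optimizer.step()             # parameters move along the (possibly reversed) gradient
```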
S700: if the cycle generative adversarial network has not converged, return to step S200 until the network converges.
Specifically, the generator and the discriminator of the cycle generative adversarial network are trained alternately (their parameters are optimized alternately): while one side is fixed, the network parameters (model parameters) of the other side are updated, and the two alternate iteratively. In this process both the generator and the discriminator try to optimize their own networks, forming a competitive adversary, until the two reach a dynamic balance (Nash equilibrium).
If the network has not reached the convergence condition, the process jumps back to step S200 and the following steps, and the new pre-trained network is trained iteratively on the data sets until it converges. The convergence condition may be that the number of training rounds reaches a preset iteration count, or that the loss function of the network falls below a loss threshold.
In addition, the first timbre-domain voice data set may be divided into a first training set and a first test set, and the second timbre-domain voice data set into a second training set and a second test set. The cycle generative adversarial network is trained with the first and second training sets, and the trained network is verified with the first and second test sets.
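A simple sketch of such a division (the 90/10 ratio and fixed seed are assumptions; the patent does not specify a split):

```python
import random

def split_dataset(samples: list, train_ratio: float = 0.9, seed: int = 42):
    """Shuffle a timbre-domain data set and split it into (training set, test set)."""
    random.seed(seed)
    shuffled = samples[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```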
This embodiment exploits the strength of the cycle generative adversarial network in fitting real data distributions to build a CycleGAN-based voice conversion model that supports unsupervised training: the voice data of the two data sets need not correspond between source domain and target domain. In other words, a large amount of unpaired data is used to convert the source voice to the target timbre, and timbre migration is achieved without large quantities of paired first and second timbre-domain voice data.
This embodiment also uses the cycle generative adversarial network to constrain the generator to preserve the speech content characteristics of the source-domain voice. Through timbre extraction and content decoupling, the sound of the source speaker is converted to the timbre of the target speaker while the speech content is kept unchanged.
In addition, this embodiment applies a gradient reversal mechanism to the cycle generative adversarial network, strengthening the discrimination between the target timbre and the source timbre and thereby, in reverse, improving the generator's ability to produce the target timbre. The cycle consistency of the network makes the reconstruction in voice conversion more stable and the model training more accurate, further improving the accuracy of voice conversion.
Fig. 2 is a schematic structural diagram of the cycle generative adversarial network in an embodiment of the present application. Referring to Fig. 2, the network comprises a first generator, a second generator, a first discriminator, and a second discriminator;
step S200 specifically includes:
generating, by the first generator, first generated voice data imitating the second timbre domain from the first timbre-domain voice data, and reconstructing the first generated voice data by the second generator to obtain second generated voice data imitating the first timbre domain;
generating, by the second generator, third generated voice data imitating the first timbre domain from the second timbre-domain voice data, and reconstructing the third generated voice data by the first generator to obtain fourth generated voice data imitating the second timbre domain;
judging, by the second discriminator, whether the first generated voice data is second timbre-domain voice data, to obtain a first discrimination result;
and judging, by the first discriminator, whether the third generated voice data is first timbre-domain voice data, to obtain a second discrimination result.
Specifically, the cycle generative adversarial network (Cycle GAN model) is formed from two GAN models in dual form, each comprising a generator and an associated discriminator. The input of the first generator is voice data of the first timbre domain, or voice data imitating the first timbre domain; the first generator either generates voice data imitating the second timbre domain from first timbre-domain voice data, or reconstructs voice data imitating the first timbre domain into voice data imitating the second timbre domain.
The input of the second generator is voice data of the second timbre domain, or voice data imitating the second timbre domain; the second generator either generates voice data imitating the first timbre domain from second timbre-domain voice data, or reconstructs voice data imitating the second timbre domain into voice data imitating the first timbre domain.
The first discriminator judges whether generated voice data imitating the first timbre domain is first timbre-domain voice data or conforms to its distribution, i.e., it outputs the probability that such data is first timbre-domain voice data. More specifically, the first timbre-domain voice data and the third generated voice data are input to the first discriminator, which judges whether the third generated voice data is first timbre-domain voice data (the probability that it comes from the first timbre-domain voice data), yielding the second discrimination result.
The second discriminator judges whether generated voice data imitating the second timbre domain is second timbre-domain voice data or conforms to its distribution, i.e., it outputs the probability that such data is second timbre-domain voice data. More specifically, the second timbre-domain voice data and the first generated voice data are input to the second discriminator, which judges whether the first generated voice data is second timbre-domain voice data (the probability that it comes from the second timbre-domain voice data), yielding the first discrimination result.
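A minimal sketch of one training step's forward passes through this dual structure. G, F, D_A, and D_B are assumed to be nn.Module networks mapping spectrograms to spectrograms (generators) or to probabilities (discriminators); the names are illustrative, not from the patent:

```python
import torch.nn as nn

def forward_cycle(G: nn.Module, F: nn.Module, D_A: nn.Module, D_B: nn.Module, a, b):
    fake_b = G(a)       # first generated voice data: imitates the second timbre domain
    rec_a = F(fake_b)   # second generated voice data: reconstructs the first timbre domain
    fake_a = F(b)       # third generated voice data: imitates the first timbre domain
    rec_b = G(fake_a)   # fourth generated voice data: reconstructs the second timbre domain

    d1 = D_B(fake_b)    # first discrimination result: is fake_b real B-domain voice?
    d2 = D_A(fake_a)    # second discrimination result: is fake_a real A-domain voice?
    return fake_b, rec_a, fake_a, rec_b, d1, d2
```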
In addition, step S200 specifically includes:
inputting the first timbre-domain voice data into the first discriminator for discrimination to obtain a third discrimination result;
and inputting the second timbre-domain voice data into the second discriminator for discrimination to obtain a fourth discrimination result.
Specifically, the first discriminator also judges whether the first timbre-domain voice data is genuine first timbre-domain voice data, yielding the third discrimination result. This prompts the first discriminator to learn the voice distribution of the first timbre domain, which helps it better distinguish voice data imitating the first timbre domain from genuine first timbre-domain voice data and strengthens its real-versus-fake discrimination.
The second discriminator likewise judges whether the second timbre-domain voice data is genuine second timbre-domain voice data, yielding the fourth discrimination result. This prompts the second discriminator to learn the voice distribution of the second timbre domain, which helps it better distinguish voice data imitating the second timbre domain from genuine second timbre-domain voice data and strengthens its real-versus-fake discrimination.
The discriminator is generally a binary classifier: if it judges the generated voice data to come from the training data (the data set), it outputs a probability of 1, and otherwise 0. When the discriminator's discrimination ability is weak it may fail to tell real from fake, so that ability is built up through continual iterative training. When the generator's generation ability is weak, the voice data it produces is easily identified by the discriminator; through continual iterative training the generator comes to produce voice data realistic enough to deceive the discriminator. Through this game between generator and discriminator, the imitation voice data produced by the generator approaches the voice signal of the target domain ever more closely, while the discriminator's ability to judge the imitation voice data improves at the same time.
The first loss function corresponding to the generator is calculated from the first timbre-domain voice data, the second timbre-domain voice data, and generated voice data comprising at least one of the first, second, third, and fourth generated voice data.
In one embodiment, the cycle generative adversarial network further includes a first gradient reversal layer and a second gradient reversal layer;
steps S400 and S500 then specifically include:
if the first discrimination result is that the first generated voice data is second timbre-domain voice data, calculating a first sub-loss function corresponding to the first generator, calculating a first sub-gradient from the first sub-loss function, reversing the first sub-gradient through the first gradient reversal layer, and updating the model parameters of the first generator according to the reversed first sub-gradient, wherein the first sub-loss function is calculated from the first timbre-domain voice data, the second timbre-domain voice data, and first converted voice data comprising at least one of the first, second, third, and fourth generated voice data;
and if the second discrimination result is that the third generated voice data is first timbre-domain voice data, calculating a second sub-loss function corresponding to the second generator, calculating a second sub-gradient from the second sub-loss function, reversing the second sub-gradient through the second gradient reversal layer, and updating the model parameters of the second generator according to the reversed second sub-gradient, wherein the second sub-loss function is calculated from the first timbre-domain voice data, the second timbre-domain voice data, and second converted voice data comprising at least one of the first, second, third, and fourth generated voice data.
Specifically, the model parameters of the first and second generators may be updated at the same training node or at different ones, but their gradient reversals are not synchronized: the first generator decides whether to reverse according to the first discrimination result, and the second generator according to the second discrimination result. Since these are not necessarily the same result, it may be that only the first generator's gradient needs reversing, that only the second generator's does, that both do, or that neither does.
The first loss function comprises the first sub-loss function corresponding to the first generator and the second sub-loss function corresponding to the second generator. When reversal is required, the first generator updates its model parameters according to the reversed first sub-gradient, calculated from the first sub-loss function, and the second generator according to the reversed second sub-gradient, calculated from the second sub-loss function. The first gradient comprises the first sub-gradient and the second sub-gradient.
In addition, the first and second sub-loss functions measure the difference between the generated voice data and the source or target voice data: the larger the difference, the heavier the penalty imposed on the generator.
In one embodiment, the method further comprises:
before the cycle generative adversarial network converges, if a discriminator needs to be optimized, calculating a second loss function corresponding to the discriminator, calculating a second gradient corresponding to the discriminator from the second loss function, and updating the model parameters of the discriminator according to the second gradient.
Specifically, the second loss function comprises a third sub-loss function corresponding to the first discriminator and a fourth sub-loss function corresponding to the second discriminator; the second gradient comprises a third sub-gradient corresponding to the first discriminator and a fourth sub-gradient corresponding to the second discriminator.
The second loss function corresponding to a discriminator is obtained from the discriminator's classification (discrimination) results on the generated voice data.
The third sub-loss function corresponding to the first discriminator is calculated, the third sub-gradient is calculated from it, and the model parameters of the first discriminator are updated according to the third sub-gradient; the third sub-loss function is calculated from at least two of the first timbre-domain voice data, the second timbre-domain voice data, and the first, second, third, and fourth generated voice data.
The fourth sub-loss function corresponding to the second discriminator is calculated, the fourth sub-gradient is calculated from it, and the model parameters of the second discriminator are updated according to the fourth sub-gradient; the fourth sub-loss function is calculated from at least two of the first timbre-domain voice data, the second timbre-domain voice data, and the first, second, third, and fourth generated voice data.
Since the generator is not trained while the discriminator is being trained, neither the generator's gradient nor a decision on gradient reversal is needed at that point.
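For illustration, a hedged sketch of one discriminator update, using binary cross-entropy as a concrete instance of the cross-entropy loss the patent names (detach() keeps generator gradients out of this step, and no gradient reversal is applied here):

```python
import torch
import torch.nn.functional as F_nn

def discriminator_loss(D, real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy loss for one discriminator on real vs. generated voice data."""
    real_prob = D(real)           # should approach 1 for genuine voice data
    fake_prob = D(fake.detach())  # should approach 0; the generator is frozen here
    return (F_nn.binary_cross_entropy(real_prob, torch.ones_like(real_prob)) +
            F_nn.binary_cross_entropy(fake_prob, torch.zeros_like(fake_prob)))
```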
In one embodiment, the method further comprises:
calculating the total loss function of the cycle generative adversarial network, the total loss function being calculated from a cycle consistency loss function and an adversarial loss function, or from the cycle consistency loss function, the adversarial loss function, and an identity-mapping loss function;
and if the total loss function is smaller than a preset convergence value, judging that the cycle generative adversarial network has converged.
Specifically, in the cycle generative adversarial network, the closer the distribution of the voice data produced by the generator is to that of the real voice data, the better, and the stronger the discriminator's ability to tell real voice from fake, the better. The optimization of the network is a minimax game problem.
The total loss function of the cycle generative adversarial network (Cycle GAN) is specifically:

L = L_1 + λ_1 L_2 + λ_2 L_3,  or  L = L_1 + λ_1 L_2

where L is the total loss function, L_1 the cycle consistency loss function, L_2 the adversarial loss function, and L_3 the identity-mapping loss function; λ_1 and λ_2 are preset hyperparameters that weight the cycle consistency loss, the adversarial loss, and the identity-mapping loss against one another.
The cycle consistency loss (Cycle Consistency Loss) is the difference, or distance, between real voice data and the corresponding reconstructed generated voice data. In this application it is the distance between the first timbre-domain voice data and the second generated voice data, plus the distance between the second timbre-domain voice data and the fourth generated voice data.
L_1 is calculated as shown in formula (1):

L_1 = L_cyc(G, F) = E_{A~Pdata(A)}[ ||F(G(A)) - A||_1 ] + E_{B~Pdata(B)}[ ||G(F(B)) - B||_1 ]    (1)

The mapping function G corresponding to the first generator maps A -> B; G(A) denotes generating voice data imitating the second timbre domain from input first timbre-domain voice data, and F(G(A)) denotes reconstructing that data back into voice data imitating the first timbre domain.
The mapping function F corresponding to the second generator maps B -> A; F(B) denotes generating voice data imitating the first timbre domain from input second timbre-domain voice data, and G(F(B)) denotes reconstructing that data back into voice data imitating the second timbre domain.
Here A is real first timbre-domain voice data and B is real second timbre-domain voice data; E_{A~Pdata(A)} denotes the expectation over the A-domain (first timbre domain) voice data distribution, and E_{B~Pdata(B)} the expectation over the B-domain (second timbre domain) voice data distribution.
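Formula (1) translates directly into code. A sketch, assuming the voice data are tensors and using batch means in place of the expectations:

```python
import torch

def cycle_consistency_loss(a: torch.Tensor, rec_a: torch.Tensor,
                           b: torch.Tensor, rec_b: torch.Tensor) -> torch.Tensor:
    """L_cyc: L1 distance between each real spectrogram and its round-trip reconstruction.

    rec_a = F(G(a)) and rec_b = G(F(b)) from the forward cycle.
    """
    return (rec_a - a).abs().mean() + (rec_b - b).abs().mean()
```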
The adversarial loss function (GAN loss) is determined by the discrimination results of the discriminators and is given by formulas (2) to (4):

L_2 = L_GAN(G, D_B, A, B) + L_GAN(F, D_A, A, B)    (2)

L_GAN(G, D_B, A, B) = E_{B~Pdata(B)}[ log D_B(B) ] + E_{A~Pdata(A)}[ log(1 - D_B(G(A))) ]    (3)

L_GAN(F, D_A, A, B) = E_{A~Pdata(A)}[ log D_A(A) ] + E_{B~Pdata(B)}[ log(1 - D_A(F(B))) ]    (4)
The identity-mapping loss function is given by formula (5):

L_3 = L_id(G, F) = E_{B~Pdata(B)}[ ||G(B) - B||_1 ] + E_{A~Pdata(A)}[ ||F(A) - A||_1 ]    (5)

Here D_B(B) is the probability, judged by the second discriminator, that real second timbre-domain voice data comes from the second timbre-domain voice data set (the probability of it being real voice data), and D_B(G(A)) is the probability, judged by the second discriminator, that the first generated voice data G(A) produced by the first generator G comes from the second timbre-domain voice data set.
Likewise, D_A(A) is the probability, judged by the first discriminator, that real first timbre-domain voice data comes from the first timbre-domain voice data set (the probability of it being real voice data), and D_A(F(B)) is the probability, judged by the first discriminator, that the third generated voice data F(B) produced by the second generator F comes from the first timbre-domain voice data set.
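Formulas (1)-(5) combine into the total loss as sketched below. The λ values and the eps guard against log(0) are assumptions, not values given in the patent; the adversarial term is transcribed literally from formulas (2)-(4), with the discriminator ascending and the generator descending it during the alternating training:

```python
import torch

def total_loss(a, b, fake_a, fake_b, rec_a, rec_b, D_A, D_B, G, F,
               lambda1: float = 1.0, lambda2: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    l1 = (rec_a - a).abs().mean() + (rec_b - b).abs().mean()          # cycle consistency, (1)
    l2 = (torch.log(D_B(b) + eps).mean() + torch.log(1 - D_B(fake_b) + eps).mean() +
          torch.log(D_A(a) + eps).mean() + torch.log(1 - D_A(fake_a) + eps).mean())  # (2)-(4)
    l3 = (G(b) - b).abs().mean() + (F(a) - a).abs().mean()            # identity mapping, (5)
    return l1 + lambda1 * l2 + lambda2 * l3                           # L = L1 + λ1·L2 + λ2·L3
```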
During training of the cycle generative adversarial network, the total loss function L is reduced continually to optimize the model parameters in the network structure until the model converges; the losses calculated above are all passed backward during gradient back-propagation.
The network is trained by stochastic gradient descent: the model parameters of the discriminator are updated while those of the generator are fixed, and the model parameters of the generator are updated while those of the discriminator are fixed. Preferably, the discriminator is trained first and the generator afterwards.
The first sub-loss function of the first generator is the total loss function L. Gradient reversal may reverse the gradient calculated from the whole of L, or may reverse only the gradient calculated from the L_GAN(G, D_B, A, B) term while leaving the gradients calculated from the other terms of L unreversed.
Similarly, the second sub-loss function of the second generator is the total loss function L, and gradient reversal may reverse the gradient calculated from the whole of L, or only the gradient calculated from the L_GAN(F, D_A, A, B) term, leaving the gradients calculated from the other terms of L unreversed.
In one embodiment, reversing the first gradient comprises: taking the reciprocal of the first gradient, or multiplying the first gradient by a preset negative number.
Specifically, if the first gradient is g, for example, the reversed first gradient obtained by taking the reciprocal is 1/g.
The preset negative number may be -1; its value can be set according to the actual situation, which is not limited in this application.
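The two reversal variants named above, shown on a toy gradient tensor (the choice of -1 as the preset negative number is the assumption stated in the text):

```python
import torch

g = torch.tensor([0.5, 2.0, -1.5])   # a raw gradient
reversed_reciprocal = 1.0 / g        # variant 1: take the reciprocal, 1/g
reversed_negated = -1.0 * g          # variant 2: multiply by a preset negative number
```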
In one embodiment, the generator comprises a first convolution layer, a max-pooling layer, a second convolution layer, and an LSTM layer arranged in sequence, and the discriminator comprises a third convolution layer, a flattening layer, a fully connected layer, and a normalization layer arranged in sequence;
the LSTM layer is connected to the third convolution layer of the corresponding discriminator through the corresponding gradient reversal layer.
Specifically, each generator comprises at least two first convolution layers arranged in sequence, whose kernels are larger than that of the second convolution layer. Each discriminator comprises at least two third convolution layers arranged in sequence, whose size may equal that of the second convolution layer; the flattening layer reduces the dimensionality of the third convolution layers' output, and the normalization layer outputs a probability value, specifically using a softmax function.
The LSTM layer of the first generator is connected to the third convolution layer of the second discriminator through the first gradient reversal layer, and the LSTM layer of the second generator is connected to the third convolution layer of the first discriminator through the second gradient reversal layer.
The gradient reversal layer (Gradient Reversal Layer) does not affect the parameter updates of the discriminator; during backward propagation it flips the gradient from positive to negative or from negative to positive (or, alternatively, takes its reciprocal), so that the parameters of the generator, in order to minimize the loss, are instead optimized according to the reversed gradient.
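A hedged PyTorch sketch of the layer ordering just described. Channel counts, kernel sizes, and the hidden dimension are assumptions (the patent only fixes the relative kernel sizes); forward passes, including reshaping the convolutional features into LSTM time steps, are omitted for brevity:

```python
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, in_ch: int = 1, hidden: int = 128):
        super().__init__()
        self.conv1a = nn.Conv2d(in_ch, 32, kernel_size=5, padding=2)  # first convolution layers
        self.conv1b = nn.Conv2d(32, 32, kernel_size=5, padding=2)     # (at least two, larger kernels)
        self.pool = nn.MaxPool2d(2)                                   # max-pooling layer
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)      # second convolution layer
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)

class Discriminator(nn.Module):
    def __init__(self, in_ch: int = 1, feat: int = 64 * 16 * 16):     # feat is an assumed size
        super().__init__()
        self.conv3a = nn.Conv2d(in_ch, 64, kernel_size=3, padding=1)  # third convolution layers
        self.conv3b = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.flatten = nn.Flatten()                                   # flattening layer
        self.fc = nn.Linear(feat, 2)                                  # fully connected layer
        self.out = nn.Softmax(dim=-1)                                 # normalization layer
```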
The application also provides a voice conversion method based on the cycle generative adversarial network, the voice conversion method comprising:
performing timbre conversion on input first timbre-domain voice to be converted by using a trained CycleGAN-based voice conversion model, to obtain corresponding target second timbre-domain voice data, wherein the trained model is obtained by any of the training methods for the CycleGAN-based voice conversion model described above.
Because the cycle consistency loss of the cycle generative adversarial network is used, model training for voice conversion can be unsupervised, and the reconstruction by the encoder and decoder during training is more stable. Gradient reversal is introduced to optimize the generation capability of the generator: the adversarial training between generator and discriminator is intensified and the discrimination capability of the discriminator is strengthened, which in reverse improves the generator's ability to produce the target timbre, making the trained voice conversion model more robust and its conversion capability stronger.
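A usage sketch of this conversion step. The checkpoint name, spectrogram file, and tensor shapes are hypothetical, and the patent does not specify how the converted spectrogram is rendered back to audio:

```python
import torch
import numpy as np

G = torch.load("generator_ab.pt")     # hypothetical checkpoint of the trained first generator
G.eval()
with torch.no_grad():
    spec_a = np.load("source_spectrogram.npy")        # STFT magnitude of the source voice
    spec_b = G(torch.from_numpy(spec_a).unsqueeze(0)) # converted: imitates the target timbre
# An inverse STFT (e.g. Griffin-Lim) or a neural vocoder would then render spec_b as audio.
```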
It should be understood that the step numbers in the foregoing embodiments do not imply an execution order; the execution order of each process is determined by its function and internal logic, and does not limit the implementation of the embodiments of the present application in any way.
The embodiments of the present application may acquire and process the relevant data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system of using a digital computer, or a machine controlled by one, to simulate, extend, and expand human intelligence, to perceive the environment, to acquire knowledge, and to use that knowledge to obtain the best results.
AI infrastructure generally includes sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing technology, operation/interaction systems, and mechatronics. AI software technology mainly comprises computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Fig. 3 is a block diagram of a training apparatus for a CycleGAN-based voice conversion model according to an embodiment of the present application. Referring to Fig. 3, the training apparatus comprises:
a sample acquisition module 100, configured to acquire a first timbre-domain voice data set and a second timbre-domain voice data set as training samples;
a training module 200, configured to input first timbre-domain voice data selected from the first timbre-domain voice data set and second timbre-domain voice data selected from the second timbre-domain voice data set into the constructed cycle generative adversarial network for adversarial training, and to obtain the discrimination result of a discriminator in the network;
a judging module 300, configured to determine, if the generator in the network needs to be optimized, whether to perform gradient reversal according to the discrimination result;
a reversal module 400, configured to calculate a first loss function corresponding to the generator if gradient reversal is to be performed according to the discrimination result, and to reverse the first gradient calculated from the first loss function;
a first parameter updating module 500, configured to update the model parameters of the generator according to the reversed first gradient;
a second parameter updating module 600, configured to calculate the first loss function corresponding to the generator if gradient reversal is not to be performed according to the discrimination result, calculate the first gradient from it, and update the model parameters of the generator according to the first gradient;
and an iteration module 700, configured to jump back to the training module 200 if the cycle generative adversarial network has not converged, until the network converges.
In one embodiment, the cycle generative adversarial network includes a first generator, a second generator, a first discriminator and a second discriminator;
the training module 200 specifically includes:
a first generation module, configured to generate, through the first generator, first generated voice data imitating the second timbre from the first timbre-domain voice data;
a second generation module, configured to reconstruct the first generated voice data through the second generator to obtain second generated voice data imitating the first timbre;
the second generation module is further configured to generate, through the second generator, third generated voice data imitating the first timbre from the second timbre-domain voice data;
the first generation module is further configured to reconstruct the third generated voice data through the first generator to obtain fourth generated voice data imitating the second timbre;
a first discrimination module, configured to judge, through the second discriminator, whether the first generated voice data is second timbre-domain voice data, to obtain a first discrimination result;
and a second discrimination module, configured to judge, through the first discriminator, whether the third generated voice data is first timbre-domain voice data, to obtain a second discrimination result.
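Under this module division, one pass through the network yields four generated signals and two discrimination results. A minimal sketch, assuming PyTorch modules and the hypothetical names G1, G2, D1, D2 for the two generators and two discriminators:

```python
def cycle_forward(G1, G2, D1, D2, real_a, real_b):
    fake_b = G1(real_a)    # first generated data, imitating the second timbre
    rec_a = G2(fake_b)     # second generated data, reconstructing the first timbre
    fake_a = G2(real_b)    # third generated data, imitating the first timbre
    rec_b = G1(fake_a)     # fourth generated data, reconstructing the second timbre
    result_1 = D2(fake_b)  # first discrimination result
    result_2 = D1(fake_a)  # second discrimination result
    return fake_b, rec_a, fake_a, rec_b, result_1, result_2
```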
In one embodiment, the cycle generative adversarial network further includes a first gradient reversal layer and a second gradient reversal layer;
the reversal module 400 includes:
a first reversal module, configured to, when the first discrimination result is that the first generated voice data is second timbre-domain voice data, calculate a first sub-loss function corresponding to the first generator, calculate a first sub-gradient corresponding to the first generator according to the first sub-loss function, and reverse the first sub-gradient through the first gradient reversal layer, where the first sub-loss function is calculated from the first timbre-domain voice data, the second timbre-domain voice data and first converted voice data, and the first converted voice data includes at least one of the first generated voice data, the second generated voice data, the third generated voice data and the fourth generated voice data;
and a second reversal module, configured to, when the second discrimination result is that the third generated voice data is first timbre-domain voice data, calculate a second sub-loss function corresponding to the second generator, calculate a second sub-gradient corresponding to the second generator according to the second sub-loss function, and reverse the second sub-gradient through the second gradient reversal layer, where the second sub-loss function is calculated from the first timbre-domain voice data, the second timbre-domain voice data and second converted voice data, and the second converted voice data includes at least one of the first generated voice data, the second generated voice data, the third generated voice data and the fourth generated voice data.
The first parameter updating module 500 specifically includes:
a first sub-updating module, configured to update the model parameters of the first generator according to the reversed first sub-gradient;
and a second sub-updating module, configured to update the model parameters of the second generator according to the reversed second sub-gradient.
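A gradient reversal layer of the kind used here is commonly implemented as an identity map whose backward pass multiplies the incoming gradient by a negative constant. A minimal PyTorch sketch; the scale of -1.0 is an assumption:

```python
import torch

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass; the backward pass multiplies the
    # incoming gradient by a preset negative number.
    @staticmethod
    def forward(ctx, x, scale=-1.0):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return ctx.scale * grad_output, None  # no gradient w.r.t. scale

# Usage: y = GradientReversal.apply(x, -1.0); any sub-gradient flowing
# back through y arrives reversed, so the subsequent parameter update
# uses the reversed sub-gradient as described above.
```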
In one embodiment, the apparatus further includes:
a third parameter updating module, configured to, before the cycle generative adversarial network converges, calculate a second loss function corresponding to the discriminator when the discriminator needs to be optimized, calculate a second gradient corresponding to the discriminator according to the second loss function, and update the model parameters of the discriminator according to the second gradient.
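For contrast with the generator updates, the discriminator is updated without any reversal. A sketch assuming PyTorch and a least-squares adversarial loss; both the loss form and the function names are assumptions:

```python
import torch
import torch.nn.functional as F

def update_discriminator(D, optimizer_d, real, fake):
    pred_real = D(real)
    pred_fake = D(fake.detach())  # block gradients into the generator
    # second loss function: push real samples toward 1, generated toward 0
    loss_d = (F.mse_loss(pred_real, torch.ones_like(pred_real))
              + F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))
    optimizer_d.zero_grad()
    loss_d.backward()             # second gradient, applied as-is
    optimizer_d.step()            # update the discriminator parameters
    return loss_d.item()
```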
In one embodiment, the apparatus further includes:
a total loss calculation module, configured to calculate a total loss function of the cycle generative adversarial network, where the total loss function is calculated from a cycle-consistency loss function and an adversarial loss function, or from the cycle-consistency loss function, the adversarial loss function and an identity-mapping loss function;
and a convergence judging module, configured to judge that the cycle generative adversarial network has converged when the total loss function is smaller than a preset convergence value.
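The total loss described here is a weighted combination of the named terms. A sketch in which the weights and the convergence threshold are assumptions:

```python
def total_loss(adv_loss, cycle_loss, identity_loss=None,
               lambda_cyc=10.0, lambda_id=5.0):
    # adversarial + cycle-consistency terms, with an optional
    # identity-mapping term when that variant is used
    loss = adv_loss + lambda_cyc * cycle_loss
    if identity_loss is not None:
        loss = loss + lambda_id * identity_loss
    return loss

# convergence judging module: training stops once the total loss
# falls below a preset convergence value, e.g. 0.01
```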
In one embodiment, the reversal module 400 includes:
a reversal unit, configured to take the reciprocal of the first gradient, or to multiply the first gradient by a preset negative number.
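Both variants, applied to a raw gradient tensor; `eps`, which guards the reciprocal against division by zero, is an assumption:

```python
import torch

def reverse_gradient(grad: torch.Tensor, mode: str = "scale",
                     factor: float = -1.0, eps: float = 1e-8) -> torch.Tensor:
    if mode == "reciprocal":
        return 1.0 / (grad + eps)  # take the reciprocal of the gradient
    return factor * grad           # multiply by a preset negative number
```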
In one embodiment, the generator includes a first convolutional layer, a max-pooling layer, a second convolutional layer and an LSTM layer arranged in sequence, and the discriminator includes a third convolutional layer, a flatten layer, a fully connected layer and a normalization layer arranged in sequence;
the LSTM layer is connected to the third convolutional layer of the corresponding discriminator through the corresponding gradient reversal layer.
The terms "first" and "second" in the above modules/units are only used to distinguish different modules/units, and do not indicate priority or any other limiting meaning. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, so that a process, method, system, article or apparatus that comprises a list of steps or modules is not necessarily limited to the steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article or apparatus. The division into modules presented herein is merely a logical division and may be implemented differently in practice.
For the specific definition of the training apparatus for the speech conversion model based on the cycle generative adversarial network, reference may be made to the definition of the corresponding training method above, which is not repeated here. Each module in the apparatus may be implemented wholly or partly in software, in hardware, or in a combination of the two. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.
Fig. 4 is a block diagram of the internal structure of a computer device according to an embodiment of the present application. As shown in Fig. 4, the computer device includes a processor, a memory, a network interface, an input device and a display screen connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory includes a storage medium and an internal memory; the storage medium may be non-volatile or volatile. The storage medium stores an operating system and may also store computer-readable instructions which, when executed by the processor, cause the processor to implement the training method for a speech conversion model based on a cycle generative adversarial network or the speech conversion method based on a cycle generative adversarial network. The internal memory provides an environment for running the operating system and the computer-readable instructions in the storage medium, and may likewise store computer-readable instructions which, when executed by the processor, cause the processor to perform either method. The network interface of the computer device is used to communicate with an external server through a network connection. The display screen of the computer device may be a liquid-crystal or electronic-ink display screen, and the input device may be a touch layer covering the display screen, a key, trackball or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions (for example, a computer program) stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the training method for a speech conversion model based on a cycle generative adversarial network, or of the speech conversion method based on a cycle generative adversarial network, in the above embodiments are implemented, such as steps S100 to S700 shown in Fig. 1 and the extensions of the method and of the related steps. Alternatively, when the processor executes the computer-readable instructions, the functions of the modules/units of the training apparatus or of the speech conversion apparatus in the above embodiments are implemented, such as the functions of modules 100 to 700 shown in Fig. 3. To avoid repetition, details are not repeated here.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the parts of the whole device through various interfaces and lines.
The memory may be used to store the computer-readable instructions and/or modules; the processor implements the various functions of the computer device by running or executing the instructions and/or modules stored in the memory and by invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the device (such as audio data or video data).
The memory may be integrated in the processor or provided separately from the processor.
Those skilled in the art will appreciate that the structure shown in Fig. 4 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When executed by a processor, the computer-readable instructions implement the steps of the training method for a speech conversion model based on a cycle generative adversarial network, or of the speech conversion method based on a cycle generative adversarial network, in the above embodiments, such as steps S100 to S700 shown in Fig. 1 and the extensions of the method and of the related steps. Alternatively, when executed by the processor, the computer-readable instructions implement the functions of the modules/units of the training apparatus or of the speech conversion apparatus in the above embodiments, such as the functions of modules 100 to 700 shown in Fig. 3. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art will understand that all or part of the processes of the methods of the above embodiments may be implemented by computer-readable instructions instructing the relevant hardware; the instructions may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to the memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article or method that comprises that element.
The above serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present application may be embodied, in whole or in part, in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disk) as described above, including instructions for enabling a terminal device (such as a mobile phone, computer, server or network device) to execute the methods of the embodiments of the present application.
The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application; all equivalent structural and process modifications made using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of the present application.

Claims (11)

1. A training method for a speech conversion model based on a cycle generative adversarial network, comprising:
acquiring a first timbre-domain voice data set and a second timbre-domain voice data set as training samples;
inputting first timbre-domain voice data selected from the first timbre-domain voice data set and second timbre-domain voice data selected from the second timbre-domain voice data set into a constructed cycle generative adversarial network to perform cycle adversarial training on the cycle generative adversarial network, and acquiring a discrimination result of a discriminator in the cycle generative adversarial network;
if a generator in the cycle generative adversarial network needs to be optimized, determining, according to the discrimination result, whether to perform gradient reversal;
if it is determined according to the discrimination result that gradient reversal is to be performed, calculating a first loss function corresponding to the generator, reversing the first gradient calculated according to the first loss function, and updating the model parameters of the generator according to the reversed first gradient;
if it is determined according to the discrimination result that gradient reversal is not to be performed, calculating the first loss function corresponding to the generator, calculating the first gradient according to the first loss function, and updating the model parameters of the generator according to the first gradient;
and if the cycle generative adversarial network has not converged, returning to the step of inputting the first timbre-domain voice data selected from the first timbre-domain voice data set and the second timbre-domain voice data selected from the second timbre-domain voice data set into the constructed cycle generative adversarial network to perform cycle adversarial training on the cycle generative adversarial network, until the cycle generative adversarial network converges.
2. The method of claim 1, wherein the cycle generative adversarial network comprises a first generator, a second generator, a first discriminator and a second discriminator;
the inputting the first timbre-domain voice data selected from the first timbre-domain voice data set and the second timbre-domain voice data selected from the second timbre-domain voice data set into the constructed cycle generative adversarial network to perform cycle adversarial training on the cycle generative adversarial network, and acquiring a discrimination result of a discriminator in the cycle generative adversarial network, comprises:
generating, through the first generator, first generated voice data imitating the second timbre from the first timbre-domain voice data, and reconstructing the first generated voice data through the second generator to obtain second generated voice data imitating the first timbre;
generating, through the second generator, third generated voice data imitating the first timbre from the second timbre-domain voice data, and reconstructing the third generated voice data through the first generator to obtain fourth generated voice data imitating the second timbre;
judging, through the second discriminator, whether the first generated voice data is the second timbre-domain voice data, to obtain a first discrimination result;
and judging, through the first discriminator, whether the third generated voice data is the first timbre-domain voice data, to obtain a second discrimination result.
3. The method of claim 2, wherein the cycle generative adversarial network further comprises a first gradient reversal layer and a second gradient reversal layer;
the calculating, if it is determined according to the discrimination result that gradient reversal is to be performed, a first loss function corresponding to the generator, reversing the first gradient calculated according to the first loss function, and updating the model parameters of the generator according to the reversed first gradient, comprises:
if the first discrimination result is that the first generated voice data is the second timbre-domain voice data, calculating a first sub-loss function corresponding to the first generator, calculating a first sub-gradient corresponding to the first generator according to the first sub-loss function, reversing the first sub-gradient through the first gradient reversal layer, and updating the model parameters of the first generator according to the reversed first sub-gradient, wherein the first sub-loss function is calculated according to the first timbre-domain voice data, the second timbre-domain voice data and first converted voice data, and the first converted voice data comprises at least one of the first generated voice data, the second generated voice data, the third generated voice data and the fourth generated voice data;
if the second discrimination result is that the third generated voice data is the first timbre-domain voice data, calculating a second sub-loss function corresponding to the second generator, calculating a second sub-gradient corresponding to the second generator according to the second sub-loss function, reversing the second sub-gradient through the second gradient reversal layer, and updating the model parameters of the second generator according to the reversed second sub-gradient, wherein the second sub-loss function is calculated according to the first timbre-domain voice data, the second timbre-domain voice data and second converted voice data, and the second converted voice data comprises at least one of the first generated voice data, the second generated voice data, the third generated voice data and the fourth generated voice data.
4. The method of claim 3, further comprising:
before the cycle generative adversarial network converges, if the discriminator needs to be optimized, calculating a second loss function corresponding to the discriminator, calculating a second gradient corresponding to the discriminator according to the second loss function, and updating the model parameters of the discriminator according to the second gradient.
5. The method of claim 1, further comprising:
calculating a total loss function of the cycle generative adversarial network, wherein the total loss function is calculated according to a cycle-consistency loss function and an adversarial loss function, or according to the cycle-consistency loss function, the adversarial loss function and an identity-mapping loss function;
and if the total loss function is smaller than a preset convergence value, judging that the cycle generative adversarial network has converged.
6. The method of claim 1, wherein the reversing the first gradient comprises: taking the reciprocal of the first gradient, or multiplying the first gradient by a preset negative number.
7. The method of claim 1, wherein the generator comprises a first convolutional layer, a max-pooling layer, a second convolutional layer and an LSTM layer arranged in sequence, and the discriminator comprises a third convolutional layer, a flatten layer, a fully connected layer and a normalization layer arranged in sequence;
the LSTM layer is connected to the third convolutional layer of the corresponding discriminator through the corresponding gradient reversal layer.
8. A speech conversion method based on a cycle generative adversarial network, comprising:
performing timbre conversion on input first timbre-domain voice to be converted by using a trained speech conversion model based on a cycle generative adversarial network, to obtain corresponding target second timbre-domain voice data, wherein the trained speech conversion model is obtained by the training method for a speech conversion model based on a cycle generative adversarial network according to any one of claims 1 to 7.
9. A training apparatus for a speech conversion model based on a cycle generative adversarial network, the apparatus comprising:
a sample obtaining module, configured to obtain a first timbre-domain voice data set and a second timbre-domain voice data set as training samples;
a training module, configured to input first timbre-domain voice data selected from the first timbre-domain voice data set and second timbre-domain voice data selected from the second timbre-domain voice data set into the constructed cycle generative adversarial network, so as to perform cycle adversarial training on the cycle generative adversarial network and obtain a discrimination result of a discriminator in the cycle generative adversarial network;
a judging module, configured to determine, according to the discrimination result, whether to perform gradient reversal if a generator in the cycle generative adversarial network needs to be optimized;
a reversal module, configured to calculate a first loss function corresponding to the generator and reverse the first gradient calculated according to the first loss function if it is determined according to the discrimination result that gradient reversal is to be performed;
a first parameter updating module, configured to update the model parameters of the generator according to the reversed first gradient;
a second parameter updating module, configured to calculate the first loss function corresponding to the generator, calculate the first gradient according to the first loss function, and update the model parameters of the generator according to the first gradient if it is determined according to the discrimination result that gradient reversal is not to be performed;
and an iteration module, configured to jump back to the training module if the cycle generative adversarial network has not converged, until the cycle generative adversarial network converges.
10. A computer device, comprising a memory, a processor and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the steps of the training method for a speech conversion model based on a cycle generative adversarial network according to any one of claims 1 to 7, or implements the steps of the speech conversion method based on a cycle generative adversarial network according to claim 8.
11. A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to perform the steps of the training method for a speech conversion model based on a cycle generative adversarial network according to any one of claims 1 to 7, or to perform the steps of the speech conversion method based on a cycle generative adversarial network according to claim 8.
CN202210517643.8A 2022-05-13 2022-05-13 Training of voice conversion model, voice conversion method, device and related equipment Pending CN114882897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210517643.8A CN114882897A (en) 2022-05-13 2022-05-13 Training of voice conversion model, voice conversion method, device and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210517643.8A CN114882897A (en) 2022-05-13 2022-05-13 Training of voice conversion model, voice conversion method, device and related equipment

Publications (1)

Publication Number Publication Date
CN114882897A (en) 2022-08-09

Family

ID=82675107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210517643.8A Pending CN114882897A (en) 2022-05-13 2022-05-13 Training of voice conversion model, voice conversion method, device and related equipment

Country Status (1)

Country Link
CN (1) CN114882897A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620748A (en) * 2022-12-06 2023-01-17 北京远鉴信息技术有限公司 Comprehensive training method and device for speech synthesis and false discrimination evaluation
CN116206622A (en) * 2023-05-06 2023-06-02 北京边锋信息技术有限公司 Training and dialect conversion method and device for generating countermeasure network and electronic equipment
CN116206622B (en) * 2023-05-06 2023-09-08 北京边锋信息技术有限公司 Training and dialect conversion method and device for generating countermeasure network and electronic equipment

Similar Documents

Publication Publication Date Title
CN114882897A (en) Training of voice conversion model, voice conversion method, device and related equipment
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN112466298A (en) Voice detection method and device, electronic equipment and storage medium
CN112837669B (en) Speech synthesis method, device and server
CN112767910A (en) Audio information synthesis method and device, computer readable medium and electronic equipment
US20230230571A1 (en) Audio processing method and apparatus based on artificial intelligence, device, storage medium, and computer program product
CN111444379B (en) Audio feature vector generation method and audio fragment representation model training method
CN113763979A (en) Audio noise reduction and audio noise reduction model processing method, device, equipment and medium
CN112750462A (en) Audio processing method, device and equipment
CN113962965A (en) Image quality evaluation method, device, equipment and storage medium
CN110930996A (en) Model training method, voice recognition method, device, storage medium and equipment
CN113822953A (en) Processing method of image generator, image generation method and device
CN116737895A (en) Data processing method and related equipment
CN116959465A (en) Voice conversion model training method, voice conversion method, device and medium
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN115171666A (en) Speech conversion model training method, speech conversion method, apparatus and medium
CN117935323A (en) Training method of face driving model, video generation method and device
CN108986804A (en) Man-machine dialogue system method, apparatus, user terminal, processing server and system
CN116090536A (en) Neural network optimization method, device, computer equipment and storage medium
CN113990347A (en) Signal processing method, computer equipment and storage medium
CN116741149B (en) Cross-language voice conversion method, training method and related device
CN116741154A (en) Data selection method and device, electronic equipment and storage medium
CN114333846A (en) Speaker identification method, device, electronic equipment and storage medium
CN117011403A (en) Method and device for generating image data, training method and electronic equipment
CN115273807A (en) Ambient sound generation method, ambient sound generation device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination