CN112863529A - Speaker voice conversion method based on adversarial learning and related equipment - Google Patents

Speaker voice conversion method based on adversarial learning and related equipment

Info

Publication number
CN112863529A
Authority
CN
China
Prior art keywords
target
content
loss function
attribute
encoder
Prior art date
Legal status
Granted
Application number
CN202011632876.XA
Other languages
Chinese (zh)
Other versions
CN112863529B (en)
Inventor
梁爽
缪陈峰
马骏
王少军
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011632876.XA (CN112863529B)
Priority to PCT/CN2021/096887 (WO2022142115A1)
Publication of CN112863529A
Application granted
Publication of CN112863529B
Active legal status
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013: Adapting to target pitch
    • G10L 2021/0135: Voice conversion or morphing
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18: the extracted parameters being spectral information of each sub-band
    • G10L 25/24: the extracted parameters being the cepstrum
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of data processing, and provides a speaker voice conversion method and device based on adversarial learning, a computer device and a storage medium. The method comprises the following steps: preprocessing training data to obtain MFCC features and fundamental frequency features; inputting the MFCC features and fundamental frequency features into an initial speaker voice conversion model for training; invoking an adversarial algorithm to train a content encoder and a content discriminator until a Nash equilibrium state is reached; acquiring a total loss function of the domain discriminator and detecting whether the total loss function converges; when the detection result is that the total loss function converges, determining a target speaker voice conversion model; acquiring audio to be converted and target audio, calling the content encoder to process the audio to be converted to obtain a target content code, and calling an attribute encoder to process the target audio to obtain a target attribute code; and inputting the target content code and the target attribute code into the generator to obtain the converted speaker voice. The invention can improve the efficiency and quality of speaker voice conversion.

Description

Speaker voice conversion method based on adversarial learning and related equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a speaker voice conversion method and device based on adversarial learning, computer equipment and a storage medium.
Background
With the development of speech technology, people increasingly wish to express themselves with a preferred timbre, and speaker voice conversion technology is therefore attracting more and more interest. Speaker voice conversion preserves the text-related content information in the original audio while replacing the timbre of the original audio with the timbre of another designated speaker.
Regarding the speaker voice conversion process, in the course of implementing the present invention the inventors found that the prior art has at least the following problems: there are many different methods for speaker voice conversion, for example Gaussian mixture models, deep neural networks and other models, but most of these models require parallel corpora, that is, different speakers in the training data set need to speak the same sentences with pronunciation, prosody and the like kept as consistent as possible. Such data are very difficult to collect, so speaker conversion is inefficient and its quality cannot be guaranteed.
Therefore, it is desirable to provide a method for converting a speaker's voice, which can improve the efficiency and quality of the conversion of the speaker's voice.
Disclosure of Invention
In view of the foregoing, there is a need for a speaker voice conversion method, apparatus, computer device and storage medium based on adversarial learning that can improve the efficiency and quality of speaker voice conversion.
A first aspect of the present invention provides a speaker voice conversion method based on adversarial learning, the method comprising:
acquiring and preprocessing training data to obtain MFCC characteristics and fundamental frequency characteristics, wherein the training data comprises audio corpora of a plurality of speakers;
inputting the MFCC features and the fundamental frequency features to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator;
invoking an adversarial algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached;
acquiring a total loss function of the domain discriminator, and detecting whether the total loss function is converged;
when the detection result is that the total loss function is converged, determining a target speaker voice conversion model;
acquiring audio to be converted and target audio, calling the content encoder to process the audio to be converted to obtain target content codes, and calling the attribute encoder to process the target audio to obtain target attribute codes;
and inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
Further, in the above speaker voice conversion method based on adversarial learning according to an embodiment of the present invention, the preprocessing the training data to obtain the MFCC features and the fundamental frequency features includes:
calling a WORLD vocoder to extract an initial MFCC feature and an initial fundamental frequency feature of the training data;
determining a target fixed length;
and intercepting the initial MFCC features and the initial fundamental frequency features according to the target fixed length to obtain target MFCC features and target fundamental frequency features.
Further, in the above speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention, the training the content encoder and the content discriminator by invoking an adversarial algorithm until reaching a Nash equilibrium state includes:
acquiring an initial cross entropy loss function corresponding to the content encoder and the content discriminator;
calling stochastic gradient descent and backpropagation algorithms to optimize the cross-entropy loss function to obtain a target cross-entropy loss function;
detecting whether the target cross entropy loss function is converged;
when the detection result is that the target cross entropy loss function is converged, the content encoder and the content discriminator reach a Nash equilibrium state;
and when the detection result is that the target cross entropy loss function is not converged, the content encoder and the content discriminator do not reach a Nash equilibrium state.
Further, in the above speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention, the obtaining the total loss function of the domain discriminator includes:
acquiring a target sub-loss function of the domain discriminator, wherein the target sub-loss function comprises a target cross entropy loss function, a target consistency loss function, a target domain discriminator loss function, a target reconstruction loss function, a target KL loss function and a target attribute loss function;
determining a preset weight value of each target sub-loss function;
and weighting and processing the preset weight value and the target sub-loss function to obtain a total loss function of the domain discriminator.
Further, in the above speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention, after the determining the target speaker voice conversion model, the method further includes:
calling the attribute encoder to extract the tone information in the training data to obtain an attribute encoding set;
and normalizing the attribute coding set to obtain normal distribution corresponding to the attribute coding set.
Further, in the above speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention, before the invoking the content encoder to process the audio to be converted to obtain the target content code and invoking the attribute encoder to process the target audio to obtain the target attribute code, the method further includes:
acquiring first source speech of the target content code;
acquiring second source speech of the target attribute code;
detecting whether the first source speech and the second source speech are the same;
and when the detection result is that the first source speech is different from the second source speech, inputting the target content code and the target attribute code to the generator.
Further, in the above speaker voice conversion method based on adversarial learning according to an embodiment of the present invention, the inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice includes:
convolution processing the target content code to obtain a first convolution code;
performing convolution processing on the target attribute code to obtain a second convolution code;
splicing the first convolutional code and the second convolutional code to obtain a target convolutional code;
and inputting the target convolutional code into the generator to obtain the converted speaker voice.
The second aspect of the embodiments of the present invention also provides a speaker voice conversion apparatus based on adversarial learning, the apparatus including:
the preprocessing module is used for acquiring and preprocessing training data to obtain MFCC characteristics and fundamental frequency characteristics, wherein the training data comprises audio corpora of a plurality of speakers;
the model training module is used for inputting the MFCC characteristics and the fundamental frequency characteristics to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator;
the adversarial invoking module is used for invoking an adversarial algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached;
a convergence detection module, configured to obtain a total loss function of the domain discriminator and detect whether the total loss function converges;
the model determining module is used for determining a target speaker voice conversion model when the detection result is that the total loss function is converged;
the coding processing module is used for acquiring audio to be converted and target audio, calling the content coder to process the audio to be converted to obtain target content coding, and calling the attribute coder to process the target audio to obtain target attribute coding;
and the voice conversion module is used for inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
A third aspect of the present invention provides a computer device comprising a processor configured to implement the adversarial-learning-based speaker voice conversion method when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the adversarial-learning-based speaker voice conversion method.
In summary, the speaker voice conversion method, apparatus, computer device and storage medium based on adversarial learning realize speaker voice conversion on many-to-many non-parallel corpora by means of adversarial learning; the method does not require parallel corpora as training data, which greatly reduces the difficulty of data collection. In addition, the invention introduces a content encoder and an attribute encoder, and decomposes the non-timbre information and timbre information of the training data by using the content encoder and a content discriminator, thereby synthesizing audio of higher quality.
Drawings
Fig. 1 is a flowchart of a speaker voice conversion method based on adversarial learning according to an embodiment of the present invention.
Fig. 2 is a block diagram of a speaker voice conversion apparatus based on adversarial learning according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention is executed by a computer device; correspondingly, the speaker voice conversion apparatus based on adversarial learning runs in the computer device.
Fig. 1 is a flowchart of a speaker voice conversion method based on adversarial learning according to an embodiment of the present invention. The method specifically includes the following steps; according to different requirements, the order of the steps in the flowchart can be changed and some steps can be omitted.
S11, training data are collected and preprocessed to obtain MFCC characteristics and fundamental frequency characteristics, and the training data comprise audio corpora of a plurality of speakers.
In at least one embodiment of the present application, the training data may be non-parallel text data, including audio corpora of several speakers, and the audio corpora of several speakers do not need to have the same text. Illustratively, 30 speakers are selected, each speaker respectively records 400 sentences of audio corpus of different texts, and the audio corpus recorded by the 30 speakers is used as training data.
Optionally, the preprocessing the training data to obtain MFCC features and fundamental frequency features includes:
calling a WORLD vocoder to extract an initial MFCC feature and an initial fundamental frequency feature of the training data;
determining a target fixed length;
and intercepting the initial MFCC features and the initial fundamental frequency features according to the target fixed length to obtain target MFCC features and target fundamental frequency features.
The MFCC is a classic audio feature and is often applied to the fields of speech recognition, audio data classification and the like. The MFCC feature vector includes basic features of 12 to 16 dimensions, one-dimensional energy features, and first-order difference and second-order difference features of the basic features and the energy features, so the MFCC feature vector may have 39 dimensions, 42 dimensions, 45 dimensions, 48 dimensions, and 51 dimensions. In general, when the MFCC feature vector extraction is performed on audio data, a 39-dimensional MFCC feature vector is preferably used. The data processing process for preprocessing the training data to obtain the MFCC features and the fundamental frequency features is the prior art and is not described herein again. The target fixed length is preset and is used for ensuring that training data input to the initial speaker conversion model is of a fixed dimension length, and the target fixed length can be set according to actual requirements without limitation.
In one embodiment, the audio data collected in different domains may itself contain noise. Therefore, before calling the WORLD vocoder to extract the initial MFCC features of the training data, the method further comprises: cleaning the training data, denoising the audio and unifying the audio sampling rate.
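The patent does not provide code for this preprocessing step; the following is a minimal, non-authoritative sketch assuming the pyworld and librosa libraries, a 16 kHz sampling rate, 39-dimensional MFCCs and a fixed length of 128 frames (all illustrative choices, not values prescribed by the patent).

```python
import numpy as np
import librosa
import pyworld as pw

def preprocess(wav_path, sr=16000, n_mfcc=39, fixed_len=128):
    """Extract MFCC and fundamental-frequency (F0) features, then cut them to a fixed length."""
    # Load while unifying the sampling rate (part of the cleaning step).
    audio, _ = librosa.load(wav_path, sr=sr)

    # Fundamental frequency via the WORLD vocoder (DIO estimate + StoneMask refinement).
    audio64 = audio.astype(np.float64)
    f0, t = pw.dio(audio64, sr)
    f0 = pw.stonemask(audio64, f0, t, sr)

    # 39-dimensional MFCC features; librosa is used here as a stand-in extractor,
    # and frame alignment between the two extractors is glossed over in this sketch.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)

    # Truncate both feature streams to the preset target fixed length.
    return mfcc[:, :fixed_len], f0[:fixed_len]
```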
The invention only needs the audio corpus as the training data, does not need to carry out text labeling processing on the audio corpus, and improves the efficiency of the speaker voice conversion processing.
And S12, inputting the MFCC features and the fundamental frequency features to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator.
In at least one embodiment of the present invention, the MFCC features and the fundamental frequency features are input to an initial speaker speech conversion model for training, where the initial speaker speech conversion model includes a content encoder, an attribute encoder, a content discriminator, a generator, and a domain discriminator.
The content encoder is configured to extract the non-timbre information in the training data to obtain a content code; that is, the content encoder receives the MFCC features and the fundamental frequency features. In one embodiment, the content encoder includes a plurality of CNN layers, each CNN layer being followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 7; with 7 CNN layers the content encoder extracts content codes most effectively. When the number of CNN layers is less than 7, the loss value of the content encoder becomes high; when the number of CNN layers is greater than 7, the content encoder becomes large and each operation takes longer.
The attribute encoder is used to extract the timbre information in the training data to obtain an attribute code, and its input is also the MFCC features and the fundamental frequency features. In one embodiment, the attribute encoder includes a plurality of CNN layers, the activation function is a ReLU function, and each CNN layer is followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 7; with 7 CNN layers the attribute encoder extracts attribute codes most effectively. When the number of CNN layers is less than 7, the loss value of the attribute encoder becomes high; when the number of CNN layers is greater than 7, the attribute encoder becomes large and each operation takes longer.
The content discriminator is used to receive a content code and predict the speaker probability corresponding to that content code; the higher the speaker probability, the more likely the content code was produced by that speaker, and the lower the speaker probability, the less likely it was. In one embodiment, the content discriminator includes several CNN layers, the activation function is a Leaky ReLU function, and each CNN layer is followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 5; with 5 CNN layers the content discriminator predicts the speaker probability corresponding to a content code most effectively. When the number of CNN layers is less than 5, the loss value of the content discriminator becomes high; when the number of CNN layers is greater than 5, the content discriminator becomes large and each operation takes longer.
The generator is used to receive any content code and attribute code and perform speaker voice conversion based on them. In one embodiment, the generator includes a number of CNN layers and a number of deconvolution layers. Illustratively, the number of CNN layers is 5 and the number of deconvolution layers is 5; with 5 CNN layers the generator performs speaker voice conversion most effectively. When the number of CNN layers is less than 5, the loss value of the generator becomes high; when the number of CNN layers is greater than 5, the generator becomes large and each operation takes longer.
The domain discriminator is used to obtain the total loss function of the model. In one embodiment, the domain discriminator includes a plurality of CNN layers and a plurality of average pooling layers. Illustratively, the number of CNN layers is 6 and the number of average pooling layers is 6; with 6 CNN layers the domain discriminator obtains the total loss function of the model most effectively. When the number of CNN layers is less than 6, the loss value of the domain discriminator becomes high; when the number of CNN layers is greater than 6, the domain discriminator becomes large and each operation takes longer.
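As a rough illustration of the encoder building block described above (a sketch only; channel widths, kernel size and code dimension are assumptions not specified by the patent), a PyTorch version might look like this; the attribute encoder and content discriminator follow the same pattern with their own depths and activations.

```python
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Seven CNN layers, each followed by batch normalization, ReLU and Dropout,
    mapping (MFCC + F0) frames to a content code."""
    def __init__(self, in_dim=40, hidden=256, code_dim=128, n_layers=7, dropout=0.1):
        super().__init__()
        layers, ch = [], in_dim
        for i in range(n_layers):
            out_ch = code_dim if i == n_layers - 1 else hidden
            layers += [nn.Conv1d(ch, out_ch, kernel_size=5, padding=2),
                       nn.BatchNorm1d(out_ch),
                       nn.ReLU(),
                       nn.Dropout(dropout)]
            ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):       # x: (batch, in_dim, frames)
        return self.net(x)      # content code: (batch, code_dim, frames)
```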
And S13, invoking a countermeasure algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached.
In at least one embodiment of the present invention, during the training of the initial speaker voice conversion model, in order to separate the non-timbre information in the training data from the original audio, the content encoder and the content discriminator are trained in an adversarial learning manner: the content discriminator tries to distinguish which speaker a received content code belongs to, while the content encoder tries to make its output indistinguishable to the content discriminator, and finally the content encoder and the content discriminator reach a Nash equilibrium state. When the Nash equilibrium state is reached, the content discriminator cannot judge which speaker a content code belongs to, which means the content code contains no timbre information at all.
Optionally, the invoking the adversarial algorithm to train the content encoder and the content discriminator until the Nash equilibrium state is reached comprises:
acquiring an initial cross entropy loss function corresponding to the content encoder and the content discriminator;
calling stochastic gradient descent and backpropagation algorithms to optimize the cross-entropy loss function to obtain a target cross-entropy loss function;
detecting whether the target cross entropy loss function is converged;
when the detection result is that the target cross entropy loss function is converged, the content encoder and the content discriminator reach a Nash equilibrium state;
and when the detection result is that the target cross entropy loss function is not converged, the content encoder and the content discriminator do not reach a Nash equilibrium state.
The initial cross-entropy loss function (formula 1) is given in the original as a formula image and is not reproduced here; it is defined in terms of the content encoder E_c, the content discriminator D_c and sampled audio x_i, with D_c predicting which speaker a content code E_c(x_i) belongs to.
For the initial cross-entropy loss function, the content encoder wants to maximize the function while the content discriminator wants to minimize it. The cross-entropy loss function is optimized through stochastic gradient descent and backpropagation until it converges, at which point the content encoder and the content discriminator reach a Nash equilibrium state; otherwise, the content encoder and the content discriminator have not reached a Nash equilibrium state.
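A hedged sketch of one alternating update of this min-max game follows (the per-utterance pooling inside the discriminator, the optimizers and the speaker-label format are assumptions for illustration, not details stated by the patent).

```python
import torch
import torch.nn.functional as F

def adversarial_step(content_encoder, content_discriminator,
                     enc_opt, disc_opt, features, speaker_ids):
    """One alternating update: the content discriminator minimizes the speaker
    cross-entropy on content codes, the content encoder maximizes it."""
    # Discriminator step: learn to tell which speaker each content code came from.
    with torch.no_grad():
        codes = content_encoder(features)
    d_loss = F.cross_entropy(content_discriminator(codes), speaker_ids)
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # Encoder step: make the content codes speaker-indistinguishable
    # (maximize the discriminator's cross-entropy).
    e_loss = -F.cross_entropy(content_discriminator(content_encoder(features)), speaker_ids)
    enc_opt.zero_grad()
    e_loss.backward()
    enc_opt.step()
    return d_loss.item(), -e_loss.item()
```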
Through model training, the invention makes the content encoder and the content discriminator reach a Nash equilibrium state. At this point the output of the content encoder no longer contains any speaker information, and the content discriminator cannot judge which speaker the content information comes from. This decouples content from attributes, avoids the interference of redundant information during speaker conversion, reduces model noise and improves the quality of speaker voice conversion.
S14, acquiring the total loss function of the domain discriminator and detecting whether the total loss function converges; if the total loss function converges, step S15 is executed.
In at least one embodiment of the present invention, the total loss function of the domain discriminator includes a plurality of sections, and the total loss function can be obtained by weighting and processing the sub-loss functions of the respective sections. Determining whether the model is trained by detecting whether the total loss function is converged, wherein it can be understood that when the detection result is that the total loss function is converged, the model is trained; and when the detection result is that the total loss function is not converged, determining that the model is not trained, and continuing iterative training until the total loss function is converged.
Optionally, the obtaining the total loss function of the domain discriminator includes:
acquiring a target sub-loss function of the domain discriminator, wherein the target sub-loss function comprises a target cross entropy loss function, a target consistency loss function, a target domain discriminator loss function, a target reconstruction loss function, a target KL loss function and a target attribute loss function;
determining a preset weight value of each target sub-loss function;
and weighting and processing the preset weight value and the target sub-loss function to obtain a total loss function of the domain discriminator.
The target cross entropy loss function is as described in the above formula 1, and is not described herein again.
In one embodiment, (x, y) is first defined as any two samples, i.e., two MFCC features and two fundamental frequency features, together with their corresponding speaker labels. Passing x and y through the content encoder E_c and the attribute encoder E_a yields their content codes E_c(x), E_c(y) and attribute codes E_a(x), E_a(y). Cross-feeding the content codes and attribute codes into the generator gives u = G(E_c(y), E_a(x)) and v = G(E_c(x), E_a(y)); u and v are then cross-fed into the generator in the same way to obtain the reconstructions x̂ and ŷ. It is easy to see that the content code of x̂ comes from v and its attribute code comes from u, while the content code of v comes from x and the attribute code of u comes from x; therefore both the content code and the attribute code of x̂ come from x, so x̂ should be exactly the same as x. In the same way, ŷ should be identical to y. Based on this consistency principle, a target consistency loss function is obtained (given in the original as a formula image and not reproduced here), where G is the generator, E_c is the content encoder, E_a is the attribute encoder, and x_i and y_i are sampled audio.
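The cross-reconstruction logic above could be computed as in the following sketch (a non-authoritative illustration; the L1 distance and the module interfaces are assumptions, since the patent's formula is only given as an image).

```python
import torch

def consistency_loss(G, E_c, E_a, x, y):
    """Cross-translate twice; after the second crossed pass each sample should recover
    its own content and timbre, hence match the original."""
    u = G(E_c(y), E_a(x))        # content of y, timbre of x
    v = G(E_c(x), E_a(y))        # content of x, timbre of y
    x_hat = G(E_c(v), E_a(u))    # both codes now originate from x
    y_hat = G(E_c(u), E_a(v))    # both codes now originate from y
    return torch.mean(torch.abs(x_hat - x)) + torch.mean(torch.abs(y_hat - y))
```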
In one embodiment, the target domain discriminator loss function (given in the original as a formula image and not reproduced here) is defined in terms of the domain discriminator D_domain and the generator G.
In one embodiment, the target reconstruction loss function (given in the original as a formula image and not reproduced here) is defined in terms of the generator G, the content encoder E_c, the attribute encoder E_a and sampled audio x_i; it penalizes the difference between a sample x_i and the generator output obtained from that sample's own content code and attribute code.
In one embodiment, the attribute code is constrained to follow the standard normal distribution N(0,1); the target KL loss function is therefore the KL divergence between the attribute code distribution function and the N(0,1) distribution function.
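If the attribute encoder is assumed to output the mean and log-variance of a diagonal Gaussian (an assumption; the patent does not state the parameterization), the KL term against N(0,1) takes the usual closed form:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over code dimensions and averaged over the batch."""
    return (0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)).sum(dim=-1).mean()
```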
In one embodiment, a code z is randomly sampled from the N(0,1) normal distribution and used as the attribute code, the content code of any sample x is selected as the content code, and the two are fed into the generator together to obtain a generated sample. Feeding this generated sample into the attribute encoder should still recover z, so the target attribute loss function (given in the original as a formula image and not reproduced here) is defined in terms of the generator G, the content encoder E_c, the attribute encoder E_a and the sampled attribute code z_i.
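A sketch of this attribute (latent-regression style) loss under the same interface assumptions; the squared error is an illustrative choice, not the patent's stated metric.

```python
import torch

def attribute_loss(G, E_c, E_a, x, z):
    """z is sampled from N(0,1) and used as the attribute code; re-encoding the generated
    audio with the attribute encoder should recover z."""
    generated = G(E_c(x), z)
    return torch.mean((E_a(generated) - z) ** 2)
```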
After all the target sub-loss functions are determined, a preset weight value of each target sub-loss function is determined, and the preset weight values and the target sub-loss functions are combined by weighted summation to obtain the total loss function of the domain discriminator. The preset weight values are adjusted according to experimental results and are not limited herein.
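For illustration, the weighted summation might be implemented as below; the weight values shown are placeholders, not values disclosed by the patent.

```python
# Hypothetical weights: the patent only says they are preset and tuned experimentally.
weights = {"cross_entropy": 1.0, "consistency": 10.0, "domain": 1.0,
           "reconstruction": 10.0, "kl": 0.1, "attribute": 1.0}

def total_loss(sub_losses):
    """Weighted summation of the target sub-loss functions (scalar tensors keyed by name)."""
    return sum(weights[name] * value for name, value in sub_losses.items())
```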
And S15, determining the voice conversion model of the target speaker.
In at least one embodiment of the present invention, when the detection result is that the total loss function is converged, each model parameter value of the current speaker voice conversion model is determined, and the target speaker voice conversion model is determined based on each model parameter value, and then the target speaker voice conversion model can be called to perform speaker voice conversion.
Optionally, after the target speaker voice conversion model is determined, in order to synthesize multi-modal output and increase the amount of available timbre information, the method further includes:
calling the attribute encoder to extract the tone information in the training data to obtain an attribute encoding set;
and normalizing the attribute coding set to obtain normal distribution corresponding to the attribute coding set.
The data processing process of normalizing the attribute encoding set to obtain the corresponding normal distribution N(0,1) is the prior art and is not described again here. By fitting the attribute encoding set to a normal distribution, many pieces of simulated timbre information can be obtained, which increases the amount of timbre information, provides more choices for speaker voice conversion and improves the flexibility of speaker voice conversion processing.
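One possible reading of this step is sketched below (fitting a per-dimension Gaussian over the attribute codes and sampling new codes as simulated timbres; this is an assumption about the procedure, not the patent's prescribed implementation).

```python
import torch

def fit_and_sample_attribute_codes(E_a, training_features, n_samples=10):
    """Encode the training utterances, normalize the attribute codes toward N(0,1)
    per dimension, then sample new codes that act as simulated timbres."""
    with torch.no_grad():
        codes = torch.stack([E_a(feat) for feat in training_features])  # (N, code_dim)
    mean, std = codes.mean(dim=0), codes.std(dim=0)
    normalized = (codes - mean) / (std + 1e-8)             # approximately N(0,1) per dimension
    new_timbres = torch.randn(n_samples, codes.shape[-1])  # sampled attribute codes
    return normalized, new_timbres
```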
S16, obtaining the audio to be converted and the target audio, calling the content encoder to process the audio to be converted to obtain the target content code, and calling the attribute encoder to process the target audio to obtain the target attribute code.
In at least one embodiment of the present invention, the audio to be converted refers to audio that needs to undergo speaker voice conversion processing, and the target audio refers to audio that contains the timbre information of a target speaker. The target audio may be an audio corpus from the training data, any attribute code sampled from the normal distribution, or the timbre information of an unseen speaker; an unseen speaker can be selected by the user and belongs neither to the training data nor to the normal-distribution data. The desired speaker voice can be obtained by combining the timbre information in the target audio with the content code of the audio to be converted.
Optionally, before the invoking the content encoder to process the audio to be converted to obtain a target content encoding, and the invoking the attribute encoder to process the target audio to obtain a target attribute encoding, the method further includes:
acquiring first source speech of the target content code;
acquiring second source speech of the target attribute code;
detecting whether the first source speech and the second source speech are the same;
and when the detection result is that the first source speech is different from the second source speech, inputting the target content code and the target attribute code to the generator.
The first source speech refers to the speaker information corresponding to the target content code, and the second source speech refers to the speaker information corresponding to the target attribute code. Whether the first source speech and the second source speech are the same is detected; when the detection result is that they are the same, speaker voice conversion processing does not need to be executed; when the detection result is that they are different, speaker voice conversion processing can be executed, and the target content code and the target attribute code are input to the generator.
In at least one embodiment of the present invention, the attribute encoder is called to process the target audio to obtain the target attribute code. Because the attribute encoder is used directly to extract speaker information, voice conversion for unseen speakers can be realized, achieving zero-shot conversion and improving the flexibility of speaker voice conversion scenarios.
And S17, inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
In at least one embodiment of the present invention, the target content code and the target attribute code are input to the generator for conversion processing, so as to obtain the speaker voice.
Optionally, the inputting the target content code and the target attribute code into the generator, and the obtaining the converted speaker voice includes:
convolution processing the target content code to obtain a first convolution code;
performing convolution processing on the target attribute code to obtain a second convolution code;
splicing the first convolutional code and the second convolutional code to obtain a target convolutional code;
and inputting the target convolutional code into the generator to obtain the converted speaker voice.
And performing 1 × 1 convolution operation on the target content code to obtain a first convolution code, and performing 1 × 1 convolution operation on the target attribute code to obtain a second convolution code.
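An illustrative sketch of this conversion head (the 1×1 convolutions and channel-wise splicing follow the description above; the channel sizes and the time-axis handling of the attribute code are assumptions).

```python
import torch
import torch.nn as nn

class ConversionHead(nn.Module):
    """Project the content code and the attribute code with 1x1 convolutions,
    splice them along the channel axis and hand the result to the generator."""
    def __init__(self, content_dim=128, attr_dim=128, proj_dim=128):
        super().__init__()
        self.content_proj = nn.Conv1d(content_dim, proj_dim, kernel_size=1)
        self.attr_proj = nn.Conv1d(attr_dim, proj_dim, kernel_size=1)

    def forward(self, generator, content_code, attr_code):
        # Both codes are assumed shaped (batch, channels, frames); a per-utterance
        # attribute vector would first be repeated along the time axis.
        first = self.content_proj(content_code)     # first convolutional code
        second = self.attr_proj(attr_code)          # second convolutional code
        target = torch.cat([first, second], dim=1)  # spliced target convolutional code
        return generator(target)                    # converted speaker speech features
```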
By adopting the above method, speaker voice conversion on many-to-many non-parallel corpora is realized by means of adversarial learning; the method does not require parallel corpora as training data, so the difficulty of data collection can be greatly reduced. In addition, the invention introduces a content encoder and an attribute encoder, and decomposes the non-timbre information and timbre information of the training data by using the content encoder and a content discriminator, thereby synthesizing audio of higher quality.
It is emphasized that the training data may be stored in the nodes of the blockchain in order to further ensure privacy and security of the training data.
Fig. 2 is a block diagram of a speaker voice conversion apparatus based on adversarial learning according to a second embodiment of the present invention.
In some embodiments, the adversarial-learning-based speaker voice conversion apparatus 20 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the adversarial-learning-based speaker voice conversion apparatus 20 may be stored in a memory of a computer device and executed by at least one processor to perform the functions of adversarial-learning-based speaker voice conversion (see the detailed description of Fig. 1).
In this embodiment, the adversarial-learning-based speaker voice conversion apparatus 20 can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a preprocessing module 201, a model training module 202, an adversarial invoking module 203, a convergence detection module 204, a model determination module 205, an encoding processing module 206, and a speech conversion module 207. A module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function, and is stored in the memory. In this embodiment, the functions of the modules will be described in detail in the following embodiments.
The preprocessing module 201 is configured to acquire and preprocess training data to obtain MFCC features and fundamental frequency features, where the training data includes audio corpora of a plurality of speakers.
In at least one embodiment of the present application, the training data may be non-parallel text data, including audio corpora of several speakers, and the audio corpora of several speakers do not need to have the same text. Illustratively, 30 speakers are selected, each speaker respectively records 400 sentences of audio corpus of different texts, and the audio corpus recorded by the 30 speakers is used as training data.
Optionally, the preprocessing the training data to obtain MFCC features and fundamental frequency features includes:
calling a WORLD vocoder to extract an initial MFCC feature and an initial fundamental frequency feature of the training data;
determining a target fixed length;
and intercepting the initial MFCC features and the initial fundamental frequency features according to the target fixed length to obtain target MFCC features and target fundamental frequency features.
The MFCC is a classic audio feature and is often applied to the fields of speech recognition, audio data classification and the like. The MFCC feature vector includes basic features of 12 to 16 dimensions, one-dimensional energy features, and first-order difference and second-order difference features of the basic features and the energy features, so the MFCC feature vector may have 39 dimensions, 42 dimensions, 45 dimensions, 48 dimensions, and 51 dimensions. In general, when the MFCC feature vector extraction is performed on audio data, a 39-dimensional MFCC feature vector is preferably used. The data processing process for preprocessing the training data to obtain the MFCC features and the fundamental frequency features is the prior art and is not described herein again. The target fixed length is preset and is used for ensuring that training data input to the initial speaker voice conversion model is of a fixed dimension length, and the target fixed length can be set according to actual requirements without limitation.
In one embodiment, the audio data collected in different domains may itself contain noise. Therefore, before calling the WORLD vocoder to extract the initial MFCC features of the training data, the preprocessing module 201 is further used for: cleaning the training data, denoising the audio and unifying the audio sampling rate.
The invention only needs the audio corpus as the training data, does not need to carry out text labeling processing on the audio corpus, and improves the efficiency of the speaker voice conversion processing.
The model training module 202 is configured to input the MFCC features and the fundamental frequency features to an initial speaker speech conversion model for training, where the initial speaker speech conversion model includes a content encoder, an attribute encoder, a content discriminator, a generator, and a domain discriminator.
In at least one embodiment of the present invention, the MFCC features and the fundamental frequency features are input to an initial speaker speech conversion model for training, where the initial speaker speech conversion model includes a content encoder, an attribute encoder, a content discriminator, a generator, and a domain discriminator.
The content encoder is configured to extract the non-timbre information in the training data to obtain a content code; that is, the content encoder receives the MFCC features and the fundamental frequency features. In one embodiment, the content encoder includes a plurality of CNN layers, each CNN layer being followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 7; with 7 CNN layers the content encoder extracts content codes most effectively. When the number of CNN layers is less than 7, the loss value of the content encoder becomes high; when the number of CNN layers is greater than 7, the content encoder becomes large and each operation takes longer.
The attribute encoder is used to extract the timbre information in the training data to obtain an attribute code, and its input is also the MFCC features and the fundamental frequency features. In one embodiment, the attribute encoder includes a plurality of CNN layers, the activation function is a ReLU function, and each CNN layer is followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 7; with 7 CNN layers the attribute encoder extracts attribute codes most effectively. When the number of CNN layers is less than 7, the loss value of the attribute encoder becomes high; when the number of CNN layers is greater than 7, the attribute encoder becomes large and each operation takes longer.
The content discriminator is used to receive a content code and predict the speaker probability corresponding to that content code; the higher the speaker probability, the more likely the content code was produced by that speaker, and the lower the speaker probability, the less likely it was. In one embodiment, the content discriminator includes several CNN layers, the activation function is a Leaky ReLU function, and each CNN layer is followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 5; with 5 CNN layers the content discriminator predicts the speaker probability corresponding to a content code most effectively. When the number of CNN layers is less than 5, the loss value of the content discriminator becomes high; when the number of CNN layers is greater than 5, the content discriminator becomes large and each operation takes longer.
The generator is used to receive any content code and attribute code and perform speaker voice conversion based on them. In one embodiment, the generator includes a number of CNN layers and a number of deconvolution layers. Illustratively, the number of CNN layers is 5 and the number of deconvolution layers is 5; with 5 CNN layers the generator performs speaker voice conversion most effectively. When the number of CNN layers is less than 5, the loss value of the generator becomes high; when the number of CNN layers is greater than 5, the generator becomes large and each operation takes longer.
The domain discriminator is used to obtain the total loss function of the model. In one embodiment, the domain discriminator includes a plurality of CNN layers and a plurality of average pooling layers. Illustratively, the number of CNN layers is 6 and the number of average pooling layers is 6; with 6 CNN layers the domain discriminator obtains the total loss function of the model most effectively. When the number of CNN layers is less than 6, the loss value of the domain discriminator becomes high; when the number of CNN layers is greater than 6, the domain discriminator becomes large and each operation takes longer.
The adversarial invoking module 203 is used for invoking the adversarial algorithm to train the content encoder and the content discriminator until the Nash equilibrium state is reached.
In at least one embodiment of the present invention, during the training of the initial speaker voice conversion model, in order to separate the non-timbre information in the training data from the original audio, the content encoder and the content discriminator are trained in an adversarial learning manner: the content discriminator tries to distinguish which speaker a received content code belongs to, while the content encoder tries to make its output indistinguishable to the content discriminator, and finally the content encoder and the content discriminator reach a Nash equilibrium state. When the Nash equilibrium state is reached, the content discriminator cannot judge which speaker a content code belongs to, which means the content code contains no timbre information at all.
Optionally, the invoking the adversarial algorithm to train the content encoder and the content discriminator until the Nash equilibrium state is reached comprises:
acquiring an initial cross entropy loss function corresponding to the content encoder and the content discriminator;
calling stochastic gradient descent and backpropagation algorithms to optimize the cross-entropy loss function to obtain a target cross-entropy loss function;
detecting whether the target cross entropy loss function is converged;
when the detection result is that the target cross entropy loss function is converged, the content encoder and the content discriminator reach a Nash equilibrium state;
and when the detection result is that the target cross entropy loss function is not converged, the content encoder and the content discriminator do not reach a Nash equilibrium state.
The initial cross-entropy loss function (formula 1) is given in the original as a formula image and is not reproduced here; it is defined in terms of the content encoder E_c, the content discriminator D_c and sampled audio x_i, with D_c predicting which speaker a content code E_c(x_i) belongs to.
For the initial cross-entropy loss function, the content encoder wants to maximize the function while the content discriminator wants to minimize it. The cross-entropy loss function is optimized through stochastic gradient descent and backpropagation until it converges, at which point the content encoder and the content discriminator reach a Nash equilibrium state; otherwise, the content encoder and the content discriminator have not reached a Nash equilibrium state.
Through model training, the invention makes the content encoder and the content discriminator reach a Nash equilibrium state. At this point the output of the content encoder no longer contains any speaker information, and the content discriminator cannot judge which speaker the content information comes from. This decouples content from attributes, avoids the interference of redundant information during speaker conversion, reduces model noise and improves the quality of speaker voice conversion.
The convergence detection module 204 is configured to obtain a total loss function of the domain discriminator, and detect whether the total loss function converges.
In at least one embodiment of the present invention, the total loss function of the domain discriminator includes a plurality of sections, and the total loss function can be obtained by weighting and processing the sub-loss functions of the respective sections. Determining whether the model is trained by detecting whether the total loss function is converged, wherein it can be understood that when the detection result is that the total loss function is converged, the model is trained; and when the detection result is that the total loss function is not converged, determining that the model is not trained, and continuing iterative training until the total loss function is converged.
Optionally, the obtaining the total loss function of the domain discriminator includes:
acquiring a target sub-loss function of the domain discriminator, wherein the target sub-loss function comprises a target cross entropy loss function, a target consistency loss function, a target domain discriminator loss function, a target reconstruction loss function, a target KL loss function and a target attribute loss function;
determining a preset weight value of each target sub-loss function;
and weighting and processing the preset weight value and the target sub-loss function to obtain a total loss function of the domain discriminator.
The target cross entropy loss function is as described in the above formula 1, and is not described herein again.
In one embodiment, (x, y) is first defined as any two samples, i.e., two MFCC features and two fundamental frequency features, together with their corresponding speaker labels. Passing x and y through the content encoder E_c and the attribute encoder E_a yields their content codes E_c(x), E_c(y) and attribute codes E_a(x), E_a(y). Cross-feeding the content codes and attribute codes into the generator gives u = G(E_c(y), E_a(x)) and v = G(E_c(x), E_a(y)); u and v are then cross-fed into the generator in the same way to obtain the reconstructions x̂ and ŷ. It is easy to see that the content code of x̂ comes from v and its attribute code comes from u, while the content code of v comes from x and the attribute code of u comes from x; therefore both the content code and the attribute code of x̂ come from x, so x̂ should be exactly the same as x. In the same way, ŷ should be identical to y. Based on this consistency principle, a target consistency loss function is obtained (given in the original as a formula image and not reproduced here), where G is the generator, E_c is the content encoder, E_a is the attribute encoder, and x_i and y_i are sampled audio.
In one embodiment, the target domain discriminator loss function (given in the original as a formula image and not reproduced here) is defined in terms of the domain discriminator D_domain and the generator G.
In one embodiment, the target reconstruction loss function (given in the original as a formula image and not reproduced here) is defined in terms of the generator G, the content encoder E_c, the attribute encoder E_a and sampled audio x_i; it penalizes the difference between a sample x_i and the generator output obtained from that sample's own content code and attribute code.
In one embodiment, the attribute code is constrained to follow the standard normal distribution N(0,1); the target KL loss function is therefore the KL divergence between the attribute code distribution function and the N(0,1) distribution function.
In one embodiment, a code z is randomly sampled from the N(0,1) normal distribution and used as the attribute code, the content code of any sample x is selected as the content code, and the two are fed into the generator together to obtain a generated sample. Feeding this generated sample into the attribute encoder should still recover z, so the target attribute loss function (given in the original as a formula image and not reproduced here) is defined in terms of the generator G, the content encoder E_c, the attribute encoder E_a and the sampled attribute code z_i.
After all the target sub-loss functions are determined, a preset weight value of each target sub-loss function is determined, and the preset weight values and the target sub-loss functions are combined by weighted summation to obtain the total loss function of the domain discriminator. The preset weight values are adjusted according to experimental results and are not limited herein.
The model determining module 205 is configured to determine a voice conversion model of the target speaker when the detection result is that the total loss function converges.
In at least one embodiment of the present invention, when the detection result is that the total loss function is converged, each model parameter value of the current speaker voice conversion model is determined, and the target speaker voice conversion model is determined based on each model parameter value, and then the target speaker voice conversion model can be called to perform speaker voice conversion.
Optionally, after the target speaker voice conversion model is determined, in order to synthesize multi-modal outputs when the attribute encoder is called to extract the timbre information in the training data to obtain the attribute encoding, the model determining module 205 further includes:
calling the attribute encoder to extract the tone information in the training data to obtain an attribute encoding set;
and normalizing the attribute coding set to obtain normal distribution corresponding to the attribute coding set.
The data processing of normalizing the attribute coding set to obtain the corresponding normal distribution N(0,1) is the prior art and is not described herein again. By fitting the attribute coding set to a normal distribution, a plurality of pieces of simulated timbre information can be obtained, which increases the amount of available timbre information, provides more choices for speaker voice conversion, and improves the flexibility of the speaker voice conversion processing.
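A minimal sketch of this normalization and sampling step, assuming `attribute_codes` is a (num_utterances, code_dim) tensor of codes produced by the attribute encoder:

```python
import torch

def normalize_attribute_codes(attribute_codes):
    # Standardize the attribute coding set per dimension so that it
    # approximately follows N(0, 1).
    mean = attribute_codes.mean(dim=0)
    std = attribute_codes.std(dim=0).clamp_min(1e-6)
    return (attribute_codes - mean) / std

def sample_simulated_timbres(num_samples, code_dim):
    # After normalization, new "simulated" timbre codes can simply be drawn
    # from the standard normal distribution.
    return torch.randn(num_samples, code_dim)
```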
The encoding processing module 206 is configured to obtain an audio to be converted and a target audio, call the content encoder to process the audio to be converted to obtain a target content encoding, and call the attribute encoder to process the target audio to obtain a target attribute encoding.
In at least one embodiment of the present invention, the audio to be converted refers to the audio that needs to undergo speaker voice conversion processing, and the target audio refers to audio that contains the timbre information of the target speaker. The target audio may be an audio corpus in the training data, an attribute code sampled from the normal distribution, or the timbre information of an unseen speaker; an unseen speaker may be selected by a user and belongs neither to the training data nor to the normal distribution data. The required speaker voice can be obtained by combining the timbre information in the target audio with the content code of the audio to be converted.
Optionally, before the invoking the content encoder to process the audio to be converted to obtain the target content encoding, and the invoking the attribute encoder to process the target audio to obtain the target attribute encoding, the encoding processing module 206 further includes:
acquiring first source speech of the target content code;
acquiring second source speech of the target attribute code;
detecting whether the first source speech and the second source speech are the same;
and when the detection result is that the first source speech is different from the second source speech, inputting the target content code and the target attribute code to the generator.
The first source speech refers to the speaker information corresponding to the target content code, and the second source speech refers to the speaker information corresponding to the target attribute code. Whether the first source speech and the second source speech are the same is detected; when the detection result is that they are the same, speaker voice conversion processing does not need to be performed; when the detection result is that they are different, speaker voice conversion processing can be performed, and the target content code and the target attribute code are input to the generator.
In at least one embodiment of the invention, the attribute encoder is called to process the target audio to obtain the target attribute code; since the speaker information is extracted directly by the attribute encoder, voice conversion for unseen speakers can be realized, i.e., zero-shot conversion, which improves the flexibility of speaker voice conversion scenarios.
The voice conversion module 207 is configured to input the target content code and the target attribute code to the generator, so as to obtain a converted speaker voice.
In at least one embodiment of the present invention, the target content code and the target attribute code are input to the generator for conversion processing, so as to obtain the speaker voice.
Optionally, the inputting the target content code and the target attribute code into the generator, and the obtaining the converted speaker voice includes:
convolution processing the target content code to obtain a first convolution code;
performing convolution processing on the target attribute code to obtain a second convolution code;
splicing the first convolutional code and the second convolutional code to obtain a target convolutional code;
and inputting the target convolutional code into the generator to obtain the converted speaker voice.
A 1 × 1 convolution operation is performed on the target content code to obtain the first convolution code, and a 1 × 1 convolution operation is performed on the target attribute code to obtain the second convolution code.
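A minimal sketch of this fusion step, assuming both codes are (batch, channels, time) tensors with matching time length, and that the channel sizes and the `generator` callable are illustrative placeholders rather than values taken from the patent:

```python
import torch
import torch.nn as nn

class CodeFusion(nn.Module):
    def __init__(self, content_channels=256, attribute_channels=128):
        super().__init__()
        # 1x1 convolutions applied separately to the two codes.
        self.content_conv = nn.Conv1d(content_channels, content_channels, kernel_size=1)
        self.attribute_conv = nn.Conv1d(attribute_channels, attribute_channels, kernel_size=1)

    def forward(self, content_code, attribute_code, generator):
        first = self.content_conv(content_code)        # first convolution code
        second = self.attribute_conv(attribute_code)   # second convolution code
        target = torch.cat([first, second], dim=1)     # spliced target convolution code
        return generator(target)                       # converted speaker voice
```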
By adopting the above method, speaker voice conversion over many-to-many non-parallel corpora is realized by means of adversarial learning, and since the method does not require parallel corpora as training data, the difficulty of data collection can be greatly reduced; in addition, the invention introduces a content encoder and an attribute encoder, and uses the content encoder and the content discriminator to decompose the training data into non-timbre information and timbre information, thereby synthesizing audio of higher quality.
It is emphasized that the training data may be stored in the nodes of the blockchain in order to further ensure privacy and security of the training data.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 does not constitute a limitation of the embodiments of the present invention, and may be a bus-type configuration or a star-type configuration, and that the computer device 3 may include more or less hardware or software than those shown, or a different arrangement of components.
In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 3 is only an example, and other electronic products that are currently available or may come into existence in the future, such as electronic products that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.
In some embodiments, the memory 31 stores a computer program that, when executed by the at least one processor 32, performs all or a portion of the steps of the counterlearning-based speaker voice conversion method as described. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some embodiments, the at least one processor 32 is a Control Unit (Control Unit) of the computer device 3, connects various components of the entire computer device 3 by using various interfaces and lines, and executes various functions and processes data of the computer device 3 by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or a portion of the steps of the counterlearning-based speaker speech conversion method described in embodiments of the present invention; or implement all or part of the functionality of the speaker-to-speech conversion apparatus based on counterlearning. The at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or devices in the present invention may also be implemented by one unit or device through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for speaker voice conversion based on antagonistic learning, the method comprising:
acquiring and preprocessing training data to obtain MFCC characteristics and fundamental frequency characteristics, wherein the training data comprises audio corpora of a plurality of speakers;
inputting the MFCC features and the fundamental frequency features to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator;
invoking a countermeasure algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached;
acquiring a total loss function of the domain discriminator, and detecting whether the total loss function is converged;
when the detection result is that the total loss function is converged, determining a target speaker voice conversion model;
acquiring audio to be converted and target audio, calling the content encoder to process the audio to be converted to obtain target content codes, and calling the attribute encoder to process the target audio to obtain target attribute codes;
and inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
2. The method of claim 1, wherein preprocessing the training data to obtain MFCC features and fundamental frequency features comprises:
calling a world vocoder to extract an initial MFCC feature and an initial fundamental frequency feature of the training data;
determining a target fixed length;
and intercepting the initial MFCC features and the initial fundamental frequency features according to the target fixed length to obtain target MFCC features and target fundamental frequency features.
3. The method of claim 1, wherein said invoking a countermeasure algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached comprises:
acquiring an initial cross entropy loss function corresponding to the content encoder and the content discriminator;
calling a stochastic gradient descent and back propagation algorithm to optimize the initial cross entropy loss function to obtain a target cross entropy loss function;
detecting whether the target cross entropy loss function is converged;
when the detection result is that the target cross entropy loss function is converged, the content encoder and the content discriminator reach a Nash equilibrium state;
and when the detection result is that the target cross entropy loss function is not converged, the content encoder and the content discriminator do not reach a Nash equilibrium state.
4. The method of claim 1, wherein obtaining the overall loss function of the domain discriminator comprises:
acquiring a target sub-loss function of the domain discriminator, wherein the target sub-loss function comprises a target cross entropy loss function, a target consistency loss function, a target domain discriminator loss function, a target reconstruction loss function, a target KL loss function and a target attribute loss function;
determining a preset weight value of each target sub-loss function;
and weighting and processing the preset weight value and the target sub-loss function to obtain a total loss function of the domain discriminator.
5. The method of claim 1, wherein after the determining the target speaker voice conversion model, the method further comprises:
calling the attribute encoder to extract the tone information in the training data to obtain an attribute encoding set;
and normalizing the attribute coding set to obtain normal distribution corresponding to the attribute coding set.
6. The method as claimed in claim 1, wherein before the invoking the content encoder to process the audio to be converted to obtain a target content encoding and the invoking the property encoder to process the target audio to obtain a target property encoding, the method further comprises:
acquiring first source speech of the target content code;
acquiring second source speech of the target attribute code;
detecting whether the first source speech and the second source speech are the same;
and when the detection result is that the first source speech is different from the second source speech, inputting the target content code and the target attribute code to the generator.
7. The method of claim 1, wherein inputting the target content encoding and the target attribute encoding to the generator results in a converted speaker speech comprises:
convolution processing the target content code to obtain a first convolution code;
performing convolution processing on the target attribute code to obtain a second convolution code;
splicing the first convolutional code and the second convolutional code to obtain a target convolutional code;
and inputting the target convolutional code into the generator to obtain the converted speaker voice.
8. A speaker voice conversion apparatus based on counterstudy, the apparatus comprising:
the preprocessing module is used for acquiring and preprocessing training data to obtain MFCC characteristics and fundamental frequency characteristics, wherein the training data comprises audio corpora of a plurality of speakers;
the model training module is used for inputting the MFCC characteristics and the fundamental frequency characteristics to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator;
the confrontation calling module is used for calling a confrontation algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached;
a convergence detection module, configured to obtain a total loss function of the domain discriminator and detect whether the total loss function converges;
the model determining module is used for determining a target speaker voice conversion model when the detection result is that the total loss function is converged;
the coding processing module is used for acquiring audio to be converted and target audio, calling the content coder to process the audio to be converted to obtain target content coding, and calling the attribute coder to process the target audio to obtain target attribute coding;
and the voice conversion module is used for inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
9. A computer device comprising a processor for implementing the counterlearning-based speaker speech conversion method according to any one of claims 1 to 7 when executing a computer program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the counterlearning-based speaker speech conversion method according to any one of claims 1 to 7.
CN202011632876.XA 2020-12-31 2020-12-31 Speaker voice conversion method based on countermeasure learning and related equipment Active CN112863529B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011632876.XA CN112863529B (en) 2020-12-31 2020-12-31 Speaker voice conversion method based on countermeasure learning and related equipment
PCT/CN2021/096887 WO2022142115A1 (en) 2020-12-31 2021-05-28 Adversarial learning-based speaker voice conversion method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011632876.XA CN112863529B (en) 2020-12-31 2020-12-31 Speaker voice conversion method based on countermeasure learning and related equipment

Publications (2)

Publication Number Publication Date
CN112863529A true CN112863529A (en) 2021-05-28
CN112863529B CN112863529B (en) 2023-09-22

Family

ID=75999980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011632876.XA Active CN112863529B (en) 2020-12-31 2020-12-31 Speaker voice conversion method based on countermeasure learning and related equipment

Country Status (2)

Country Link
CN (1) CN112863529B (en)
WO (1) WO2022142115A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620748B (en) * 2022-12-06 2023-03-28 北京远鉴信息技术有限公司 Comprehensive training method and device for speech synthesis and false identification evaluation


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10186251B1 (en) * 2015-08-06 2019-01-22 Oben, Inc. Voice conversion using deep neural network with intermediate voice training
US10347241B1 (en) * 2018-03-23 2019-07-09 Microsoft Technology Licensing, Llc Speaker-invariant training via adversarial learning
CN110060691B (en) * 2019-04-16 2023-02-28 南京邮电大学 Many-to-many voice conversion method based on i-vector and VARSGAN
CN111161744B (en) * 2019-12-06 2023-04-28 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation
CN111564160B (en) * 2020-04-21 2022-10-18 重庆邮电大学 Voice noise reduction method based on AEWGAN

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293289A (en) * 2017-06-13 2017-10-24 南京医科大学 A kind of speech production method that confrontation network is generated based on depth convolution
KR20200063331A (en) * 2018-11-21 2020-06-05 고려대학교 산학협력단 Multiple speaker voice conversion using conditional cycle GAN
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN110060657A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on SN
CN110600046A (en) * 2019-09-17 2019-12-20 南京邮电大学 Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN111247585A (en) * 2019-12-27 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, device, equipment and storage medium
CN111243569A (en) * 2020-02-24 2020-06-05 浙江工业大学 Emotional voice automatic generation method and device based on generation type confrontation network
CN111429894A (en) * 2020-03-12 2020-07-17 南京邮电大学 Many-to-many speaker conversion method based on SE-ResNet STARGAN
CN111429893A (en) * 2020-03-12 2020-07-17 南京邮电大学 Many-to-many speaker conversion method based on Transitive STARGAN
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling
CN112037760A (en) * 2020-08-24 2020-12-04 北京百度网讯科技有限公司 Training method and device of voice spectrum generation model and electronic equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345453A (en) * 2021-06-01 2021-09-03 平安科技(深圳)有限公司 Singing voice conversion method, device, equipment and storage medium
CN113345453B (en) * 2021-06-01 2023-06-16 平安科技(深圳)有限公司 Singing voice conversion method, device, equipment and storage medium
CN113870876A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Singing voice conversion method and device based on self-supervision model and readable storage medium
CN113870876B (en) * 2021-09-27 2024-06-25 平安科技(深圳)有限公司 Singing voice conversion method, device and readable storage medium based on self-supervision model
CN115064177A (en) * 2022-06-14 2022-09-16 中国第一汽车股份有限公司 Voice conversion method, apparatus, device and medium based on voiceprint encoder
CN115222752A (en) * 2022-09-19 2022-10-21 之江实验室 Pathological image feature extractor training method and device based on feature decoupling
CN115222752B (en) * 2022-09-19 2023-01-24 之江实验室 Pathological image feature extractor training method and device based on feature decoupling

Also Published As

Publication number Publication date
WO2022142115A1 (en) 2022-07-07
CN112863529B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN112863529B (en) Speaker voice conversion method based on countermeasure learning and related equipment
CN110070852B (en) Method, device, equipment and storage medium for synthesizing Chinese voice
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN112086086A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN112185348A (en) Multilingual voice recognition method and device and electronic equipment
CN112906385B (en) Text abstract generation method, computer equipment and storage medium
CN113436634B (en) Voice classification method and device based on voiceprint recognition and related equipment
CN112951203B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113380223B (en) Method, device, system and storage medium for disambiguating polyphone
CN112820269A (en) Text-to-speech method, device, electronic equipment and storage medium
CN113420556A (en) Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN115688937A (en) Model training method and device
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116050352A (en) Text encoding method and device, computer equipment and storage medium
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN114420168A (en) Emotion recognition method, device, equipment and storage medium
CN112489628B (en) Voice data selection method and device, electronic equipment and storage medium
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113704410A (en) Emotion fluctuation detection method and device, electronic equipment and storage medium
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN114218356B (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN113436617B (en) Voice sentence breaking method, device, computer equipment and storage medium
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN113221990A (en) Information input method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant