CN112863529A - Speaker voice conversion method based on adversarial learning and related equipment - Google Patents

Speaker voice conversion method based on adversarial learning and related equipment

Info

Publication number
CN112863529A
Authority
CN
China
Prior art keywords
target
content
loss function
attribute
encoder
Prior art date
Legal status
Granted
Application number
CN202011632876.XA
Other languages
Chinese (zh)
Other versions
CN112863529B (en)
Inventor
梁爽
缪陈峰
马骏
王少军
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011632876.XA (CN112863529B)
Priority to PCT/CN2021/096887 (WO2022142115A1)
Publication of CN112863529A
Application granted
Publication of CN112863529B
Active legal status
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013: Adapting to target pitch
    • G10L 2021/0135: Voice conversion or morphing
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18: the extracted parameters being spectral information of each sub-band
    • G10L 25/24: the extracted parameters being the cepstrum
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of data processing, and provides a speaker voice conversion method and device based on adversarial learning, a computer device and a storage medium. The method comprises the following steps: preprocessing training data to obtain MFCC features and fundamental frequency features; inputting the MFCC features and fundamental frequency features into an initial speaker voice conversion model for training; invoking an adversarial algorithm to train a content encoder and a content discriminator until a Nash equilibrium state is reached; acquiring a total loss function of the domain discriminator and detecting whether the total loss function converges; when the detection result is that the total loss function converges, determining a target speaker voice conversion model; acquiring audio to be converted and target audio, calling the content encoder to process the audio to be converted to obtain a target content code, and calling an attribute encoder to process the target audio to obtain a target attribute code; and inputting the target content code and the target attribute code into the generator to obtain the converted speaker voice. The invention can improve the efficiency and quality of speaker voice conversion.

Description

Speaker voice conversion method based on adversarial learning and related equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a speaker voice conversion method and device based on adversarial learning, computer equipment and a storage medium.
Background
With the development of speech technology, people increasingly wish to express themselves with a preferred timbre, and speaker voice conversion technology is therefore attracting more and more interest. Speaker voice conversion preserves the text-related content information in the original audio while replacing the timbre of the original audio with the timbre of another designated speaker.
Regarding the speaker voice conversion process, in the course of implementing the present invention the inventors found that the prior art has at least the following problems: there are many different methods for speaker voice conversion, for example Gaussian mixture models, deep neural networks and other models, but most of these models require parallel corpora, that is, different speakers in the training data set need to speak the same sentences with pronunciation, prosody and the like kept as consistent as possible. Such data are very difficult to collect, so speaker conversion is inefficient and its quality cannot be guaranteed.
Therefore, it is desirable to provide a method for converting a speaker's voice, which can improve the efficiency and quality of the conversion of the speaker's voice.
Disclosure of Invention
In view of the foregoing, there is a need for a speaker voice conversion method, apparatus, computer device and storage medium based on adversarial learning that can improve the efficiency and quality of speaker voice conversion.
A first aspect of the present invention provides a speaker voice conversion method based on adversarial learning, the method comprising:
acquiring and preprocessing training data to obtain MFCC characteristics and fundamental frequency characteristics, wherein the training data comprises audio corpora of a plurality of speakers;
inputting the MFCC features and the fundamental frequency features to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator;
invoking an adversarial algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached;
acquiring a total loss function of the domain discriminator, and detecting whether the total loss function is converged;
when the detection result is that the total loss function is converged, determining a target speaker voice conversion model;
acquiring audio to be converted and target audio, calling the content encoder to process the audio to be converted to obtain target content codes, and calling the attribute encoder to process the target audio to obtain target attribute codes;
and inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
Further, in the above speaker voice conversion method based on adversarial learning according to an embodiment of the present invention, the preprocessing the training data to obtain the MFCC features and the fundamental frequency features includes:
calling a WORLD vocoder to extract an initial MFCC feature and an initial fundamental frequency feature of the training data;
determining a target fixed length;
and intercepting the initial MFCC features and the initial fundamental frequency features according to the target fixed length to obtain target MFCC features and target fundamental frequency features.
Further, in the above speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention, the training the content encoder and the content discriminator by invoking an adversarial algorithm until reaching a Nash equilibrium state includes:
acquiring an initial cross entropy loss function corresponding to the content encoder and the content discriminator;
calling stochastic gradient descent and backpropagation algorithms to optimize the cross-entropy loss function to obtain a target cross-entropy loss function;
detecting whether the target cross entropy loss function is converged;
when the detection result is that the target cross entropy loss function is converged, the content encoder and the content discriminator reach a Nash equilibrium state;
and when the detection result is that the target cross entropy loss function is not converged, the content encoder and the content discriminator do not reach a Nash equilibrium state.
Further, in the above speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention, the obtaining the total loss function of the domain discriminator includes:
acquiring a target sub-loss function of the domain discriminator, wherein the target sub-loss function comprises a target cross entropy loss function, a target consistency loss function, a target domain discriminator loss function, a target reconstruction loss function, a target KL loss function and a target attribute loss function;
determining a preset weight value of each target sub-loss function;
and weighting and processing the preset weight value and the target sub-loss function to obtain a total loss function of the domain discriminator.
Further, in the above speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention, after the determining the target speaker voice conversion model, the method further includes:
calling the attribute encoder to extract the tone information in the training data to obtain an attribute encoding set;
and normalizing the attribute coding set to obtain normal distribution corresponding to the attribute coding set.
Further, in the above speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention, before the invoking the content encoder to process the audio to be converted to obtain the target content code and invoking the attribute encoder to process the target audio to obtain the target attribute code, the method further includes:
acquiring first source speech of the target content code;
acquiring second source speech of the target attribute code;
detecting whether the first source speech and the second source speech are the same;
and when the detection result is that the first source speech is different from the second source speech, inputting the target content code and the target attribute code to the generator.
Further, in the above speaker voice conversion method based on adversarial learning according to an embodiment of the present invention, the inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice includes:
convolution processing the target content code to obtain a first convolution code;
performing convolution processing on the target attribute code to obtain a second convolution code;
splicing the first convolutional code and the second convolutional code to obtain a target convolutional code;
and inputting the target convolutional code into the generator to obtain the converted speaker voice.
The second aspect of the embodiments of the present invention also provides a speaker voice conversion apparatus based on adversarial learning, the apparatus including:
the preprocessing module is used for acquiring and preprocessing training data to obtain MFCC characteristics and fundamental frequency characteristics, wherein the training data comprises audio corpora of a plurality of speakers;
the model training module is used for inputting the MFCC characteristics and the fundamental frequency characteristics to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator;
the adversarial invoking module is used for invoking an adversarial algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached;
a convergence detection module, configured to obtain a total loss function of the domain discriminator and detect whether the total loss function converges;
the model determining module is used for determining a target speaker voice conversion model when the detection result is that the total loss function is converged;
the coding processing module is used for acquiring audio to be converted and target audio, calling the content coder to process the audio to be converted to obtain target content coding, and calling the attribute coder to process the target audio to obtain target attribute coding;
and the voice conversion module is used for inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
A third aspect of the present invention provides a computer device comprising a processor configured to implement the adversarial-learning-based speaker voice conversion method when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the adversarial-learning-based speaker voice conversion method.
In summary, the speaker voice conversion method, apparatus, computer device and storage medium based on adversarial learning realize speaker voice conversion on many-to-many non-parallel corpora by means of adversarial learning; the method does not require parallel corpora as training data, which greatly reduces the difficulty of data collection. In addition, the invention introduces a content encoder and an attribute encoder, and decomposes the non-timbre information and timbre information of the training data by using the content encoder and a content discriminator, thereby synthesizing audio of higher quality.
Drawings
Fig. 1 is a flowchart of a speaker voice conversion method based on adversarial learning according to an embodiment of the present invention.
Fig. 2 is a block diagram of a speaker voice conversion apparatus based on adversarial learning according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The speaker voice conversion method based on adversarial learning provided by the embodiment of the present invention is executed by a computer device; correspondingly, the speaker voice conversion apparatus based on adversarial learning runs in the computer device.
Fig. 1 is a flowchart of a speaker voice conversion method based on adversarial learning according to an embodiment of the present invention. The method specifically includes the following steps; according to different requirements, the order of the steps in the flowchart can be changed and some steps can be omitted.
S11, training data are collected and preprocessed to obtain MFCC characteristics and fundamental frequency characteristics, and the training data comprise audio corpora of a plurality of speakers.
In at least one embodiment of the present application, the training data may be non-parallel text data, including audio corpora of several speakers, and the audio corpora of several speakers do not need to have the same text. Illustratively, 30 speakers are selected, each speaker respectively records 400 sentences of audio corpus of different texts, and the audio corpus recorded by the 30 speakers is used as training data.
Optionally, the preprocessing the training data to obtain MFCC features and fundamental frequency features includes:
calling a WORLD vocoder to extract an initial MFCC feature and an initial fundamental frequency feature of the training data;
determining a target fixed length;
and intercepting the initial MFCC features and the initial fundamental frequency features according to the target fixed length to obtain target MFCC features and target fundamental frequency features.
The MFCC is a classic audio feature and is often applied to the fields of speech recognition, audio data classification and the like. The MFCC feature vector includes basic features of 12 to 16 dimensions, one-dimensional energy features, and first-order difference and second-order difference features of the basic features and the energy features, so the MFCC feature vector may have 39 dimensions, 42 dimensions, 45 dimensions, 48 dimensions, and 51 dimensions. In general, when the MFCC feature vector extraction is performed on audio data, a 39-dimensional MFCC feature vector is preferably used. The data processing process for preprocessing the training data to obtain the MFCC features and the fundamental frequency features is the prior art and is not described herein again. The target fixed length is preset and is used for ensuring that training data input to the initial speaker conversion model is of a fixed dimension length, and the target fixed length can be set according to actual requirements without limitation.
In one embodiment, the audio data collected in different domains may itself contain noise. Therefore, before calling the WORLD vocoder to extract the initial MFCC features of the training data, the method further comprises: cleaning the training data, denoising the audio and unifying the audio sampling rate.
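The patent does not provide code for this preprocessing step; the following is a minimal, non-authoritative sketch assuming the pyworld and librosa libraries, a 16 kHz sampling rate, 39-dimensional MFCCs and a fixed length of 128 frames (all illustrative choices, not values prescribed by the patent).

```python
import numpy as np
import librosa
import pyworld as pw

def preprocess(wav_path, sr=16000, n_mfcc=39, fixed_len=128):
    """Extract MFCC and fundamental-frequency (F0) features, then cut them to a fixed length."""
    # Load while unifying the sampling rate (part of the cleaning step).
    audio, _ = librosa.load(wav_path, sr=sr)

    # Fundamental frequency via the WORLD vocoder (DIO estimate + StoneMask refinement).
    audio64 = audio.astype(np.float64)
    f0, t = pw.dio(audio64, sr)
    f0 = pw.stonemask(audio64, f0, t, sr)

    # 39-dimensional MFCC features; librosa is used here as a stand-in extractor,
    # and frame alignment between the two extractors is glossed over in this sketch.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)

    # Truncate both feature streams to the preset target fixed length.
    return mfcc[:, :fixed_len], f0[:fixed_len]
```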
The invention only needs the audio corpus as the training data, does not need to carry out text labeling processing on the audio corpus, and improves the efficiency of the speaker voice conversion processing.
And S12, inputting the MFCC features and the fundamental frequency features to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator.
In at least one embodiment of the present invention, the MFCC features and the fundamental frequency features are input to an initial speaker speech conversion model for training, where the initial speaker speech conversion model includes a content encoder, an attribute encoder, a content discriminator, a generator, and a domain discriminator.
The content encoder is configured to extract the non-timbre information in the training data to obtain a content code; that is, the content encoder receives the MFCC features and the fundamental frequency features. In one embodiment, the content encoder includes a plurality of CNN layers, each CNN layer being followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 7; with 7 CNN layers the content encoder extracts content codes most effectively. When the number of CNN layers is less than 7, the loss value of the content encoder becomes high; when the number of CNN layers is greater than 7, the content encoder becomes large and each operation takes longer.
The attribute encoder is used to extract the timbre information in the training data to obtain an attribute code, and its input is also the MFCC features and the fundamental frequency features. In one embodiment, the attribute encoder includes a plurality of CNN layers, the activation function is a ReLU function, and each CNN layer is followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 7; with 7 CNN layers the attribute encoder extracts attribute codes most effectively. When the number of CNN layers is less than 7, the loss value of the attribute encoder becomes high; when the number of CNN layers is greater than 7, the attribute encoder becomes large and each operation takes longer.
The content discriminator is used to receive a content code and predict the speaker probability corresponding to that content code; the higher the speaker probability, the more likely the content code was produced by that speaker, and the lower the speaker probability, the less likely it was. In one embodiment, the content discriminator includes several CNN layers, the activation function is a Leaky ReLU function, and each CNN layer is followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 5; with 5 CNN layers the content discriminator predicts the speaker probability corresponding to a content code most effectively. When the number of CNN layers is less than 5, the loss value of the content discriminator becomes high; when the number of CNN layers is greater than 5, the content discriminator becomes large and each operation takes longer.
The generator is used to receive any content code and attribute code and perform speaker voice conversion based on them. In one embodiment, the generator includes a number of CNN layers and a number of deconvolution layers. Illustratively, the number of CNN layers is 5 and the number of deconvolution layers is 5; with 5 CNN layers the generator performs speaker voice conversion most effectively. When the number of CNN layers is less than 5, the loss value of the generator becomes high; when the number of CNN layers is greater than 5, the generator becomes large and each operation takes longer.
The domain discriminator is used to obtain the total loss function of the model. In one embodiment, the domain discriminator includes a plurality of CNN layers and a plurality of average pooling layers. Illustratively, the number of CNN layers is 6 and the number of average pooling layers is 6; with 6 CNN layers the domain discriminator obtains the total loss function of the model most effectively. When the number of CNN layers is less than 6, the loss value of the domain discriminator becomes high; when the number of CNN layers is greater than 6, the domain discriminator becomes large and each operation takes longer.
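As a rough illustration of the encoder building block described above (a sketch only; channel widths, kernel size and code dimension are assumptions not specified by the patent), a PyTorch version might look like this; the attribute encoder and content discriminator follow the same pattern with their own depths and activations.

```python
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Seven CNN layers, each followed by batch normalization, ReLU and Dropout,
    mapping (MFCC + F0) frames to a content code."""
    def __init__(self, in_dim=40, hidden=256, code_dim=128, n_layers=7, dropout=0.1):
        super().__init__()
        layers, ch = [], in_dim
        for i in range(n_layers):
            out_ch = code_dim if i == n_layers - 1 else hidden
            layers += [nn.Conv1d(ch, out_ch, kernel_size=5, padding=2),
                       nn.BatchNorm1d(out_ch),
                       nn.ReLU(),
                       nn.Dropout(dropout)]
            ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):       # x: (batch, in_dim, frames)
        return self.net(x)      # content code: (batch, code_dim, frames)
```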
And S13, invoking a countermeasure algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached.
In at least one embodiment of the present invention, during the training of the initial speaker voice conversion model, in order to separate the non-timbre information in the training data from the original audio, the content encoder and the content discriminator are trained in an adversarial learning manner: the content discriminator tries to distinguish which speaker a received content code belongs to, while the content encoder tries to make its output indistinguishable to the content discriminator, and finally the content encoder and the content discriminator reach a Nash equilibrium state. When the Nash equilibrium state is reached, the content discriminator cannot judge which speaker a content code belongs to, which means the content code contains no timbre information at all.
Optionally, the invoking the adversarial algorithm to train the content encoder and the content discriminator until the Nash equilibrium state is reached comprises:
acquiring an initial cross entropy loss function corresponding to the content encoder and the content discriminator;
calling stochastic gradient descent and backpropagation algorithms to optimize the cross-entropy loss function to obtain a target cross-entropy loss function;
detecting whether the target cross entropy loss function is converged;
when the detection result is that the target cross entropy loss function is converged, the content encoder and the content discriminator reach a Nash equilibrium state;
and when the detection result is that the target cross entropy loss function is not converged, the content encoder and the content discriminator do not reach a Nash equilibrium state.
The initial cross-entropy loss function (formula 1) is given in the original as a formula image and is not reproduced here; it is defined in terms of the content encoder E_c, the content discriminator D_c and sampled audio x_i, with D_c predicting which speaker a content code E_c(x_i) belongs to.
For the initial cross-entropy loss function, the content encoder wants to maximize the function while the content discriminator wants to minimize it. The cross-entropy loss function is optimized through stochastic gradient descent and backpropagation until it converges, at which point the content encoder and the content discriminator reach a Nash equilibrium state; otherwise, the content encoder and the content discriminator have not reached a Nash equilibrium state.
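A hedged sketch of one alternating update of this min-max game follows (the per-utterance pooling inside the discriminator, the optimizers and the speaker-label format are assumptions for illustration, not details stated by the patent).

```python
import torch
import torch.nn.functional as F

def adversarial_step(content_encoder, content_discriminator,
                     enc_opt, disc_opt, features, speaker_ids):
    """One alternating update: the content discriminator minimizes the speaker
    cross-entropy on content codes, the content encoder maximizes it."""
    # Discriminator step: learn to tell which speaker each content code came from.
    with torch.no_grad():
        codes = content_encoder(features)
    d_loss = F.cross_entropy(content_discriminator(codes), speaker_ids)
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # Encoder step: make the content codes speaker-indistinguishable
    # (maximize the discriminator's cross-entropy).
    e_loss = -F.cross_entropy(content_discriminator(content_encoder(features)), speaker_ids)
    enc_opt.zero_grad()
    e_loss.backward()
    enc_opt.step()
    return d_loss.item(), -e_loss.item()
```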
Through model training, the invention makes the content encoder and the content discriminator reach a Nash equilibrium state. At this point the output of the content encoder no longer contains any speaker information, and the content discriminator cannot judge which speaker the content information comes from. This decouples content from attributes, avoids the interference of redundant information during speaker conversion, reduces model noise and improves the quality of speaker voice conversion.
S14, acquiring the total loss function of the domain discriminator and detecting whether the total loss function converges; if the total loss function converges, step S15 is executed.
In at least one embodiment of the present invention, the total loss function of the domain discriminator includes a plurality of sections, and the total loss function can be obtained by weighting and processing the sub-loss functions of the respective sections. Determining whether the model is trained by detecting whether the total loss function is converged, wherein it can be understood that when the detection result is that the total loss function is converged, the model is trained; and when the detection result is that the total loss function is not converged, determining that the model is not trained, and continuing iterative training until the total loss function is converged.
Optionally, the obtaining the total loss function of the domain discriminator includes:
acquiring a target sub-loss function of the domain discriminator, wherein the target sub-loss function comprises a target cross entropy loss function, a target consistency loss function, a target domain discriminator loss function, a target reconstruction loss function, a target KL loss function and a target attribute loss function;
determining a preset weight value of each target sub-loss function;
and weighting and processing the preset weight value and the target sub-loss function to obtain a total loss function of the domain discriminator.
The target cross entropy loss function is as described in the above formula 1, and is not described herein again.
In one embodiment, (x, y) is first defined as any two samples, i.e., two MFCC features and two fundamental frequency features, together with their corresponding speaker labels. Passing x and y through the content encoder E_c and the attribute encoder E_a yields their content codes E_c(x), E_c(y) and attribute codes E_a(x), E_a(y). Cross-feeding the content codes and attribute codes into the generator gives u = G(E_c(y), E_a(x)) and v = G(E_c(x), E_a(y)); u and v are then cross-fed into the generator in the same way to obtain the reconstructions x̂ and ŷ. It is easy to see that the content code of x̂ comes from v and its attribute code comes from u, while the content code of v comes from x and the attribute code of u comes from x; therefore both the content code and the attribute code of x̂ come from x, so x̂ should be exactly the same as x. In the same way, ŷ should be identical to y. Based on this consistency principle, a target consistency loss function is obtained (given in the original as a formula image and not reproduced here), where G is the generator, E_c is the content encoder, E_a is the attribute encoder, and x_i and y_i are sampled audio.
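The cross-reconstruction logic above could be computed as in the following sketch (a non-authoritative illustration; the L1 distance and the module interfaces are assumptions, since the patent's formula is only given as an image).

```python
import torch

def consistency_loss(G, E_c, E_a, x, y):
    """Cross-translate twice; after the second crossed pass each sample should recover
    its own content and timbre, hence match the original."""
    u = G(E_c(y), E_a(x))        # content of y, timbre of x
    v = G(E_c(x), E_a(y))        # content of x, timbre of y
    x_hat = G(E_c(v), E_a(u))    # both codes now originate from x
    y_hat = G(E_c(u), E_a(v))    # both codes now originate from y
    return torch.mean(torch.abs(x_hat - x)) + torch.mean(torch.abs(y_hat - y))
```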
In one embodiment, the target domain discriminator loss function (given in the original as a formula image and not reproduced here) is defined in terms of the domain discriminator D_domain and the generator G.
In one embodiment, the target reconstruction loss function (given in the original as a formula image and not reproduced here) is defined in terms of the generator G, the content encoder E_c, the attribute encoder E_a and sampled audio x_i; it penalizes the difference between a sample x_i and the generator output obtained from that sample's own content code and attribute code.
In one embodiment, the attribute code is constrained to follow the standard normal distribution N(0,1); the target KL loss function is therefore the KL divergence between the attribute code distribution function and the N(0,1) distribution function.
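If the attribute encoder is assumed to output the mean and log-variance of a diagonal Gaussian (an assumption; the patent does not state the parameterization), the KL term against N(0,1) takes the usual closed form:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over code dimensions and averaged over the batch."""
    return (0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)).sum(dim=-1).mean()
```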
In one embodiment, a code z is randomly sampled from the N(0,1) normal distribution and used as the attribute code, the content code of any sample x is selected as the content code, and the two are fed into the generator together to obtain a generated sample. Feeding this generated sample into the attribute encoder should still recover z, so the target attribute loss function (given in the original as a formula image and not reproduced here) is defined in terms of the generator G, the content encoder E_c, the attribute encoder E_a and the sampled attribute code z_i.
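A sketch of this attribute (latent-regression style) loss under the same interface assumptions; the squared error is an illustrative choice, not the patent's stated metric.

```python
import torch

def attribute_loss(G, E_c, E_a, x, z):
    """z is sampled from N(0,1) and used as the attribute code; re-encoding the generated
    audio with the attribute encoder should recover z."""
    generated = G(E_c(x), z)
    return torch.mean((E_a(generated) - z) ** 2)
```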
After all the target sub-loss functions are determined, a preset weight value of each target sub-loss function is determined, and the preset weight values and the target sub-loss functions are combined by weighted summation to obtain the total loss function of the domain discriminator. The preset weight values are adjusted according to experimental results and are not limited herein.
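For illustration, the weighted summation might be implemented as below; the weight values shown are placeholders, not values disclosed by the patent.

```python
# Hypothetical weights: the patent only says they are preset and tuned experimentally.
weights = {"cross_entropy": 1.0, "consistency": 10.0, "domain": 1.0,
           "reconstruction": 10.0, "kl": 0.1, "attribute": 1.0}

def total_loss(sub_losses):
    """Weighted summation of the target sub-loss functions (scalar tensors keyed by name)."""
    return sum(weights[name] * value for name, value in sub_losses.items())
```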
And S15, determining the voice conversion model of the target speaker.
In at least one embodiment of the present invention, when the detection result is that the total loss function is converged, each model parameter value of the current speaker voice conversion model is determined, and the target speaker voice conversion model is determined based on each model parameter value, and then the target speaker voice conversion model can be called to perform speaker voice conversion.
Optionally, after the target speaker voice conversion model is determined, in order to synthesize multi-modal output and increase the amount of available timbre information, the method further includes:
calling the attribute encoder to extract the tone information in the training data to obtain an attribute encoding set;
and normalizing the attribute coding set to obtain normal distribution corresponding to the attribute coding set.
The data processing process of normalizing the attribute encoding set to obtain the corresponding normal distribution N(0,1) is the prior art and is not described again here. By fitting the attribute encoding set to a normal distribution, many pieces of simulated timbre information can be obtained, which increases the amount of timbre information, provides more choices for speaker voice conversion and improves the flexibility of speaker voice conversion processing.
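One possible reading of this step is sketched below (fitting a per-dimension Gaussian over the attribute codes and sampling new codes as simulated timbres; this is an assumption about the procedure, not the patent's prescribed implementation).

```python
import torch

def fit_and_sample_attribute_codes(E_a, training_features, n_samples=10):
    """Encode the training utterances, normalize the attribute codes toward N(0,1)
    per dimension, then sample new codes that act as simulated timbres."""
    with torch.no_grad():
        codes = torch.stack([E_a(feat) for feat in training_features])  # (N, code_dim)
    mean, std = codes.mean(dim=0), codes.std(dim=0)
    normalized = (codes - mean) / (std + 1e-8)             # approximately N(0,1) per dimension
    new_timbres = torch.randn(n_samples, codes.shape[-1])  # sampled attribute codes
    return normalized, new_timbres
```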
S16, obtaining the audio to be converted and the target audio, calling the content encoder to process the audio to be converted to obtain the target content code, and calling the attribute encoder to process the target audio to obtain the target attribute code.
In at least one embodiment of the present invention, the audio to be converted refers to audio that needs to undergo speaker voice conversion processing, and the target audio refers to audio that contains the timbre information of a target speaker. The target audio may be an audio corpus from the training data, any attribute code sampled from the normal distribution, or the timbre information of an unseen speaker; an unseen speaker can be selected by the user and belongs neither to the training data nor to the normal-distribution data. The desired speaker voice can be obtained by combining the timbre information in the target audio with the content code of the audio to be converted.
Optionally, before the invoking the content encoder to process the audio to be converted to obtain a target content encoding, and the invoking the attribute encoder to process the target audio to obtain a target attribute encoding, the method further includes:
acquiring first source speech of the target content code;
acquiring second source speech of the target attribute code;
detecting whether the first source speech and the second source speech are the same;
and when the detection result is that the first source speech is different from the second source speech, inputting the target content code and the target attribute code to the generator.
The first source speech refers to the speaker information corresponding to the target content code, and the second source speech refers to the speaker information corresponding to the target attribute code. Whether the first source speech and the second source speech are the same is detected; when the detection result is that they are the same, speaker voice conversion processing does not need to be executed; when the detection result is that they are different, speaker voice conversion processing can be executed, and the target content code and the target attribute code are input to the generator.
In at least one embodiment of the present invention, the attribute encoder is called to process the target audio to obtain the target attribute code. Because the attribute encoder is used directly to extract speaker information, voice conversion for unseen speakers can be realized, achieving zero-shot conversion and improving the flexibility of speaker voice conversion scenarios.
And S17, inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
In at least one embodiment of the present invention, the target content code and the target attribute code are input to the generator for conversion processing, so as to obtain the speaker voice.
Optionally, the inputting the target content code and the target attribute code into the generator, and the obtaining the converted speaker voice includes:
convolution processing the target content code to obtain a first convolution code;
performing convolution processing on the target attribute code to obtain a second convolution code;
splicing the first convolutional code and the second convolutional code to obtain a target convolutional code;
and inputting the target convolutional code into the generator to obtain the converted speaker voice.
And performing 1 × 1 convolution operation on the target content code to obtain a first convolution code, and performing 1 × 1 convolution operation on the target attribute code to obtain a second convolution code.
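An illustrative sketch of this conversion head (the 1×1 convolutions and channel-wise splicing follow the description above; the channel sizes and the time-axis handling of the attribute code are assumptions).

```python
import torch
import torch.nn as nn

class ConversionHead(nn.Module):
    """Project the content code and the attribute code with 1x1 convolutions,
    splice them along the channel axis and hand the result to the generator."""
    def __init__(self, content_dim=128, attr_dim=128, proj_dim=128):
        super().__init__()
        self.content_proj = nn.Conv1d(content_dim, proj_dim, kernel_size=1)
        self.attr_proj = nn.Conv1d(attr_dim, proj_dim, kernel_size=1)

    def forward(self, generator, content_code, attr_code):
        # Both codes are assumed shaped (batch, channels, frames); a per-utterance
        # attribute vector would first be repeated along the time axis.
        first = self.content_proj(content_code)     # first convolutional code
        second = self.attr_proj(attr_code)          # second convolutional code
        target = torch.cat([first, second], dim=1)  # spliced target convolutional code
        return generator(target)                    # converted speaker speech features
```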
By adopting the above method, speaker voice conversion on many-to-many non-parallel corpora is realized by means of adversarial learning; the method does not require parallel corpora as training data, so the difficulty of data collection can be greatly reduced. In addition, the invention introduces a content encoder and an attribute encoder, and decomposes the non-timbre information and timbre information of the training data by using the content encoder and a content discriminator, thereby synthesizing audio of higher quality.
It is emphasized that the training data may be stored in the nodes of the blockchain in order to further ensure privacy and security of the training data.
Fig. 2 is a block diagram of a speaker voice conversion apparatus based on adversarial learning according to a second embodiment of the present invention.
In some embodiments, the adversarial-learning-based speaker voice conversion apparatus 20 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the adversarial-learning-based speaker voice conversion apparatus 20 may be stored in a memory of a computer device and executed by at least one processor to perform the functions of adversarial-learning-based speaker voice conversion (see the detailed description of Fig. 1).
In this embodiment, the adversarial-learning-based speaker voice conversion apparatus 20 can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a preprocessing module 201, a model training module 202, an adversarial invoking module 203, a convergence detection module 204, a model determination module 205, an encoding processing module 206, and a speech conversion module 207. A module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function, and is stored in the memory. In this embodiment, the functions of the modules will be described in detail in the following embodiments.
The preprocessing module 201 is configured to acquire and preprocess training data to obtain MFCC features and fundamental frequency features, where the training data includes audio corpora of a plurality of speakers.
In at least one embodiment of the present application, the training data may be non-parallel text data, including audio corpora of several speakers, and the audio corpora of several speakers do not need to have the same text. Illustratively, 30 speakers are selected, each speaker respectively records 400 sentences of audio corpus of different texts, and the audio corpus recorded by the 30 speakers is used as training data.
Optionally, the preprocessing the training data to obtain MFCC features and fundamental frequency features includes:
calling a WORLD vocoder to extract an initial MFCC feature and an initial fundamental frequency feature of the training data;
determining a target fixed length;
and intercepting the initial MFCC features and the initial fundamental frequency features according to the target fixed length to obtain target MFCC features and target fundamental frequency features.
The MFCC is a classic audio feature and is often applied to the fields of speech recognition, audio data classification and the like. The MFCC feature vector includes basic features of 12 to 16 dimensions, one-dimensional energy features, and first-order difference and second-order difference features of the basic features and the energy features, so the MFCC feature vector may have 39 dimensions, 42 dimensions, 45 dimensions, 48 dimensions, and 51 dimensions. In general, when the MFCC feature vector extraction is performed on audio data, a 39-dimensional MFCC feature vector is preferably used. The data processing process for preprocessing the training data to obtain the MFCC features and the fundamental frequency features is the prior art and is not described herein again. The target fixed length is preset and is used for ensuring that training data input to the initial speaker voice conversion model is of a fixed dimension length, and the target fixed length can be set according to actual requirements without limitation.
In one embodiment, the audio data collected in different domains may itself contain noise. Therefore, before calling the WORLD vocoder to extract the initial MFCC features of the training data, the preprocessing module 201 is further used for: cleaning the training data, denoising the audio and unifying the audio sampling rate.
The invention only needs the audio corpus as the training data, does not need to carry out text labeling processing on the audio corpus, and improves the efficiency of the speaker voice conversion processing.
The model training module 202 is configured to input the MFCC features and the fundamental frequency features to an initial speaker speech conversion model for training, where the initial speaker speech conversion model includes a content encoder, an attribute encoder, a content discriminator, a generator, and a domain discriminator.
In at least one embodiment of the present invention, the MFCC features and the fundamental frequency features are input to an initial speaker speech conversion model for training, where the initial speaker speech conversion model includes a content encoder, an attribute encoder, a content discriminator, a generator, and a domain discriminator.
The content encoder is configured to extract the non-timbre information in the training data to obtain a content code; that is, the content encoder receives the MFCC features and the fundamental frequency features. In one embodiment, the content encoder includes a plurality of CNN layers, each CNN layer being followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 7; with 7 CNN layers the content encoder extracts content codes most effectively. When the number of CNN layers is less than 7, the loss value of the content encoder becomes high; when the number of CNN layers is greater than 7, the content encoder becomes large and each operation takes longer.
The attribute encoder is used to extract the timbre information in the training data to obtain an attribute code, and its input is also the MFCC features and the fundamental frequency features. In one embodiment, the attribute encoder includes a plurality of CNN layers, the activation function is a ReLU function, and each CNN layer is followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 7; with 7 CNN layers the attribute encoder extracts attribute codes most effectively. When the number of CNN layers is less than 7, the loss value of the attribute encoder becomes high; when the number of CNN layers is greater than 7, the attribute encoder becomes large and each operation takes longer.
The content discriminator is used to receive a content code and predict the speaker probability corresponding to that content code; the higher the speaker probability, the more likely the content code was produced by that speaker, and the lower the speaker probability, the less likely it was. In one embodiment, the content discriminator includes several CNN layers, the activation function is a Leaky ReLU function, and each CNN layer is followed by a batch normalization layer, a ReLU activation layer and a Dropout layer. Illustratively, the number of CNN layers is 5; with 5 CNN layers the content discriminator predicts the speaker probability corresponding to a content code most effectively. When the number of CNN layers is less than 5, the loss value of the content discriminator becomes high; when the number of CNN layers is greater than 5, the content discriminator becomes large and each operation takes longer.
The generator is used to receive any content code and attribute code and perform speaker voice conversion based on them. In one embodiment, the generator includes a number of CNN layers and a number of deconvolution layers. Illustratively, the number of CNN layers is 5 and the number of deconvolution layers is 5; with 5 CNN layers the generator performs speaker voice conversion most effectively. When the number of CNN layers is less than 5, the loss value of the generator becomes high; when the number of CNN layers is greater than 5, the generator becomes large and each operation takes longer.
The domain discriminator is used to obtain the total loss function of the model. In one embodiment, the domain discriminator includes a plurality of CNN layers and a plurality of average pooling layers. Illustratively, the number of CNN layers is 6 and the number of average pooling layers is 6; with 6 CNN layers the domain discriminator obtains the total loss function of the model most effectively. When the number of CNN layers is less than 6, the loss value of the domain discriminator becomes high; when the number of CNN layers is greater than 6, the domain discriminator becomes large and each operation takes longer.
The adversarial invoking module 203 is used for invoking the adversarial algorithm to train the content encoder and the content discriminator until the Nash equilibrium state is reached.
In at least one embodiment of the present invention, during the training of the initial speaker voice conversion model, in order to separate the non-timbre information in the training data from the original audio, the content encoder and the content discriminator are trained in an adversarial learning manner: the content discriminator tries to distinguish which speaker a received content code belongs to, while the content encoder tries to make its output indistinguishable to the content discriminator, and finally the content encoder and the content discriminator reach a Nash equilibrium state. When the Nash equilibrium state is reached, the content discriminator cannot judge which speaker a content code belongs to, which means the content code contains no timbre information at all.
Optionally, the invoking the adversarial algorithm to train the content encoder and the content discriminator until the Nash equilibrium state is reached comprises:
acquiring an initial cross entropy loss function corresponding to the content encoder and the content discriminator;
calling stochastic gradient descent and backpropagation algorithms to optimize the cross-entropy loss function to obtain a target cross-entropy loss function;
detecting whether the target cross entropy loss function is converged;
when the detection result is that the target cross entropy loss function is converged, the content encoder and the content discriminator reach a Nash equilibrium state;
and when the detection result is that the target cross entropy loss function is not converged, the content encoder and the content discriminator do not reach a Nash equilibrium state.
The initial cross-entropy loss function (formula 1) is given in the original as a formula image and is not reproduced here; it is defined in terms of the content encoder E_c, the content discriminator D_c and sampled audio x_i, with D_c predicting which speaker a content code E_c(x_i) belongs to.
For the initial cross-entropy loss function, the content encoder wants to maximize the function while the content discriminator wants to minimize it. The cross-entropy loss function is optimized through stochastic gradient descent and backpropagation until it converges, at which point the content encoder and the content discriminator reach a Nash equilibrium state; otherwise, the content encoder and the content discriminator have not reached a Nash equilibrium state.
Through model training, the invention makes the content encoder and the content discriminator reach a Nash equilibrium state. At this point the output of the content encoder no longer contains any speaker information, and the content discriminator cannot judge which speaker the content information comes from. This decouples content from attributes, avoids the interference of redundant information during speaker conversion, reduces model noise and improves the quality of speaker voice conversion.
The convergence detection module 204 is configured to obtain a total loss function of the domain discriminator, and detect whether the total loss function converges.
In at least one embodiment of the present invention, the total loss function of the domain discriminator includes a plurality of sections, and the total loss function can be obtained by weighting and processing the sub-loss functions of the respective sections. Determining whether the model is trained by detecting whether the total loss function is converged, wherein it can be understood that when the detection result is that the total loss function is converged, the model is trained; and when the detection result is that the total loss function is not converged, determining that the model is not trained, and continuing iterative training until the total loss function is converged.
Optionally, the obtaining the total loss function of the domain discriminator includes:
acquiring a target sub-loss function of the domain discriminator, wherein the target sub-loss function comprises a target cross entropy loss function, a target consistency loss function, a target domain discriminator loss function, a target reconstruction loss function, a target KL loss function and a target attribute loss function;
determining a preset weight value of each target sub-loss function;
and weighting and processing the preset weight value and the target sub-loss function to obtain a total loss function of the domain discriminator.
The target cross entropy loss function is as described in the above formula 1, and is not described herein again.
In one embodiment, (x, y) is first defined as any two samples, i.e., two MFCC features and two fundamental frequency features, together with their corresponding speaker labels. Passing x and y through the content encoder E_c and the attribute encoder E_a yields their content codes E_c(x), E_c(y) and attribute codes E_a(x), E_a(y). Cross-feeding the content codes and attribute codes into the generator gives u = G(E_c(y), E_a(x)) and v = G(E_c(x), E_a(y)); u and v are then cross-fed into the generator in the same way to obtain the reconstructions x̂ and ŷ. It is easy to see that the content code of x̂ comes from v and its attribute code comes from u, while the content code of v comes from x and the attribute code of u comes from x; therefore both the content code and the attribute code of x̂ come from x, so x̂ should be exactly the same as x. In the same way, ŷ should be identical to y. Based on this consistency principle, a target consistency loss function is obtained (given in the original as a formula image and not reproduced here), where G is the generator, E_c is the content encoder, E_a is the attribute encoder, and x_i and y_i are sampled audio.
In one embodiment, the target domain discriminator loss function (given in the original as a formula image and not reproduced here) is defined in terms of the domain discriminator D_domain and the generator G.
In one embodiment, the target reconstruction loss function (given in the original as a formula image and not reproduced here) is defined in terms of the generator G, the content encoder E_c, the attribute encoder E_a and sampled audio x_i; it penalizes the difference between a sample x_i and the generator output obtained from that sample's own content code and attribute code.
In one embodiment, the attribute code is constrained to follow the standard normal distribution N(0,1); the target KL loss function is therefore the KL divergence between the attribute code distribution function and the N(0,1) distribution function.
In one embodiment, a code z is randomly sampled from the N(0,1) normal distribution and used as the attribute code, the content code of any sample x is selected as the content code, and the two are fed into the generator together to obtain a generated sample. Feeding this generated sample into the attribute encoder should still recover z, so the target attribute loss function (given in the original as a formula image and not reproduced here) is defined in terms of the generator G, the content encoder E_c, the attribute encoder E_a and the sampled attribute code z_i.
After all the target sub-loss functions are determined, a preset weight value of each target sub-loss function is determined, and the preset weight values and the target sub-loss functions are combined by weighted summation to obtain the total loss function of the domain discriminator. The preset weight values are adjusted according to experimental results and are not limited herein.
The model determining module 205 is configured to determine a voice conversion model of the target speaker when the detection result is that the total loss function converges.
In at least one embodiment of the present invention, when the detection result is that the total loss function is converged, each model parameter value of the current speaker voice conversion model is determined, and the target speaker voice conversion model is determined based on each model parameter value, and then the target speaker voice conversion model can be called to perform speaker voice conversion.
Optionally, after the target speaker voice conversion model is determined, in order to synthesize multi-modal outputs when the attribute encoder is called to extract the timbre information in the training data to obtain the attribute encoding, the model determining module 205 further includes:
calling the attribute encoder to extract the tone information in the training data to obtain an attribute encoding set;
and normalizing the attribute coding set to obtain normal distribution corresponding to the attribute coding set.
The data processing of normalizing the attribute coding set to obtain the corresponding normal distribution N(0,1) is the prior art and is not described herein again. By fitting the attribute coding set to a normal distribution, a plurality of pieces of simulated timbre information can be obtained, which increases the amount of available timbre information, provides more choices for speaker voice conversion, and improves the flexibility of the speaker voice conversion processing.
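A minimal sketch of this normalization and sampling step, assuming `attribute_codes` is a (num_utterances, code_dim) tensor of codes produced by the attribute encoder:

```python
import torch

def normalize_attribute_codes(attribute_codes):
    # Standardize the attribute coding set per dimension so that it
    # approximately follows N(0, 1).
    mean = attribute_codes.mean(dim=0)
    std = attribute_codes.std(dim=0).clamp_min(1e-6)
    return (attribute_codes - mean) / std

def sample_simulated_timbres(num_samples, code_dim):
    # After normalization, new "simulated" timbre codes can simply be drawn
    # from the standard normal distribution.
    return torch.randn(num_samples, code_dim)
```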
The encoding processing module 206 is configured to obtain an audio to be converted and a target audio, call the content encoder to process the audio to be converted to obtain a target content encoding, and call the attribute encoder to process the target audio to obtain a target attribute encoding.
In at least one embodiment of the present invention, the audio to be converted refers to the audio that needs to undergo speaker voice conversion processing, and the target audio refers to audio that contains the timbre information of the target speaker. The target audio may be an audio corpus in the training data, an attribute code sampled from the normal distribution, or the timbre information of an unseen speaker; an unseen speaker may be selected by a user and belongs neither to the training data nor to the normal distribution data. The required speaker voice can be obtained by combining the timbre information in the target audio with the content code of the audio to be converted.
Optionally, before the invoking the content encoder to process the audio to be converted to obtain the target content encoding, and the invoking the attribute encoder to process the target audio to obtain the target attribute encoding, the encoding processing module 206 further includes:
acquiring first source speech of the target content code;
acquiring second source speech of the target attribute code;
detecting whether the first source speech and the second source speech are the same;
and when the detection result is that the first source speech is different from the second source speech, inputting the target content code and the target attribute code to the generator.
The first source speech refers to the speaker information corresponding to the target content code, and the second source speech refers to the speaker information corresponding to the target attribute code. Whether the first source speech and the second source speech are the same is detected; when the detection result is that they are the same, speaker voice conversion processing does not need to be performed; when the detection result is that they are different, speaker voice conversion processing can be performed, and the target content code and the target attribute code are input to the generator.
In at least one embodiment of the invention, the attribute encoder is called to process the target audio to obtain the target attribute code; since the speaker information is extracted directly by the attribute encoder, voice conversion for unseen speakers can be realized, i.e., zero-shot conversion, which improves the flexibility of speaker voice conversion scenarios.
The voice conversion module 207 is configured to input the target content code and the target attribute code to the generator, so as to obtain a converted speaker voice.
In at least one embodiment of the present invention, the target content code and the target attribute code are input to the generator for conversion processing, so as to obtain the speaker voice.
Optionally, the inputting the target content code and the target attribute code into the generator, and the obtaining the converted speaker voice includes:
convolution processing the target content code to obtain a first convolution code;
performing convolution processing on the target attribute code to obtain a second convolution code;
splicing the first convolutional code and the second convolutional code to obtain a target convolutional code;
and inputting the target convolutional code into the generator to obtain the converted speaker voice.
A 1 × 1 convolution operation is performed on the target content code to obtain the first convolution code, and a 1 × 1 convolution operation is performed on the target attribute code to obtain the second convolution code.
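A minimal sketch of this fusion step, assuming both codes are (batch, channels, time) tensors with matching time length, and that the channel sizes and the `generator` callable are illustrative placeholders rather than values taken from the patent:

```python
import torch
import torch.nn as nn

class CodeFusion(nn.Module):
    def __init__(self, content_channels=256, attribute_channels=128):
        super().__init__()
        # 1x1 convolutions applied separately to the two codes.
        self.content_conv = nn.Conv1d(content_channels, content_channels, kernel_size=1)
        self.attribute_conv = nn.Conv1d(attribute_channels, attribute_channels, kernel_size=1)

    def forward(self, content_code, attribute_code, generator):
        first = self.content_conv(content_code)        # first convolution code
        second = self.attribute_conv(attribute_code)   # second convolution code
        target = torch.cat([first, second], dim=1)     # spliced target convolution code
        return generator(target)                       # converted speaker voice
```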
By adopting the above method, speaker voice conversion over many-to-many non-parallel corpora is realized by means of adversarial learning, and since the method does not require parallel corpora as training data, the difficulty of data collection can be greatly reduced; in addition, the invention introduces a content encoder and an attribute encoder, and uses the content encoder and the content discriminator to decompose the training data into non-timbre information and timbre information, thereby synthesizing audio of higher quality.
It is emphasized that the training data may be stored in the nodes of the blockchain in order to further ensure privacy and security of the training data.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 does not constitute a limitation of the embodiments of the present invention, and may be a bus-type configuration or a star-type configuration, and that the computer device 3 may include more or less hardware or software than those shown, or a different arrangement of components.
In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 3 is only an example, and other electronic products that are currently available or may come into existence in the future, such as electronic products that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.
In some embodiments, the memory 31 stores a computer program that, when executed by the at least one processor 32, performs all or a portion of the steps of the counterlearning-based speaker voice conversion method as described. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some embodiments, the at least one processor 32 is a Control Unit (Control Unit) of the computer device 3, connects various components of the entire computer device 3 by using various interfaces and lines, and executes various functions and processes data of the computer device 3 by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or a portion of the steps of the counterlearning-based speaker speech conversion method described in embodiments of the present invention; or implement all or part of the functionality of the speaker-to-speech conversion apparatus based on counterlearning. The at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or devices in the present invention may also be implemented by one unit or device through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for speaker voice conversion based on antagonistic learning, the method comprising:
acquiring and preprocessing training data to obtain MFCC characteristics and fundamental frequency characteristics, wherein the training data comprises audio corpora of a plurality of speakers;
inputting the MFCC features and the fundamental frequency features to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator;
invoking a countermeasure algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached;
acquiring a total loss function of the domain discriminator, and detecting whether the total loss function is converged;
when the detection result is that the total loss function is converged, determining a target speaker voice conversion model;
acquiring audio to be converted and target audio, calling the content encoder to process the audio to be converted to obtain target content codes, and calling the attribute encoder to process the target audio to obtain target attribute codes;
and inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
2. The method of claim 1, wherein preprocessing the training data to obtain MFCC features and fundamental frequency features comprises:
calling a world vocoder to extract an initial MFCC feature and an initial fundamental frequency feature of the training data;
determining a target fixed length;
and intercepting the initial MFCC features and the initial fundamental frequency features according to the target fixed length to obtain target MFCC features and target fundamental frequency features.
3. The method of claim 1, wherein said invoking a countermeasure algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached comprises:
acquiring an initial cross entropy loss function corresponding to the content encoder and the content discriminator;
calling a stochastic gradient descent and back propagation algorithm to optimize the initial cross entropy loss function to obtain a target cross entropy loss function;
detecting whether the target cross entropy loss function is converged;
when the detection result is that the target cross entropy loss function is converged, the content encoder and the content discriminator reach a Nash equilibrium state;
and when the detection result is that the target cross entropy loss function is not converged, the content encoder and the content discriminator do not reach a Nash equilibrium state.
4. The method of claim 1, wherein obtaining the overall loss function of the domain discriminator comprises:
acquiring a target sub-loss function of the domain discriminator, wherein the target sub-loss function comprises a target cross entropy loss function, a target consistency loss function, a target domain discriminator loss function, a target reconstruction loss function, a target KL loss function and a target attribute loss function;
determining a preset weight value of each target sub-loss function;
and weighting and processing the preset weight value and the target sub-loss function to obtain a total loss function of the domain discriminator.
5. The method of claim 1, wherein after the determining the target speaker voice conversion model, the method further comprises:
calling the attribute encoder to extract the tone information in the training data to obtain an attribute encoding set;
and normalizing the attribute coding set to obtain normal distribution corresponding to the attribute coding set.
6. The method as claimed in claim 1, wherein before the invoking the content encoder to process the audio to be converted to obtain a target content encoding and the invoking the property encoder to process the target audio to obtain a target property encoding, the method further comprises:
acquiring first source speech of the target content code;
acquiring second source speech of the target attribute code;
detecting whether the first source speech and the second source speech are the same;
and when the detection result is that the first source speech is different from the second source speech, inputting the target content code and the target attribute code to the generator.
7. The method of claim 1, wherein inputting the target content encoding and the target attribute encoding to the generator results in a converted speaker speech comprises:
convolution processing the target content code to obtain a first convolution code;
performing convolution processing on the target attribute code to obtain a second convolution code;
splicing the first convolutional code and the second convolutional code to obtain a target convolutional code;
and inputting the target convolutional code into the generator to obtain the converted speaker voice.
8. A speaker voice conversion apparatus based on counterstudy, the apparatus comprising:
the preprocessing module is used for acquiring and preprocessing training data to obtain MFCC characteristics and fundamental frequency characteristics, wherein the training data comprises audio corpora of a plurality of speakers;
the model training module is used for inputting the MFCC characteristics and the fundamental frequency characteristics to an initial speaker voice conversion model for training, wherein the initial speaker voice conversion model comprises a content encoder, an attribute encoder, a content discriminator, a generator and a domain discriminator;
the confrontation calling module is used for calling a confrontation algorithm to train the content encoder and the content discriminator until a Nash equilibrium state is reached;
a convergence detection module, configured to obtain a total loss function of the domain discriminator and detect whether the total loss function converges;
the model determining module is used for determining a target speaker voice conversion model when the detection result is that the total loss function is converged;
the coding processing module is used for acquiring audio to be converted and target audio, calling the content coder to process the audio to be converted to obtain target content coding, and calling the attribute coder to process the target audio to obtain target attribute coding;
and the voice conversion module is used for inputting the target content code and the target attribute code to the generator to obtain the converted speaker voice.
9. A computer device comprising a processor for implementing the counterlearning-based speaker speech conversion method according to any one of claims 1 to 7 when executing a computer program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the counterlearning-based speaker speech conversion method according to any one of claims 1 to 7.
CN202011632876.XA 2020-12-31 2020-12-31 Speaker voice conversion method based on countermeasure learning and related equipment Active CN112863529B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011632876.XA CN112863529B (en) 2020-12-31 2020-12-31 Speaker voice conversion method based on countermeasure learning and related equipment
PCT/CN2021/096887 WO2022142115A1 (en) 2020-12-31 2021-05-28 Adversarial learning-based speaker voice conversion method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011632876.XA CN112863529B (en) 2020-12-31 2020-12-31 Speaker voice conversion method based on countermeasure learning and related equipment

Publications (2)

Publication Number Publication Date
CN112863529A true CN112863529A (en) 2021-05-28
CN112863529B CN112863529B (en) 2023-09-22

Family

ID=75999980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011632876.XA Active CN112863529B (en) 2020-12-31 2020-12-31 Speaker voice conversion method based on countermeasure learning and related equipment

Country Status (2)

Country Link
CN (1) CN112863529B (en)
WO (1) WO2022142115A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620748B (en) * 2022-12-06 2023-03-28 北京远鉴信息技术有限公司 Comprehensive training method and device for speech synthesis and false identification evaluation


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10186251B1 (en) * 2015-08-06 2019-01-22 Oben, Inc. Voice conversion using deep neural network with intermediate voice training
US10347241B1 (en) * 2018-03-23 2019-07-09 Microsoft Technology Licensing, Llc Speaker-invariant training via adversarial learning
CN110060691B (en) * 2019-04-16 2023-02-28 南京邮电大学 Many-to-many voice conversion method based on i-vector and VARSGAN
CN111161744B (en) * 2019-12-06 2023-04-28 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation
CN111564160B (en) * 2020-04-21 2022-10-18 重庆邮电大学 Voice noise reduction method based on AEWGAN

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293289A (en) * 2017-06-13 2017-10-24 南京医科大学 A kind of speech production method that confrontation network is generated based on depth convolution
KR20200063331A (en) * 2018-11-21 2020-06-05 고려대학교 산학협력단 Multiple speaker voice conversion using conditional cycle GAN
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN110060657A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on SN
CN110600046A (en) * 2019-09-17 2019-12-20 南京邮电大学 Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN111247585A (en) * 2019-12-27 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, device, equipment and storage medium
CN111243569A (en) * 2020-02-24 2020-06-05 浙江工业大学 Emotional voice automatic generation method and device based on generation type confrontation network
CN111429894A (en) * 2020-03-12 2020-07-17 南京邮电大学 Many-to-many speaker conversion method based on SE-ResNet STARGAN
CN111429893A (en) * 2020-03-12 2020-07-17 南京邮电大学 Many-to-many speaker conversion method based on Transitive STARGAN
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling
CN112037760A (en) * 2020-08-24 2020-12-04 北京百度网讯科技有限公司 Training method and device of voice spectrum generation model and electronic equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345453A (en) * 2021-06-01 2021-09-03 平安科技(深圳)有限公司 Singing voice conversion method, device, equipment and storage medium
CN113345453B (en) * 2021-06-01 2023-06-16 平安科技(深圳)有限公司 Singing voice conversion method, device, equipment and storage medium
CN113870876A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Singing voice conversion method and device based on self-supervision model and readable storage medium
CN113870876B (en) * 2021-09-27 2024-06-25 平安科技(深圳)有限公司 Singing voice conversion method, device and readable storage medium based on self-supervision model
CN115064177A (en) * 2022-06-14 2022-09-16 中国第一汽车股份有限公司 Voice conversion method, apparatus, device and medium based on voiceprint encoder
CN115222752A (en) * 2022-09-19 2022-10-21 之江实验室 Pathological image feature extractor training method and device based on feature decoupling
CN115222752B (en) * 2022-09-19 2023-01-24 之江实验室 Pathological image feature extractor training method and device based on feature decoupling

Also Published As

Publication number Publication date
WO2022142115A1 (en) 2022-07-07
CN112863529B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN112863529B (en) Speaker voice conversion method based on countermeasure learning and related equipment
CN110070852B (en) Method, device, equipment and storage medium for synthesizing Chinese voice
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN112086086A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN112185348A (en) Multilingual voice recognition method and device and electronic equipment
CN112906385B (en) Text abstract generation method, computer equipment and storage medium
CN113436634B (en) Voice classification method and device based on voiceprint recognition and related equipment
CN112951203B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113380223B (en) Method, device, system and storage medium for disambiguating polyphone
CN112820269A (en) Text-to-speech method, device, electronic equipment and storage medium
CN113420556A (en) Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN115688937A (en) Model training method and device
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116050352A (en) Text encoding method and device, computer equipment and storage medium
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN114420168A (en) Emotion recognition method, device, equipment and storage medium
CN112489628B (en) Voice data selection method and device, electronic equipment and storage medium
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113704410A (en) Emotion fluctuation detection method and device, electronic equipment and storage medium
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN114218356B (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN113436617B (en) Voice sentence breaking method, device, computer equipment and storage medium
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN113221990A (en) Information input method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant