CN116959465A - Voice conversion model training method, voice conversion method, device and medium - Google Patents

Voice conversion model training method, voice conversion method, device and medium Download PDF

Info

Publication number
CN116959465A
Authority
CN
China
Prior art keywords
sample
voice
speech
conversion model
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310688583.0A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310688583.0A priority Critical patent/CN116959465A/en
Publication of CN116959465A publication Critical patent/CN116959465A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application relates to the technical field of voice conversion, and provides a voice conversion model training method, a voice conversion method, a device and a medium. The method comprises the following steps: extracting voice sample features from a preset voice sample through an encoder; decoupling the voice sample features based on a preset mask strategy and a preset adversarial network to obtain a sample feature representation, and calculating the adversarial loss of the decoupling process; inputting the sample feature representation into a generator, and training the generator to reconstruct a Mel spectrogram of the voice sample from the sample feature representation to obtain a target sample Mel spectrogram; calculating the voice reconstruction loss of the voice conversion model from the target sample Mel spectrogram and the original sample Mel spectrogram corresponding to the preset voice sample; and optimizing the parameters of the voice conversion model based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model. Decoupling the voice sample features through the preset mask strategy and the preset adversarial network improves the robustness of the voice conversion model and thereby improves the training efficiency.

Description

Voice conversion model training method, voice conversion method, device and medium
Technical Field
The present application relates to the field of speech conversion technologies, and in particular, to a speech conversion model training method, a speech conversion method, a speech conversion model training device, a speech conversion device, a computer device, and a storage medium.
Background
Voice conversion changes the source speaker's voice so that it sounds like the target speaker's voice while keeping the linguistic content unchanged.
In the training of existing voice conversion models, the voice features are disentangled (decoupled) with heuristics such as random resampling or adjusting the bottleneck layer size. These methods make it difficult to guarantee robust decoupling of the voice features, which affects the whole training process and leaves the training efficiency of the voice conversion model low.
Disclosure of Invention
The embodiments of the present application provide a voice conversion model training method, which aims to solve the problem that existing voice conversion model training schemes have low training efficiency.
A first aspect of an embodiment of the present application provides a method for training a speech conversion model, including:
extracting voice sample characteristics from a preset voice sample through an encoder; the voice sample features comprise sample content features, sample tone features, sample rhythm features and sample pitch features;
decoupling the voice sample features based on a preset mask strategy and a preset adversarial network to obtain a sample feature representation, and calculating the adversarial loss of the decoupling process; the sample feature representation characterizes the enhanced voice sample features;
inputting the sample characteristic representation into a generator, and training the generator to reconstruct a speech sample Mel spectrogram according to the sample characteristic representation to obtain a target sample Mel spectrogram;
calculating voice reconstruction loss according to the original sample Mel spectrogram corresponding to the target sample Mel spectrogram and the preset voice sample;
and optimizing parameters in the voice conversion model based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model.
A second aspect of an embodiment of the present application provides a voice conversion method, including:
extracting voice information of a source speaker and a target speaker; the voice information comprises voice content information, tone information, rhythm information and pitch information;
inputting the voice information into a trained voice conversion model for voice conversion to obtain a target Mel spectrogram; the trained voice conversion model is obtained by training with the voice conversion model training method described above;
And converting the target Mel spectrogram into a waveform by adopting a preset algorithm to obtain the synthesized voice.
A third aspect of an embodiment of the present application provides a speech conversion model training apparatus, including:
an extraction module, configured to extract voice sample features from a preset voice sample through an encoder; the voice sample features comprise sample content features, sample tone features, sample rhythm features and sample pitch features;
a decoupling module, configured to decouple the voice sample features based on a preset mask strategy and a preset adversarial network to obtain a sample feature representation, and to calculate the adversarial loss of the decoupling process; the sample feature representation characterizes the enhanced voice sample features;
a reconstruction module, configured to input the sample feature representation into a generator and train the generator to reconstruct a voice sample Mel spectrogram from the sample feature representation, obtaining a target sample Mel spectrogram;
a calculation module, configured to calculate the voice reconstruction loss according to the target sample Mel spectrogram and the original sample Mel spectrogram corresponding to the preset voice sample;
a training module, configured to optimize parameters in the voice conversion model based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model.
A fourth aspect of an embodiment of the present application provides a voice conversion apparatus, including:
an extraction module, configured to extract voice information of a source speaker and a target speaker; the voice information comprises voice content information, tone information, rhythm information and pitch information;
a first conversion module, configured to input the voice information into a trained voice conversion model for voice conversion to obtain a target Mel spectrogram; the trained voice conversion model is obtained by training with the voice conversion model training method described above;
a second conversion module, configured to convert the target Mel spectrogram into a waveform by adopting a preset algorithm to obtain the synthesized voice.
A fifth aspect of an embodiment of the present application provides a computer device, including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, where the processor implements the above-described speech conversion model training method when executing the computer readable instructions, or the processor implements the above-described speech conversion method when executing the computer readable instructions.
A sixth aspect of embodiments of the present application provides one or more readable storage media storing computer-readable instructions that, when executed by one or more processors, implement a speech conversion model training method as described above, or that, when executed by one or more processors, implement a speech conversion method as described above.
The embodiments of the present application provide a voice conversion model training method. Voice sample features are extracted from a preset voice sample through an encoder, the voice sample features comprising sample content features, sample tone features, sample rhythm features and sample pitch features. The voice sample features are then decoupled based on a preset mask strategy, which enhances them into a sample feature representation; by calculating the adversarial loss of the decoupling process, distortion of the voice sample features is reduced as much as possible and a more accurate sample feature representation is obtained, alleviating the feature mismatch that would otherwise affect the robustness of training after the sample feature representation is input into the generator. The decoupled sample feature representation is input into the generator, which is trained to reconstruct a voice sample Mel spectrogram from the sample feature representation, yielding a target sample Mel spectrogram; the voice reconstruction loss of the voice conversion model is then calculated from the target sample Mel spectrogram and the original sample Mel spectrogram corresponding to the preset voice sample. The parameters of the voice conversion model are optimized based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model. Decoupling the voice sample features through the preset mask strategy and the preset adversarial network reduces the distortion of the voice sample features, improves the robustness of training the voice conversion model, and thereby improves the training efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a speech conversion model training method or a speech conversion method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an implementation flow of a speech conversion model training method in an embodiment of the present application;
FIG. 3 is an exemplary diagram of a speech conversion model for a speech conversion model training method in accordance with an embodiment of the present application;
FIG. 4 is a diagram of an exemplary decoupling network of a speech conversion model training method in accordance with an embodiment of the present application;
FIG. 5 is a schematic diagram of a voice conversion method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a training device for speech conversion model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a voice conversion device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device in accordance with an embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 shows a schematic diagram of an application environment of the speech conversion model training method in an embodiment of the present application. As shown in fig. 1, a user terminal may upload a preset speech sample or the speech information of a source speaker and a target speaker, and the training of the speech conversion model and the speech conversion may then be performed by a server; alternatively, they may be performed by a user terminal that includes a processor, a computer storage medium, and the like. User terminals include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. User terminals of different service systems may interact with a server at the same time, or with a particular server in a cluster of servers.
The embodiments of the present application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Referring to fig. 2, fig. 2 is a flowchart showing an implementation of a speech conversion model training method according to an embodiment of the present application, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
the voice conversion model in the embodiment of the application comprises the following steps: the system comprises an encoder, a decoupling network and a generator, wherein the decoupling network comprises a preset mask strategy and a preset countermeasure network.
S11: extracting voice sample characteristics from a preset voice sample through an encoder; the speech sample features include sample content features, sample tone features, sample cadence features, and sample pitch features.
In step S11, the encoder includes, but is not limited to, a content encoder, a tone encoder, a rhythm encoder, a pitch encoder.
In this embodiment, the content encoder is configured to recognize the input speech and perform text conversion to obtain speech content information unrelated to the speaker; ASR (Automatic Speech Recognition) or the like may be used as the content encoder. The timbre encoder is used for extracting timbre features, and its input is a timbre vector. The input of the rhythm encoder is speech containing a large amount of information, so various irrelevant information is likely to be encoded into the rhythm features; the input speech may therefore be preprocessed before the rhythm features are extracted, filtering out information unrelated to rhythm and improving the accuracy of rhythm feature extraction. The input to the pitch encoder is the pitch contour information, i.e., the fundamental frequency information, of the speech.
As an example, referring to fig. 3, fig. 3 is a diagram illustrating the voice conversion model of the training method provided by an embodiment of the present application. As shown in fig. 3, one-hot encoding is used to obtain the timbre vector of a preset voice sample, the timbre vector is then input into the timbre encoder, and deep learning is performed on the input timbre vector to obtain the sample timbre feature; the timbre encoder may use word-embedding encoding to realize deep learning on the input timbre vector. One-hot encoding generates a distinct 0-1 label for each preset voice sample according to the number of voice samples in the training corpus. For example, with three preset voice samples 1, 2 and 3, the one-hot label of the first preset voice sample is [100], the one-hot label of the second preset voice sample is [010], and the one-hot label of the third preset voice sample is [001]; the one-hot label corresponding to a preset voice sample is used as its timbre vector and input into the timbre encoder. Furthermore, before inputting the speech samples into the content encoder and the pitch encoder, the input speech samples and the pitch contour need to be randomly sampled in advance to improve the accuracy of the sample features during training.
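As a minimal illustration of the one-hot labels described above (a sketch only; the helper name and the use of PyTorch are assumptions, not part of the patent):

```python
import torch

def one_hot_timbre_vector(speaker_index: int, num_speakers: int) -> torch.Tensor:
    """Return a one-hot timbre vector, e.g. speaker 0 of 3 -> [1, 0, 0]."""
    vec = torch.zeros(num_speakers)
    vec[speaker_index] = 1.0
    return vec

# The three preset voice samples from the example above.
print(one_hot_timbre_vector(0, 3))  # tensor([1., 0., 0.])
print(one_hot_timbre_vector(1, 3))  # tensor([0., 1., 0.])
print(one_hot_timbre_vector(2, 3))  # tensor([0., 0., 1.])
```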
S12: decoupling the voice sample characteristics based on a preset mask strategy and a preset countermeasure network to obtain sample characteristic representation, and calculating countermeasure loss in the decoupling process; the sample feature representation is used to characterize the enhanced speech sample feature.
In step S12, the decoupling network includes a preset masking policy and a preset countermeasure network, and the decoupling refers to separating the voice sample features through the countermeasure training learning, so as to enhance the voice sample features.
In this embodiment, a preset masking policy and a preset countermeasure network are configured to separate the voice sample features through countermeasure training learning, so as to enhance the voice sample features. The prior decoupling realized by adopting random resampling and temporary bottleneck layer size adjustment is avoided, the robustness of the decoupling is difficult to ensure, and the robustness of the training of the voice conversion model is further influenced.
As an embodiment of the present application, the preset adversarial network includes a prediction layer and a gradient reversal layer; the decoupling of the voice sample features based on the preset mask strategy and the preset adversarial network to obtain a sample feature representation, and calculating the adversarial loss of the decoupling process, comprises the following steps: generating a random mask based on the preset mask strategy, wherein the random mask randomly masks one of the sample content feature, the sample tone feature, the sample rhythm feature and the sample pitch feature, so that the prediction layer predicts the masked sample feature based on the three other sample features; calculating the adversarial loss based on the random mask and the voice sample features; and decoupling the voice sample features based on the gradient reversal layer and the adversarial loss to obtain the sample feature representation.
In this embodiment, a masked sample feature is represented by 0 and an unmasked sample feature by 1, so the possible random masks are (0,1,1,1), (1,0,1,1), (1,1,0,1) and (1,1,1,0). The random mask randomly masks one of the sample content feature, the sample tone feature, the sample rhythm feature and the sample pitch feature produced by the encoder, and the prediction layer of the adversarial network predicts the masked sample feature from the three unmasked sample features. The adversarial loss is then calculated from the random mask and the voice sample features and back-propagated to the encoder through the gradient reversal layer, encouraging the voice sample features learned by the encoder to contain as little mutual information as possible.
As an example, referring to fig. 4, fig. 4 is a diagram illustrating the decoupling network of the speech conversion model training method according to an embodiment of the present application. The prediction layer of the adversarial network consists of a fully connected layer, an activation function, layer normalization, and another fully connected layer. The gradient of the adversarial network is reversed by the gradient reversal layer before being propagated back to the encoder, again encouraging the encoder to learn voice sample features that share as little mutual information as possible. Through this decoupling network with random-mask prediction, the voice sample features are separated and the robustness of multi-factor, highly controllable style transfer during training of the voice conversion model is improved.
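The following is a hedged sketch of the two pieces just described: a gradient reversal layer, and a prediction head built from a fully connected layer, an activation function, layer normalization and a second fully connected layer. All module names, the hidden size and the use of PyTorch are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back to the encoder.
        return -ctx.lam * grad_output, None

class PredictionLayer(nn.Module):
    """Predicts the masked feature group from the unmasked ones:
    fully connected -> activation -> layer normalization -> fully connected."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, masked_features: torch.Tensor) -> torch.Tensor:
        # Gradients flowing back past this point reach the encoder reversed.
        return self.net(GradReverse.apply(masked_features, 1.0))
```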
As an embodiment of the present application, the calculating the adversarial loss based on the random mask and the voice sample features includes:
calculating the adversarial loss according to the following formula:
L_adv = ||(1 - M) · (Z - MAP(M · Z))||,
wherein Z = (Z_r, Z_c, Z_f, Z_u) and M ∈ {(0,1,1,1), (1,0,1,1), (1,1,0,1), (1,1,1,0)};
wherein L_adv refers to the adversarial loss; M refers to the random mask; Z_r refers to the sample rhythm feature, Z_c refers to the sample content feature, Z_f refers to the sample pitch feature, and Z_u refers to the sample tone feature; Z_r, Z_c, Z_f and Z_u are concatenated as vectors; and MAP refers to the mean average precision. It is noted that MAP is an abbreviation of Mean Average Precision, an index used in object detection to measure detection accuracy, calculated as the sum of the average precisions over all classes divided by the number of classes.
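A small sketch of the adversarial-loss formula above may clarify the computation; here `predictor` stands in for the prediction layer (written MAP in the formula), Z is the concatenation of the four feature groups, and the tensor shapes and norm choice are assumptions for illustration.

```python
import torch

def adversarial_loss(predictor, z: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L_adv = || (1 - M) * (Z - predictor(M * Z)) ||."""
    masked_input = mask * z               # hide one of the four feature groups
    prediction = predictor(masked_input)  # try to recover the hidden group
    return torch.norm((1.0 - mask) * (z - prediction))
```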
S13: inputting the sample characteristic representation into a generator, and training the generator to reconstruct a speech sample Mel spectrogram according to the sample characteristic representation to obtain a target sample Mel spectrogram.
In step S13, the sample feature representation exists in the form of vectors, including but not limited to a sample content representation, a sample tone representation, a sample rhythm representation and a sample pitch representation.
In this embodiment, the sample content representation, sample tone representation, sample rhythm representation and sample pitch representation obtained by decoupling are extracted, the sample feature representations are input into the generator for feature fusion to obtain a fusion vector, and the fusion vector is decoded according to the characteristics of the Mel spectral coefficients to obtain the target sample Mel spectrogram. It should be noted that the sample content representation, sample tone representation, sample rhythm representation and sample pitch representation may be representations of the same dimension or vector representations of different dimensions, and feature fusion of these representations can produce a higher-dimensional vector; for example, a 128-dimensional content representation, a 64-dimensional tone representation, a 32-dimensional rhythm representation and a 32-dimensional pitch representation may be fused (for example, concatenated and then projected) into a higher-dimensional fusion vector such as a 512-dimensional vector.
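A brief sketch of such feature fusion, using the example dimensions quoted above; concatenation followed by a linear projection to 512 dimensions is one possible realization and is an assumption, not the patent's prescribed operation.

```python
import torch
import torch.nn as nn

content = torch.randn(1, 128)  # sample content representation
tone    = torch.randn(1, 64)   # sample tone (timbre) representation
rhythm  = torch.randn(1, 32)   # sample rhythm representation
pitch   = torch.randn(1, 32)   # sample pitch representation

concat = torch.cat([content, tone, rhythm, pitch], dim=-1)  # 256-dimensional
fusion = nn.Linear(concat.size(-1), 512)(concat)            # projected fusion vector
print(fusion.shape)  # torch.Size([1, 512])
```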
S14: and calculating the speech reconstruction loss according to the original sample Mel spectrogram corresponding to the target sample Mel spectrogram and the preset speech sample.
In step S14, the original sample mel-frequency spectrum is obtained by a mel-frequency spectrum filter according to the original voice sample characteristics of the input preset voice sample.
In this embodiment, since the encoder and the voice sample features change during the adversarial training of the voice conversion model, there is also a difference between the target sample Mel spectrogram synthesized by the voice conversion model and the original sample Mel spectrogram; this difference is represented by the speech reconstruction loss.
As an embodiment of the present application, the calculating the speech reconstruction loss according to the target sample Mel spectrogram and the original sample Mel spectrogram corresponding to the preset voice sample includes:
calculating the speech reconstruction loss as follows:
L_recon = ||S - Ŝ||,
wherein L_recon refers to the speech reconstruction loss; S refers to the original sample Mel spectrogram; and Ŝ refers to the target sample Mel spectrogram.
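A minimal sketch of this reconstruction loss between the two Mel spectrograms; the plain norm mirrors the formula above, and other distances (for example, the squared error) would be equally plausible.

```python
import torch

def reconstruction_loss(s_orig: torch.Tensor, s_target: torch.Tensor) -> torch.Tensor:
    """L_recon = || S - S_hat ||, the distance between the two Mel spectrograms."""
    return torch.norm(s_orig - s_target)
```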
S15: and optimizing parameters in the voice conversion model based on the pair loss resistance and the voice reconstruction loss to obtain a trained voice conversion model.
In step S15, the countermeasures against loss are errors generated during the decoupling of the speech sample feature, which represents the speech sample feature. The speech reconstruction loss refers to an error between a target sample mel spectrum generated by the generator based on the sample feature representation and an original sample mel spectrum of a corresponding input preset speech sample. And constructing a model loss function of the voice conversion model according to the antagonism loss and the voice reconstruction loss.
In this embodiment, weights are respectively assigned to the countermeasures and the speech reconstruction losses, the speech conversion model is trained based on the countermeasures and the speech reconstruction losses, parameters in the speech conversion model are optimized, and weight values of the countermeasures and the speech reconstruction losses are adjusted, so that values of a model loss function can meet model convergence conditions, and a trained speech conversion model is obtained.
As an embodiment of the present application, the optimizing the parameters in the voice conversion model based on the adversarial loss and the speech reconstruction loss to obtain a trained voice conversion model includes:
calculating the model loss as follows:
L = α · L_adv + β · L_recon,
wherein L refers to the model loss, α refers to the weight of the adversarial loss, β refers to the weight of the speech reconstruction loss, and α and β take values in [0, 1];
And when the model loss reaches a preset convergence condition, the voice conversion model converges to obtain a trained voice conversion model. The preset convergence condition can be a specific value or a range of values, and the size or the value of the convergence condition can be customized, so that the model loss is as small as possible, and the accuracy of data output of the voice conversion model is improved.
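A hedged sketch of the weighted model loss and a typical optimization step; the weight values, learning rate and optimizer are illustrative assumptions rather than values given by the patent.

```python
import torch

alpha, beta = 0.5, 1.0  # example weights in [0, 1]

def model_loss(l_adv: torch.Tensor, l_recon: torch.Tensor) -> torch.Tensor:
    """L = alpha * L_adv + beta * L_recon."""
    return alpha * l_adv + beta * l_recon

# A typical optimization step (optimizer choice is an assumption):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = model_loss(l_adv, l_recon)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```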
The embodiments of the present application provide a voice conversion model training method in which the voice conversion model comprises an encoder, a decoupling network and a generator, the decoupling network comprising a preset mask strategy and a preset adversarial network. Voice sample features are extracted from a preset voice sample through the encoder, the voice sample features comprising sample content features, sample tone features, sample rhythm features and sample pitch features. The voice sample features are then decoupled based on the preset mask strategy, which enhances them into a sample feature representation; by calculating the adversarial loss of the decoupling process, distortion of the voice sample features is reduced as much as possible and a more accurate sample feature representation is obtained, alleviating the feature mismatch that would otherwise affect the robustness of training after the sample feature representation is input into the generator. The decoupled sample feature representation is input into the generator, which is trained to reconstruct a voice sample Mel spectrogram from the sample feature representation, yielding a target sample Mel spectrogram; the voice reconstruction loss of the voice conversion model is then calculated from the target sample Mel spectrogram and the original sample Mel spectrogram corresponding to the preset voice sample. The parameters of the voice conversion model are optimized based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model. Decoupling the voice sample features through the preset mask strategy and the preset adversarial network reduces the distortion of the voice sample features, improves the robustness of training the voice conversion model, and thereby improves the training efficiency.
Referring to fig. 5, fig. 5 is a flowchart showing an implementation of a voice conversion method according to an embodiment of the present application, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
s21: extracting voice information of a source speaker and a target speaker; the voice information includes voice content information, tone information, rhythm information, and pitch information.
In step S21, the voice information of the source speaker is the voice information to be converted: when the voice of the source speaker needs to be converted, it is taken as the voice to be converted.
In this embodiment, before the conversion of the voice of the source speaker, the voice of the source speaker and the voice information of the target speaker need to be obtained, and specifically, the whole audio or part of the audio may be extracted from the video file, the audio file, or the like as the voice of the source speaker or the voice information of the target speaker. Wherein the voice information includes, but is not limited to, voice content information, tone information, rhythm information, and pitch information.
S22: inputting the voice information into a trained voice conversion model for voice conversion to obtain a target Mel spectrogram; the trained voice conversion model is obtained by training by adopting the voice conversion model training method.
In step S22, the target mel-frequency spectrogram is a new mel-frequency spectrogram of the voice obtained after the voice conversion by the trained voice conversion model.
In this embodiment, inputting the voice information into the trained voice conversion model to perform voice conversion, and obtaining the target mel spectrogram includes: inputting the voice information of the source speaker to a content encoder of the trained voice conversion model to extract content characteristics irrelevant to the speaker; respectively inputting the voice information of the target speaker into a tone encoder, a rhythm encoder and a pitch encoder of the trained voice conversion model to extract tone characteristics, rhythm characteristics and pitch characteristics of the target speaker; and generating a target Mel spectrogram based on the content characteristics and the tone characteristics, rhythm characteristics and pitch characteristics of the target speaker through a trained voice conversion model. As an implementation manner, before inputting the voice information of the target speaker to the timbre encoder, the rhythm encoder and the pitch encoder of the trained voice conversion model, the voice information of the target speaker may be preprocessed, for example, a pitch contour of the voice information of the target speaker is extracted, and the pitch contour is randomly sampled and then input to the pitch encoder.
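The conversion step described above can be sketched as follows; every encoder and generator attribute name is an assumption standing in for the corresponding component of the trained voice conversion model.

```python
def convert(model, source_speech, target_speech, target_pitch_contour):
    content = model.content_encoder(source_speech)         # speaker-independent content
    timbre  = model.timbre_encoder(target_speech)           # target speaker timbre
    rhythm  = model.rhythm_encoder(target_speech)           # target speaker rhythm
    pitch   = model.pitch_encoder(target_pitch_contour)     # target pitch contour
    return model.generator(content, timbre, rhythm, pitch)  # target Mel spectrogram
```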
S23: and converting the target Mel spectrogram into a waveform by adopting a preset algorithm to obtain the synthesized voice.
In step S23, the preset algorithm includes, but is not limited to, the Griffin-Lim algorithm.
In this embodiment, the Griffin-Lim algorithm works as follows: a phase spectrum is randomly initialized, a new voice waveform is synthesized from this phase spectrum and the known target Mel spectrogram through an inverse Fourier transform, a short-time Fourier transform is performed on the synthesized voice to obtain a new amplitude spectrum and a new phase spectrum, and a new voice is again synthesized from the known target Mel spectrogram and the new phase spectrum through an inverse Fourier transform; this is repeated several times until the synthesized voice achieves a satisfactory effect.
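A short sketch of converting the target Mel spectrogram into a waveform; librosa's mel_to_audio runs Griffin-Lim internally, and the sample rate, FFT size, hop length and iteration count are assumed values for illustration.

```python
import numpy as np
import librosa

def mel_to_waveform(target_mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    # Invert the Mel spectrogram to audio; Griffin-Lim iterations run internally.
    return librosa.feature.inverse.mel_to_audio(
        target_mel, sr=sr, n_fft=1024, hop_length=256, n_iter=60
    )
```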
This embodiment provides a voice conversion method that adds the conversion of the rhythm and pitch characteristics of the target speaker to the voice conversion process, so that the rhythm of the converted voice of the source speaker is kept consistent with that of the target speaker, improving the voice conversion effect. Based on the adversarially learned decoupled voice representation network in the trained voice conversion model, the content representation of the source speaker's voice and the timbre, rhythm and pitch representations of the target speaker are extracted, which improves the robustness of multi-factor, highly controllable style transfer during voice conversion.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In one embodiment, a speech conversion model training apparatus 600 is provided, which corresponds to the speech conversion model training method in the above embodiment one by one. As shown in fig. 6, the speech conversion model training apparatus includes an extraction module 601, a decoupling module 602, a reconstruction module 603, a calculation module 604, and a training module 605. The functional modules are described in detail as follows:
extraction module 601: configured to extract voice sample features from a preset voice sample through an encoder; the voice sample features comprise sample content features, sample tone features, sample rhythm features and sample pitch features;
decoupling module 602: configured to decouple the voice sample features based on a preset mask strategy and a preset adversarial network to obtain a sample feature representation, and to calculate the adversarial loss of the decoupling process; the sample feature representation characterizes the enhanced voice sample features;
reconstruction module 603: configured to input the sample feature representation into a generator and train the generator to reconstruct a voice sample Mel spectrogram from the sample feature representation, obtaining a target sample Mel spectrogram;
calculation module 604: configured to calculate the voice reconstruction loss according to the target sample Mel spectrogram and the original sample Mel spectrogram corresponding to the preset voice sample;
training module 605: configured to optimize parameters in the voice conversion model based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model.
In one embodiment, there is also provided a voice conversion apparatus 700, which corresponds to the voice conversion method in the above embodiment one by one. As shown in fig. 7, the voice conversion apparatus includes an extraction module 701, a first conversion module 702, and a second conversion module 703. The functional modules are described in detail as follows:
extraction module 701: configured to extract voice information of a source speaker and a target speaker; the voice information comprises voice content information, tone information, rhythm information and pitch information;
first conversion module 702: configured to input the voice information into a trained voice conversion model for voice conversion to obtain a target Mel spectrogram; the trained voice conversion model is obtained by training with the voice conversion model training method described above;
second conversion module 703: configured to convert the target Mel spectrogram into a waveform by adopting a preset algorithm to obtain the synthesized voice.
The specific limitation of the speech conversion model training device can be referred to the limitation of the speech conversion model training method hereinabove, and the specific limitation of the speech conversion device can be referred to the limitation of the speech conversion method hereinabove, and will not be described herein. The above-mentioned speech conversion model training apparatus and each module in the speech conversion apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a readable storage medium, an internal memory. The readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the execution of an operating system and computer-readable instructions in a readable storage medium. The database of the computer device is used for storing data related to the speech conversion model training method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions when executed by a processor implement a speech conversion model training method. The readable storage medium provided by the present embodiment includes a nonvolatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a readable storage medium, an internal memory. The non-volatile storage medium stores an operating system and computer readable instructions. The internal memory provides an environment for the execution of an operating system and computer-readable instructions in a readable storage medium. The network interface of the computer device is for communicating with an external server via a network connection. The computer readable instructions when executed by a processor implement a speech conversion model training method. The readable storage medium provided by the present embodiment includes a nonvolatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided that includes a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, the processor implementing, when executing the computer readable instructions:
A speech conversion model training method, comprising:
extracting voice sample characteristics from a preset voice sample through an encoder; the voice sample features comprise sample content features, sample tone features, sample rhythm features and sample pitch features;
decoupling the voice sample features based on a preset mask strategy and a preset adversarial network to obtain a sample feature representation, and calculating the adversarial loss of the decoupling process; the sample feature representation characterizes the enhanced voice sample features;
inputting the sample characteristic representation into a generator, and training the generator to reconstruct a speech sample Mel spectrogram according to the sample characteristic representation to obtain a target sample Mel spectrogram;
calculating voice reconstruction loss according to the original sample Mel spectrogram corresponding to the target sample Mel spectrogram and the preset voice sample;
and optimizing parameters in the voice conversion model based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model.
And a voice conversion method, comprising:
extracting voice information of a source speaker and a target speaker; the voice information comprises voice content information, tone information, rhythm information and pitch information;
inputting the voice information into a trained voice conversion model for voice conversion to obtain a target Mel spectrogram; the trained voice conversion model is obtained by training with the voice conversion model training method described above;
and converting the target Mel spectrogram into a waveform by adopting a preset algorithm to obtain the synthesized voice.
In one embodiment, one or more computer-readable storage media are provided having computer-readable instructions stored thereon, the readable storage media provided by the present embodiment including non-volatile readable storage media and volatile readable storage media. The readable storage medium has stored thereon computer readable instructions that when executed by one or more processors implement:
a speech conversion model training method, comprising:
extracting voice sample characteristics from a preset voice sample through an encoder; the voice sample features comprise sample content features, sample tone features, sample rhythm features and sample pitch features;
decoupling the voice sample features based on a preset mask strategy and a preset adversarial network to obtain a sample feature representation, and calculating the adversarial loss of the decoupling process; the sample feature representation characterizes the enhanced voice sample features;
Inputting the sample characteristic representation into a generator, and training the generator to reconstruct a speech sample Mel spectrogram according to the sample characteristic representation to obtain a target sample Mel spectrogram;
calculating voice reconstruction loss according to the original sample Mel spectrogram corresponding to the target sample Mel spectrogram and the preset voice sample;
and optimizing parameters in the voice conversion model based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model.
And a voice conversion method, comprising:
extracting voice information of a source speaker and a target speaker; the voice information comprises voice content information, tone information, rhythm information and pitch information;
inputting the voice information into a trained voice conversion model for voice conversion to obtain a target Mel spectrogram; the trained voice conversion model is obtained by training with the voice conversion model training method described above;
and converting the target Mel spectrogram into a waveform by adopting a preset algorithm to obtain the synthesized voice.
Those skilled in the art will appreciate that implementing all or part of the above described embodiment methods may be accomplished by instructing the associated hardware by computer readable instructions stored on a non-volatile readable storage medium or a volatile readable storage medium, which when executed may comprise the above described embodiment methods. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A speech conversion model training method, characterized in that the speech conversion model training method comprises:
extracting voice sample characteristics from a preset voice sample through an encoder; the voice sample features comprise sample content features, sample tone features, sample rhythm features and sample pitch features;
decoupling the voice sample features based on a preset mask strategy and a preset adversarial network to obtain a sample feature representation, and calculating the adversarial loss of the decoupling process; the sample feature representation characterizes the enhanced voice sample features;
inputting the sample characteristic representation into a generator, and training the generator to reconstruct a speech sample Mel spectrogram according to the sample characteristic representation to obtain a target sample Mel spectrogram;
calculating voice reconstruction loss according to the original sample Mel spectrogram corresponding to the target sample Mel spectrogram and the preset voice sample;
and optimizing parameters in the voice conversion model based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model.
2. The method for training a speech conversion model according to claim 1, wherein the preset adversarial network includes a prediction layer and a gradient reversal layer; the decoupling of the voice sample features based on the preset mask strategy and the preset adversarial network to obtain a sample feature representation, and calculating the adversarial loss of the decoupling process, comprises the following steps:
generating a random mask based on the preset mask policy; the random mask is used for randomly masking one sample characteristic of the sample content characteristic, the sample tone characteristic, the sample rhythm characteristic and the sample pitch characteristic, so that the prediction layer predicts the masked sample characteristic based on three other sample characteristics except the masked sample characteristic;
calculating the adversarial loss based on the random mask and the voice sample features after the random mask is applied;
and decoupling the voice sample features based on the gradient reversal layer and the adversarial loss to obtain the sample feature representation.
3. The speech conversion model training method according to claim 2, wherein the calculating the adversarial loss based on the random mask and the voice sample features comprises:
calculating the adversarial loss according to the following formula:
L_adv = ||(1 - M) · (Z - MAP(M · Z))||,
wherein Z = (Z_r, Z_c, Z_f, Z_u) and M ∈ {(0,1,1,1), (1,0,1,1), (1,1,0,1), (1,1,1,0)};
wherein L_adv refers to the adversarial loss; M refers to the random mask; Z_r refers to the sample rhythm feature, Z_c refers to the sample content feature, Z_f refers to the sample pitch feature, and Z_u refers to the sample tone feature; Z_r, Z_c, Z_f and Z_u are concatenated as vectors; and MAP refers to the mean average precision.
4. The method according to claim 1, wherein the calculating the speech reconstruction loss according to the target sample Mel spectrogram and the original sample Mel spectrogram corresponding to the preset voice sample comprises:
calculating the speech reconstruction loss as follows:
L_recon = ||S - Ŝ||,
wherein L_recon refers to the speech reconstruction loss; S refers to the original sample Mel spectrogram; and Ŝ refers to the target sample Mel spectrogram.
5. The method according to claim 3 or 4, wherein the optimizing the parameters in the speech conversion model based on the adversarial loss and the speech reconstruction loss to obtain a trained speech conversion model comprises:
calculating the model loss as follows:
L = α · L_adv + β · L_recon,
wherein L refers to the model loss, α refers to the weight of the adversarial loss, β refers to the weight of the speech reconstruction loss, and α and β take values in [0, 1];
and when the model loss reaches a preset convergence condition, the voice conversion model converges to obtain a trained voice conversion model.
6. A speech conversion method, the speech conversion method comprising:
extracting voice information of a source speaker and a target speaker; the voice information comprises voice content information, tone information, rhythm information and pitch information;
inputting the voice information into a trained voice conversion model for voice conversion to obtain a target Mel spectrogram; wherein the trained speech conversion model is trained by the speech conversion model training method according to any one of claims 1 to 5;
And converting the target Mel spectrogram into a waveform by adopting a preset algorithm to obtain the synthesized voice.
7. A speech conversion model training apparatus, characterized in that the speech conversion model training apparatus comprises:
an extraction module, configured to extract voice sample features from a preset voice sample through an encoder; the voice sample features comprise sample content features, sample tone features, sample rhythm features and sample pitch features;
a decoupling module, configured to decouple the voice sample features based on a preset mask strategy and a preset adversarial network to obtain a sample feature representation, and to calculate the adversarial loss of the decoupling process; the sample feature representation characterizes the enhanced voice sample features;
a reconstruction module, configured to input the sample feature representation into a generator and train the generator to reconstruct a voice sample Mel spectrogram from the sample feature representation, obtaining a target sample Mel spectrogram;
a calculation module, configured to calculate the voice reconstruction loss according to the target sample Mel spectrogram and the original sample Mel spectrogram corresponding to the preset voice sample;
a training module, configured to optimize parameters in the voice conversion model based on the adversarial loss and the voice reconstruction loss to obtain a trained voice conversion model.
8. A speech conversion apparatus, characterized in that the speech conversion apparatus comprises:
an extraction module, configured to extract voice information of a source speaker and a target speaker; the voice information comprises voice content information, tone information, rhythm information and pitch information;
a first conversion module, configured to input the voice information into a trained voice conversion model for voice conversion to obtain a target Mel spectrogram; wherein the trained speech conversion model is trained by the speech conversion model training method according to any one of claims 1 to 5;
a second conversion module, configured to convert the target Mel spectrogram into a waveform by adopting a preset algorithm to obtain the synthesized voice.
9. A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the computer readable instructions when executed by the processor implement the speech conversion model training method of any of claims 1-5 or the computer readable instructions when executed by the processor implement the speech conversion method of claim 6.
10. One or more readable storage media storing computer-readable instructions that, when executed by a processor, perform the speech conversion model training method of any of claims 1-5, or that, when executed by a processor, perform the speech conversion method of claim 6.
CN202310688583.0A 2023-06-09 2023-06-09 Voice conversion model training method, voice conversion method, device and medium Pending CN116959465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310688583.0A CN116959465A (en) 2023-06-09 2023-06-09 Voice conversion model training method, voice conversion method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310688583.0A CN116959465A (en) 2023-06-09 2023-06-09 Voice conversion model training method, voice conversion method, device and medium

Publications (1)

Publication Number Publication Date
CN116959465A true CN116959465A (en) 2023-10-27

Family

ID=88457297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310688583.0A Pending CN116959465A (en) 2023-06-09 2023-06-09 Voice conversion model training method, voice conversion method, device and medium

Country Status (1)

Country Link
CN (1) CN116959465A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117476240A (en) * 2023-12-28 2024-01-30 中国科学院自动化研究所 Disease prediction method and device with few samples
CN117476240B (en) * 2023-12-28 2024-04-05 中国科学院自动化研究所 Disease prediction method and device with few samples

Similar Documents

Publication Publication Date Title
CN110335587B (en) Speech synthesis method, system, terminal device and readable storage medium
CN112712813B (en) Voice processing method, device, equipment and storage medium
US11355097B2 (en) Sample-efficient adaptive text-to-speech
US11417316B2 (en) Speech synthesis method and apparatus and computer readable storage medium using the same
CN111326168B (en) Voice separation method, device, electronic equipment and storage medium
CN112466314A (en) Emotion voice data conversion method and device, computer equipment and storage medium
CN116030792B (en) Method, apparatus, electronic device and readable medium for converting voice tone
CN111444382A (en) Audio processing method and device, computer equipment and storage medium
CN116959465A (en) Voice conversion model training method, voice conversion method, device and medium
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN113362804B (en) Method, device, terminal and storage medium for synthesizing voice
CN113077783B (en) Method and device for amplifying small language speech corpus, electronic equipment and storage medium
CN113761841B (en) Method for converting text data into acoustic features
CN112712789B (en) Cross-language audio conversion method, device, computer equipment and storage medium
CN115171666A (en) Speech conversion model training method, speech conversion method, apparatus and medium
CN116469359A (en) Music style migration method, device, computer equipment and storage medium
Lee et al. Simple gated convnet for small footprint acoustic modeling
CN114743539A (en) Speech synthesis method, apparatus, device and storage medium
CN113990347A (en) Signal processing method, computer equipment and storage medium
CN113112969A (en) Buddhism music score recording method, device, equipment and medium based on neural network
CN112509559A (en) Audio recognition method, model training method, device, equipment and storage medium
CN112687262A (en) Voice conversion method and device, electronic equipment and computer readable storage medium
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN113409769B (en) Data identification method, device, equipment and medium based on neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination