CN112530403A - Voice conversion method and system based on semi-parallel corpus - Google Patents

Voice conversion method and system based on semi-parallel corpus

Info

Publication number
CN112530403A
CN112530403A (application CN202011460130.5A; granted as CN112530403B)
Authority
CN
China
Prior art keywords
encoder
speaker
decoder
training
tts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011460130.5A
Other languages
Chinese (zh)
Other versions
CN112530403B (en)
Inventor
吴梦玥
徐志航
陈博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangming Daily
Shanghai Jiaotong University
Original Assignee
Guangming Daily
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangming Daily, Shanghai Jiaotong University filed Critical Guangming Daily
Priority to CN202011460130.5A
Publication of CN112530403A
Application granted
Publication of CN112530403B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N20/00: Machine learning
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L13/00: Speech synthesis; Text to speech systems
                    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
                        • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
                    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
                        • G10L13/10: Prosody rules derived from text; Stress or intonation
                • G10L15/00: Speech recognition
                    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063: Training
                • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/003: Changing voice quality, e.g. pitch or formants
                        • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
                            • G10L21/013: Adapting to target pitch

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure relates to a scheme for training a speech conversion model, comprising: in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training them on the text and acoustic feature data of a speaker; in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder on the acoustic features of a speaker to determine its initialization network parameters; and in a VC training stage, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker and a target speaker to determine their final network parameters.

Description

Voice conversion method and system based on semi-parallel corpus
Technical Field
The present application relates to the field of voice conversion, and in particular, to a method and system for voice conversion based on semi-parallel corpora.
Background
Voice Conversion (VC) refers to changing the original speaker identity in an utterance to that of a specific target speaker by altering the timbre and pitch of the voice, without changing the semantic content of the speech. Voice conversion technology is widely used in speech signal processing and has very broad application prospects in personalized speech synthesis, pronunciation assistance, speech enhancement, multimedia entertainment and similar fields. With the maturity of deep neural networks, voice conversion has fully entered the neural-network era, and conversion performance has improved markedly.
Depending on the available training data, voice conversion can be divided into parallel-corpus and non-parallel-corpus approaches. In the parallel-corpus setting, the training corpora of the original speaker and the target speaker contain utterances with the same text content; the non-parallel setting has no such requirement.
Voice conversion techniques based on parallel corpora fall into two types:
1. Parallel corpora of different lengths are first aligned to the same length by dynamic time warping, and a conversion network is then trained with fixed-length sequence models such as DNNs or LSTMs.
2. A sequence-to-sequence conversion method is used, in which the model learns the correspondence between the original and target feature sequences through an attention mechanism, thereby handling variable-length modeling.
There are three different lines of speech conversion technology based on non-parallel corpora:
1. phoneme posterior probability graph method (Phonetic PosteriorGrams, PPGs)
The core idea of this approach is to use a speaker independent feature as an intermediate feature to mediate between the original and target acoustic features. The intermediate features can be extracted from the voice of any original speaker through the extractor of the speaker independent features, and then the voice conversion can be realized only by training a mapping model from the speaker independent features to the acoustic features of the target speaker. The most intuitive speaker independent feature is a text feature, so the text uses the phoneme posterior probability map corresponding to each frame as an intermediate feature, and uses an Automatic Speech Recognition (ASR) system as an extractor of the feature.
2. Adversarial training method
The adversarial training method mainly refers to a line of work represented by the Cycle-Consistent Generative Adversarial Network (CycleGAN). A CycleGAN-based voice conversion method was proposed in 2017. It is based on dual learning and contains two mutually dual generative models; connecting the two dual models in series yields two cycle-reconstruction paths, while discriminators constrain the reconstructed intermediate results, enabling unsupervised training. In the test stage, only one of the four models, a single generator, is needed as the conversion model, and the conversion process does not differ essentially from a standard voice conversion method.
3. Variational autoencoder method
The Variational Autoencoder (VAE) consists of an encoder and a decoder: the encoder converts the input acoustic features into speaker-independent latent variables, and the decoder restores those latent variables to the encoder input. VAE-based voice conversion rests on an information-extraction assumption: each frame of acoustic features contains both speaker information and speaker-independent information, and the encoder should extract as much speaker-independent information as possible from each frame; the KL-divergence constraint in the VAE is, in essence, an attempt to remove speaker information from the acoustic features.
However, each of the voice conversion techniques described above has its own drawbacks:
From the task perspective, sequence-to-sequence conversion methods generally require more training data; the cost and difficulty of collecting parallel corpora are high, a large amount of parallel data is hard to obtain, and this is impractical in real use. In addition, the attention-based conversion method in the parallel-corpus family is prone to semantic errors because of the instability of the attention mechanism.
Among the non-parallel-corpus methods, the phonetic posteriorgram method and the variational autoencoder method are built on the ideas of decoupling and information extraction, so information leakage easily occurs and the converted timbre is not similar to the target. For the adversarial training methods, GAN training is unstable due to the particularities of the model structure.
These disadvantages arise for different reasons, for example:
Semantic errors in the attention-based conversion method of the parallel-corpus family are mainly caused by the instability of attention itself.
Among the non-parallel-corpus methods, the converted timbre of the phonetic posteriorgram and variational autoencoder methods is not similar to the target because strict decoupling cannot be guaranteed during information extraction: there is no way to ensure that the extracted speaker-independent information contains none of the original speaker's information, so the timbre transferred to the target speaker deviates. The adversarial training methods cannot guarantee that the model will converge, because of the particularities of the GAN, so training is very unstable and hard to adapt to all data types.
There is therefore a need for a robust voice conversion technique.
Disclosure of Invention
The application relates to a speech conversion technology based on semi-parallel corpora.
According to an aspect of the present application, there is provided a method for training a speech conversion model, comprising: in a TTS (speech synthesis) pre-training stage, determining initialization network parameters of a VC decoder and a reference encoder by training a TTS encoder, the VC decoder and the reference encoder on the text and acoustic feature data of a speaker; in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder on the acoustic features of a speaker to determine its initialization network parameters; and in a VC training stage, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker and a target speaker to determine their final network parameters.
According to another aspect of the present application, there is provided a speech conversion system comprising a TTS encoder, a reference encoder, a VC encoder and a VC decoder, wherein the speech conversion system is configured to: in a TTS pre-training stage, determine initialization network parameters of the VC decoder and the reference encoder by training the TTS encoder, the VC decoder and the reference encoder using speaker text data; in a VC pre-training stage, initialize and fix the network parameters of the VC decoder and the reference encoder, and train the VC encoder on the acoustic features of a speaker to determine its initialization network parameters; and in a VC training stage, initialize the network parameters of the VC encoder, and train the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker to determine their final network parameters.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
FIG. 1 is an example block diagram of a speech conversion platform according to one embodiment of the present application.
FIG. 2 is an example flow of a method for training a speech conversion model according to one embodiment of the present application.
Detailed Description
As mentioned above, both the existing parallel corpus based speech conversion techniques and the non-parallel corpus based speech conversion techniques have their own drawbacks.
To address the data problem in parallel-corpus voice conversion, the most direct remedy is to record large amounts of parallel corpora at high cost to meet the requirements of sequence-to-sequence training. For non-parallel voice conversion, on the other hand, the most common approach is to carefully tune adversarial-training or information-bottleneck models through extensive adjustment of network parameters. Either solution requires complex additional steps to ensure accurate and stable conversion, which typically means higher hardware costs, heavier training, and more latency before the desired result is obtained. Since voice conversion itself demands high real-time performance, these solutions cannot truly satisfy its requirements.
The present application therefore proposes a voice conversion technique based on semi-parallel corpora, a scenario closer to real conditions. The technique is also sequence-to-sequence, so it can fully draw on current mainstream TTS technology and use a strong TTS model structure to initialize the VC model parameters, greatly reducing the model's need for parallel data. Moreover, training on parallel data in the final stage is in fact a small-data adaptation step, which further reduces the required data volume, and training converges very quickly.
In general, the robust semi-parallel corpus-based speech conversion technique proposed by the present application mainly includes the following three stages:
in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training them on the text and acoustic feature data of a speaker;
in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder on the acoustic features of a speaker to determine its initialization network parameters; and
in a VC training stage, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker and a target speaker to determine their final network parameters.
In this way, the decoder of the speech synthesis model is pre-trained on non-parallel training data, the pre-trained decoder is used to further train the acoustic-feature encoder, and finally a small amount of parallel corpus is used to migrate the initialized model to the voice conversion task through adaptive learning, thereby alleviating the data-volume problem of sequence-to-sequence training.
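For orientation, the following is a minimal sketch of how the three stages could be scheduled in PyTorch. The module and loader names (tts_encoder, vc_encoder, vc_decoder, ref_encoder, the data loaders and step functions) are hypothetical and not taken from the patent text; the learning rates follow the values given later in the description.
```python
import torch

def run_stage(update_modules, frozen_modules, loader, step_fn, lr):
    """Generic training stage: freeze some modules, update the rest with Adam."""
    for m in update_modules:
        m.requires_grad_(True)                       # modules being trained in this stage
    for m in frozen_modules:
        m.requires_grad_(False)                      # keep pre-trained parts fixed
    params = [p for m in update_modules for p in m.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr)
    for batch in loader:
        loss = step_fn(batch)                        # forward pass + MSE reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1 (TTS pre-training on non-parallel text/audio):
#   run_stage([tts_encoder, vc_decoder, ref_encoder], [], tts_loader, tts_step, lr=1e-3)
# Stage 2 (VC pre-training, acoustic auto-encoding, decoder and reference encoder fixed):
#   run_stage([vc_encoder], [vc_decoder, ref_encoder], audio_loader, vc_step, lr=1e-3)
# Stage 3 (VC training on a small parallel corpus, all parameters updated):
#   run_stage([vc_encoder, vc_decoder, ref_encoder], [], parallel_loader, vc_step, lr=1e-4)
```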
Meanwhile, by using an independent duration model to control the output features after conversion, the present application alleviates, to a certain extent, the semantic-error problem of attention-based conversion and improves conversion accuracy. In particular, from the perspective of non-parallel voice conversion, ground-truth durations are introduced during training to help compress and extract information, rather than a KL-divergence or fixed-length downsampling method, which ensures the stability of information extraction and the quality of the converted speech.
With the above overview in mind, an exemplary block diagram of a speech conversion platform according to one embodiment of the present application is described below with reference to FIG. 1.
As shown in the figure, the voice conversion platform mainly comprises data sources 101(1), 101(2), ..., 101(n) that provide various data, and a voice conversion system 110. The speech conversion system 110 obtains the required data from the data sources over, for example, a wireless or wired connection, depending on operational needs. Wireless connections may include networks such as the Internet, WLAN, cellular networks, Bluetooth, NFC, Wi-Fi, and the like. Wired connections may include cables, USB connections, USB-C connections, and the like. If necessary, data may also be supplied to the voice conversion system 110 through a removable storage medium (e.g., a floppy disk, hard disk, optical disc, or USB flash drive).
For example, in an initial stage, when the speech conversion model in the speech conversion system 110 needs to be trained, TTS pre-training may be performed by first obtaining the text of one or more speakers (e.g., a speaker's speech script) from data source 101(1); audio data of the original speaker can then be obtained from another data source 101(2) and fed to the speech conversion model for VC (pre-)training. In the use phase, real-time speech of the original speaker to be converted (e.g., captured by a microphone) may be obtained from data source 101(n) and converted into speech of the target speaker. Thus, depending on the task, the speech conversion system interfaces with the corresponding data source to obtain the required input data.
Next, as shown, the speech conversion system 110 may include an input module 111, a speech conversion model, and an output module 117. The voice conversion model mainly comprises: TTS encoder 112, VC encoder 113, duration module 114, VC decoder 115, and reference encoder 116. These portions are in data communication with each other via wired or wireless connections.
The input module 111 is primarily configured to receive various data required by the speech conversion model from a data source.
The TTS encoder 112 is primarily configured, during the TTS training phase, to encode the received text sequence of the original speaker into output context information that provides semantic information.
The VC encoder 113 is primarily configured, during the VC (pre-)training phase, to encode the received acoustic features of the original speaker (extracted from the original speaker's audio data) into output context information that provides semantic information.
The duration module 114 is primarily configured to up-sample a text-length sequence of context vectors to the length of the acoustic feature sequence, or to down-sample the acoustic feature sequence to the length of the text sequence, so as to reflect the speaker's prosodic information. The duration module 114 may include an ASR model that extracts ground-truth durations from the speech data to provide prosody information during training, and a duration prediction network that provides predicted durations during testing.
VC decoder 115 is primarily configured to decode context information from the encoder back into acoustic features.
The reference encoder 116 is primarily configured to extract the speaker's timbre information from the speaker's reference audio data, which helps distinguish between different speakers.
The output module 117 is mainly configured to output the audio data of the target speaker converted by the speech conversion model to the user, for example, by providing the audio data to a speaker, a loudspeaker, a headphone, or the like to directly play the audio data.
It should be appreciated that in order to provide good voice conversion services, the voice conversion model must be trained and tested before formal voice conversion is performed. Only a properly trained speech conversion model can provide satisfactory conversion results. The scheme of the application aims at improving the training scheme of the voice conversion model.
With the system architecture of the speech conversion system of the present application in mind, an example flow of a method for training a speech conversion model in accordance with one embodiment of the present application is described below in conjunction with FIG. 2.
Before describing the training scheme, some basic concepts of voice conversion are introduced to aid understanding.
Basic concept
An utterance contains several types of information. Based on the characteristics of speech and human pronunciation, it can be assumed that speech contains at least three types of information: semantic information (content), speaker timbre information (timbre), and prosodic information (rhythm). In general, these can be written as a triplet (c, t, r), where c, t and r denote the semantic, timbre and prosodic information respectively. For ease of distinction, they may carry the subscript "src" to indicate that the information comes from the original speaker or "trg" to indicate that it comes from the target speaker. The voice conversion task takes the input speech of the original speaker, (c_src, t_src, r_src), strips the speaker timbre t_src from the triplet and replaces it with the target speaker's timbre t_trg while preserving the semantic information c_src, thereby constructing a new triplet (c_src, t_trg, r_avg) that realizes the conversion. There is generally no strict requirement that the prosody of the original speaker or of the target speaker must be used; here, following common practice, a certain average prosody r_avg is assumed for the experiments, which may be a statistical average of the prosodic information of many speakers. Under these requirements, the most important properties of a good voice conversion system are: while preserving the original speaker's semantics c_src, remove the original speaker's timbre t_src as thoroughly as possible, and restore high-quality acoustic features by adding the target speaker's timbre t_trg.
Based on this view of voice conversion, the present application designs a new training method for a voice conversion model based on semi-parallel corpora. In summary, a sequence-to-sequence voice conversion model is pre-trained on a large amount of non-parallel corpus, and the pre-trained model is then adaptively trained with a small amount of parallel corpus, so that a high-quality voice conversion model can be obtained quickly.
First, as illustrated in FIG. 1, the speech conversion model can generally be divided into four parts: an encoder, a reference encoder, a duration module and a decoder. This structure can describe both a speech synthesis framework and a voice conversion framework. To distinguish TTS from VC within this structure, the encoder and decoder are further named by task according to the features they model: a TTS encoder 112 that encodes text into context information, a VC encoder 113 that encodes speech features into context information, and a VC decoder 115 that decodes context information back into speech features. Note that no TTS decoder is mentioned here, since a text autoencoder is of little use for synthesis or voice conversion.
As shown in fig. 2, the training method of the voice conversion model can be divided into three stages, namely, a TTS pre-training stage, a VC pre-training stage, and a VC training stage.
First, in the TTS pre-training stage, suppose that for a certain training speaker spk the input module receives that speaker's text data from the corresponding data source, and the TTS encoder encodes the received text into an encoder hidden representation H_tts that contains the textual context and provides the semantic information c_spk. Meanwhile, the model receives the speaker's timbre information t_spk, which the reference encoder produces from the reference features, plus the prosodic information r_spk provided by the duration information obtained from the duration module. From the triplet (c_spk, t_spk, r_spk) the acoustic features of speaker spk are reconstructed, thereby training the TTS encoder and VC decoder networks. Through TTS pre-training, a conventional multi-speaker TTS model based on an independent duration model is obtained, and it is also ensured that the pre-trained VC decoder can synthesize high-quality audio when given a correct hidden representation H.
Second, in the VC pre-training stage, the network parameters of the VC decoder are initialized and fixed. Because the model has already been TTS pre-trained with a large amount of non-parallel corpus (text) before the VC decoder is initialized, the VC decoder's initialization in this stage is no longer random but lies in a comparatively optimized region, which greatly reduces the need for parallel data in subsequent training.
The VC encoder is then trained in an autoencoder fashion. Similarly to the previous step, the VC encoder receives the acoustic features O_spk of speaker spk from the data source via the input module and encodes them into an encoder hidden representation H_vc that provides the semantic information c_spk. Together with the speaker timbre t_spk provided by the reference encoder output and the prosodic information r_spk provided by the duration information from the duration module, the triplet (c_spk, t_spk, r_spk) is used to reconstruct the acoustic features of speaker spk. Fixing the VC decoder forces the VC encoder to encode its input into a hidden representation with the same output distribution as the TTS encoder, i.e., to keep the distributions of H_tts and H_vc consistent. Keeping the distributions consistent has two benefits: it ensures that H_vc contains enough semantic information to reconstruct high-quality acoustic features, and, because H_tts is produced from plain text input and therefore cannot contain speaker timbre information, it constrains H_vc not to capture the speaker's timbre, reducing the leakage of speaker information.
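As a concrete illustration of the fixing described above, the sketch below loads stage-1 parameters and freezes the VC decoder and reference encoder so that only the VC encoder receives gradients; the file names and module variables are assumptions, not part of the patent.
```python
import torch

def setup_vc_pretraining(vc_encoder, vc_decoder, ref_encoder,
                         decoder_ckpt="vc_decoder_tts_pretrain.pt",
                         ref_ckpt="ref_encoder_tts_pretrain.pt",
                         lr=1e-3):
    """Initialize from TTS pre-training and freeze decoder/reference encoder."""
    vc_decoder.load_state_dict(torch.load(decoder_ckpt))
    ref_encoder.load_state_dict(torch.load(ref_ckpt))
    for module in (vc_decoder, ref_encoder):
        for p in module.parameters():
            p.requires_grad = False            # fixed during VC pre-training
    return torch.optim.Adam(vc_encoder.parameters(), lr=lr)
```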
In another embodiment, this training process is still possible without fixing the VC decoder. However, there is then no guarantee that the VC encoder output H_vc contains no timbre information t_spk, so speaker information may leak from the VC encoder, which is detrimental to the VC training of the third step.
Third, in the VC training stage, on the basis of the initialized VC encoder and VC decoder obtained from the two pre-training steps, the pre-trained VC decoder, VC encoder and reference encoder are quickly re-trained with a small amount of parallel corpus to obtain the final voice conversion model.
The pre-training-based method introduces more prior knowledge, and greatly reduces the data volume requirement of the sequence-to-sequence speech conversion model, so that the good speech conversion model can be obtained by training more easily and quickly.
For the concrete choice of encoder and decoder in the above scheme, one of the most popular TTS model structures can be used: a Transformer-based encoder and decoder. The Transformer is a network structure that replaces traditional sequence models (such as the LSTM) and can fully extract the context of a sequence through self-attention and multi-head attention. It should be understood that the Transformer-based encoder and decoder are only one example of a specific model; the model is not limited to the Transformer, and many other models are equally applicable to the voice conversion scheme of the present application.
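As an illustration only, a Transformer-based context encoder of the kind mentioned here might look as follows in PyTorch; the dimensions and layer counts are assumptions rather than values from the patent.
```python
import torch
from torch import nn

class ContextEncoder(nn.Module):
    """Encode a text-embedding or acoustic-feature sequence into context vectors H."""
    def __init__(self, input_dim=80, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                  # x: (batch, length, input_dim)
        return self.encoder(self.proj(x))  # context representation, (batch, length, d_model)
```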
Meanwhile, to increase the amount of data for model pre-training, some embodiments introduce training data from multiple speakers; the speaker embedding information (i.e., speaker timbre information) distinguishes different speakers and is an important input that lets the encoder fully learn speaker-independent information. The scheme uses an attention-based method to extract phoneme-level speaker embeddings: for the input reference audio, the correspondence between the encoder output and the reference audio is computed, phoneme-level speaker embeddings are extracted, spliced onto the encoder output, and then fed to the decoder. For the reference features input to the reference encoder, the training phase uses several sentences randomly selected from all sentences of the current speaker (the number determined by GPU memory); in the testing phase, as many sentences of the target speaker as possible are input to achieve a better timbre-extraction effect.
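The phoneme-level, attention-based extraction of speaker embeddings could be sketched as below; the use of a single multi-head attention layer and the tensor shapes are assumptions for illustration.
```python
import torch
from torch import nn

class PhonemeLevelReferenceEncoder(nn.Module):
    """Attend from encoder outputs to reference-audio frames and splice the result."""
    def __init__(self, ref_dim=80, d_model=256, n_heads=4):
        super().__init__()
        self.ref_proj = nn.Linear(ref_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, encoder_out, ref_feats):
        # encoder_out: (B, L_tokens, d_model) queries; ref_feats: (B, L_ref, ref_dim) keys/values
        ref = self.ref_proj(ref_feats)
        spk_emb, _ = self.attn(encoder_out, ref, ref)      # phoneme-level speaker embedding
        return torch.cat([encoder_out, spk_emb], dim=-1)   # splice onto the encoder output
```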
In addition to providing an improvement of pre-training prior to formally training the speech conversion model, the present application also provides an improved duration model.
As described above, although the prosodic information r is not strictly constrained in the information hypothesis, it does affect the voice conversion system. Incomplete or inaccurate extraction of the original speaker's prosody may cause semantic errors in the converted speech. In some scenarios it may also be desirable to convert the speaker's prosody during conversion so that the converted audio follows the target speaker's speaking rhythm.
To address these problems, the present application introduces a separate duration model into the speech conversion model. The duration model is highly correlated with prosody: it models the length of the acoustic features corresponding to a token in the current context. Prosodic information is therefore extracted from the input speech through an independent duration model, and new prosodic information is supplied to re-synthesize audio during reconstruction or conversion.
Generally, in the training phase the duration information of the training data is obtained from the pre-trained ASR model in the duration module; in the testing phase the durations cannot be obtained directly, so, as in TTS, a duration prediction network implemented with a bidirectional LSTM is trained at the same time to provide duration information.
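A bidirectional-LSTM duration predictor of the kind described here might be sketched as follows; the hidden size and the choice to predict one scalar per token are assumptions.
```python
import torch
from torch import nn

class DurationPredictor(nn.Module):
    """Predict the number of acoustic frames for each text token."""
    def __init__(self, d_model=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(d_model, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, encoder_out):          # encoder_out: (batch, L_text, d_model)
        h, _ = self.lstm(encoder_out)
        return self.out(h).squeeze(-1)       # predicted durations, (batch, L_text)
```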
The duration module is formed by applying the duration model to the current sequence-to-sequence model.
The duration model has two important functions: up-sampling and down-sampling. Upsampling refers to expanding a text sequence into a sequence of feature lengths (which may also be referred to as "duration expansion"), while downsampling refers to downsampling a feature sequence into a sequence of text lengths.
During up-sampling, the text-length sequence of context vectors is expanded by repetition into a sequence of the same length as the acoustic features, and information such as positional encoding is added. This is a very common structure in end-to-end speech synthesis; the up-sampling brings good stability in the TTS task, helps alleviate the synthesis instability caused by the very small amount of parallel data in the VC task, and improves the naturalness of the converted audio.
During down-sampling, the acoustic feature sequence is segmented according to the durations, and each variable-length feature segment is converted into a fixed-length vector by a pooling operation, forming a feature sequence of the same length as the text.
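The two operations can be sketched as follows for a single utterance; average pooling is used here as one possible pooling choice, which is an assumption.
```python
import torch

def upsample_by_duration(h_text, durations):
    """Duration expansion: repeat each token vector durations[i] times.
    h_text: (L_text, D); durations: (L_text,) integer frame counts."""
    return torch.repeat_interleave(h_text, durations, dim=0)       # (sum(durations), D)

def downsample_by_duration(h_frames, durations):
    """Pool each variable-length frame segment back to one vector per token."""
    segments = torch.split(h_frames, durations.tolist(), dim=0)
    return torch.stack([seg.mean(dim=0) for seg in segments])      # (L_text, D)
```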
Similarly, the roles of the duration model in the various stages are described below in conjunction with the three stages of training the speech conversion model described above.
First step: in the TTS pre-training stage, the text and the corresponding durations naturally separate out the prosodic information, so the duration model only needs to up-sample the TTS encoder output H_tts (i.e., duration expansion).
Second step: in the VC pre-training stage, the duration model first down-samples the VC encoder output, which also separates out the prosodic information, i.e., the VC encoder output H_vc is down-sampled to the text length and then up-sampled.
Third step: in the VC training stage, the duration model performs the same operations as in the second step.
In some embodiments, if prosodic conversion is also considered in the speech conversion, the duration prediction network of the target speaker may be trained to obtain duration information that conforms to the prosodic rhythm of the target speaker for upsampling.
Having described an exemplary flow of a method for training a speech conversion model, for ease of understanding, a specific flow of the method is described below in connection with a specific application example.
First, two data sets are defined: a small parallel corpus data set D_VC = {D_src, D_trg} and a multi-speaker non-parallel corpus data set D_TTS = {D_spk1, D_spk2, ...}. All data of a speaker spk can be expressed as a set of data pairs {(text_spk, audio_spk)}. The audio data is denoted O after acoustic feature extraction. src and trg respectively denote the original speaker and the target speaker to be converted, so the parallel corpus data set after acoustic feature extraction can be written as {O_src, O_trg}. To support training and testing of the duration module, the duration information, denoted u_spk, may be extracted with a pre-trained ASR model, so that the data set of each speaker can finally be written as D_spk = {(text_spk, O_spk, u_spk)}.
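The data convention above can be mirrored directly in code; the field names and types below are illustrative assumptions only.
```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class Utterance:
    """One element of D_spk = {(text_spk, O_spk, u_spk)}."""
    text: List[str]        # phoneme or character sequence text_spk
    feats: np.ndarray      # acoustic features O_spk, shape (frames, feat_dim)
    durations: np.ndarray  # per-token frame counts u_spk from a pre-trained ASR aligner

d_tts: Dict[str, List[Utterance]] = {}                       # multi-speaker non-parallel set D_TTS
d_vc: Dict[str, List[Utterance]] = {"src": [], "trg": []}    # small parallel set D_VC
```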
As mentioned above, the main idea of the whole training process is to use a large amount of easily available non-parallel corpus data D_TTS in the pre-training stage to learn initialization network parameters (or a parameter region) as the initial parameters of the speech conversion model, and then to adaptively adjust the model with the small, limited parallel corpus data set D_VC to obtain the final voice conversion model. Compared with the traditional scheme of training a voice conversion model directly from random parameters, the pre-trained model greatly reduces the need for parallel data during the subsequent formal training. Training of the whole model is divided into the following three steps.
1. TTS pre-training stage: the D_TTS data set is used to train a multi-speaker TTS model and thereby complete the pre-training of the VC decoder. TTS pre-training specifically comprises the following operations:
a) Randomly initialize the network parameters of the TTS encoder, the VC decoder and the reference encoder.
b) The text sequence text_spk of speaker spk in the D_TTS data set is processed by the text front-end and input into the TTS encoder, which encodes it to obtain the TTS encoder output H_tts.
c) The reference features of reference audio of speaker spk are input into the reference encoder to obtain the speaker embedding information (e.g., timbre information in the form of a representation variable), which is spliced along the feature dimension with the TTS encoder output H_tts (variable-matrix concatenation) to obtain H_tts'.
d) The spliced TTS encoder output H_tts' is up-sampled by the duration module to obtain the up-sampled TTS encoder output H_extend.
e) The up-sampled H_extend is input into the VC decoder network and decoded to obtain the prediction Ô_spk.
f) The error between the acoustic features O_spk of the speaker in the data pairs and the prediction Ô_spk is computed with the mean squared error (MSE) loss; the gradient is back-propagated, and the network parameters of the TTS encoder, the VC decoder and the reference encoder are updated with an Adam optimizer, an initial learning rate of 1e-3 and the Noam learning-rate decay schedule, until the error converges. After TTS pre-training is completed, the TTS encoder is discarded, and the trained model parameters of the VC decoder and the reference encoder are kept as initialization network parameters.
g) text_spk is input to the duration prediction network to output the duration prediction û_spk.
h) The error between the duration information u_spk and the duration prediction û_spk is computed, the gradient is back-propagated, and the network parameters of the duration prediction network are updated until convergence.
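One TTS pre-training step matching operations b) through f), together with the Noam learning-rate decay, might be implemented roughly as below; the module interfaces and the exact Noam scaling are assumptions.
```python
import torch
from torch import nn

def noam_factor(step, d_model=256, warmup=4000):
    """Noam decay factor, to be combined with the initial learning rate of 1e-3."""
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)

def tts_pretrain_step(batch, tts_encoder, ref_encoder, duration_module, vc_decoder, optimizer):
    h_tts = tts_encoder(batch["text"])                            # b) encode text to H_tts
    h_cat = ref_encoder(h_tts, batch["ref_feats"])                # c) splice speaker embedding
    h_ext = duration_module.upsample(h_cat, batch["durations"])   # d) duration expansion
    o_hat = vc_decoder(h_ext)                                     # e) decode to acoustic features
    loss = nn.functional.mse_loss(o_hat, batch["feats"])          # f) MSE against O_spk
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(params, lr=1e-3)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_factor)
```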
2. VC pre-training stage: the network is initialized and fixed with the VC decoder parameters trained in the first step, and then all audio data O_spk in the D_TTS data set are used. VC pre-training specifically comprises the following operations:
a) The network is initialized with the VC decoder and reference encoder parameters trained in the first step, and the network parameters of the VC encoder are randomly initialized.
b) The acoustic features O_spk of speaker spk are input into the VC encoder for encoding, and the encoding result is down-sampled by the duration module to obtain the down-sampled VC encoder output H_vc1.
c) The reference features of reference audio of speaker spk are input into the reference encoder to obtain the speaker embedding information, which is spliced along the feature dimension with the down-sampled VC encoder output to obtain the spliced VC encoder output H_vc1'.
d) The spliced VC encoder output H_vc1' is up-sampled by the duration module to obtain the up-sampled VC encoder output H_extend1.
e) The up-sampled H_extend1 is input into the VC decoder network and decoded to obtain the prediction Ô_spk.
f) The error between the acoustic features O_spk of the speaker and the prediction Ô_spk is computed with the mean squared error (MSE) loss; the gradient is back-propagated, and, with an Adam optimizer, an initial learning rate of 1e-3 and the Noam learning-rate decay schedule, the VC decoder and reference encoder are kept fixed while the network parameters of the VC encoder are updated until the error converges.
3. VC training stage: the network is initialized with the VC encoder parameters from the second training step, and all acoustic features {O_src, O_trg} extracted from all audio data in the D_VC data set are used to adaptively train all model parameters. VC training specifically comprises the following operations:
a) The network is initialized with the VC encoder parameters trained in the second step.
b) The acoustic features O_src of the original speaker are input into the VC encoder for encoding, and the encoding result is down-sampled by the duration module to obtain the down-sampled VC encoder output H_vc2.
c) The reference features of reference audio of the target speaker trg are input into the reference encoder to obtain the speaker embedding information, which is spliced along the feature dimension with the down-sampled VC encoder output H_vc2 to obtain the spliced H_vc2'.
d) The spliced VC encoder output H_vc2' is up-sampled by the duration module to obtain the up-sampled VC encoder output H_extend2.
e) The up-sampled VC encoder output H_extend2 is input into the VC decoder network and decoded to obtain the predicted target result Ô_trg.
f) The error between the acoustic features O_trg of the target speaker and the predicted target result Ô_trg is computed with the mean squared error (MSE) loss; the gradient is back-propagated, and, with a fixed learning rate of 1e-4 and an Adam optimizer, all network parameters of the VC encoder, the VC decoder and the reference encoder are updated until convergence.
The method of training a speech conversion model according to the present application ends.
After training is completed, the trained speech conversion model typically goes through a testing phase before being used in a voice conversion task, so as to verify and further improve its conversion performance.
The test-phase flow can be expressed similarly: the test audio of an original speaker is input, VC encoding and decoding are performed, and the target converted audio is output. The specific operations are as follows:
a) The network is initialized with the parameters of the VC decoder, the VC encoder and the reference encoder trained according to the training method above.
b) The test acoustic features O_srctest extracted from the test audio of the original speaker are input into the VC encoder for encoding, and the result is down-sampled by the duration module to obtain the down-sampled VC encoder output H_vctest.
c) The reference features of as much reference audio of the target speaker trg as possible are input into the reference encoder to obtain the speaker embedding information, which is spliced along the feature dimension with the down-sampled VC encoder output H_vctest to obtain the spliced VC encoder output H_vctest'.
d) The trained duration prediction network takes the test text text_srctest of the original speaker as input to obtain the predicted duration information û_srctest.
e) Using the predicted duration information û_srctest, the spliced encoder output H_vctest' is up-sampled by the duration module to obtain the up-sampled VC encoder output H_extendtest.
f) The up-sampled H_extendtest is input into the VC decoder network and decoded to obtain the prediction Ô_trgtest.
g) Ô_trgtest is input into the vocoder to obtain the converted target test audio data.
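Steps a) through g) of the test phase could be strung together roughly as below. The module interfaces, and in particular the align() helper used to obtain source-side durations for down-sampling, are hypothetical and only illustrate the order of operations.
```python
import torch

@torch.no_grad()
def convert(o_src_test, ref_feats_trg, text_src, vc_encoder, ref_encoder,
            duration_module, duration_predictor, vc_decoder, vocoder):
    h = vc_encoder(o_src_test)                                   # b) encode source acoustics
    u_src = duration_module.align(o_src_test, text_src)          # source durations (hypothetical helper)
    h = duration_module.downsample(h, u_src)                     # b) down-sample to text length
    h = ref_encoder(h, ref_feats_trg)                            # c) splice target-speaker embedding
    u_hat = duration_predictor(h).round().clamp(min=1).long()    # d) predicted durations
    h = duration_module.upsample(h, u_hat)                       # e) duration expansion
    o_hat = vc_decoder(h)                                        # f) predicted target features
    return vocoder(o_hat)                                        # g) waveform of the converted speech
```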
In general, the training scheme of the present application for a voice conversion model based on semi-parallel corpora places no high demands on the data volume of the original and target speakers, and the directly trained TTS model can even help generate parallel corpora. The core idea is that, through pre-training of the TTS and VC autoencoding networks, the amount of parallel training data needed between the original and target speakers can be reduced dramatically, greatly lowering the data requirement of the sequence-to-sequence conversion model while still producing high-quality, highly reliable professional audio.
Although the embodiments described above use a separate duration model between the encoder and decoder, it is certainly possible to use the attention mechanism that is more common in sequence-to-sequence tasks, at the cost of slightly worse conversion. By comparison, because of the particular nature of TTS/VC and similar time-aligned conversion tasks, replacing the usual attention mechanism with the hard alignment of an independent duration model helps the voice conversion model synthesize more robust speech and gives a better user experience.
It should be noted that the scheme of the present application is mainly an improvement of the training procedure of the speech conversion model, not an innovation in the encoding and decoding techniques of the encoder and decoder themselves. The data processing methods in the above procedures, such as feature extraction, encoding, decoding, splicing, gradient back-propagation and convergence, therefore have counterparts in existing voice conversion technology and are not described in detail here. Likewise, the mean squared error (MSE) loss, the fixed learning rate of 1e-4, the Adam optimizer and so on described in the above embodiments are merely example algorithms that may be used to implement the described data processing; other algorithms may also be used.
The foregoing describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous. Moreover, those skilled in the relevant art will recognize that the embodiments can be practiced with various modifications in form and detail without departing from the spirit and scope of the present disclosure, as defined by the appended claims. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (10)

1. A method for training a speech conversion model, comprising:
in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training the TTS encoder, the VC decoder and the reference encoder by using the text and acoustic feature data of a speaker;
in a VC pre-training stage, initializing and fixing network parameters of the VC decoder and the reference encoder, and training the VC encoder by using acoustic characteristics of a speaker to determine the initialized network parameters of the VC encoder; and
in a VC training stage, initializing network parameters of the VC encoder, and training the VC encoder, the VC decoder and the reference encoder by using acoustic characteristics of an original speaker and a target speaker to determine final network parameters of the VC encoder, the VC decoder and the reference encoder which are pre-trained.
2. The method of claim 1, wherein, during the TTS pre-training phase:
randomly initializing network parameters of a TTS encoder, a VC decoder and a reference encoder;
encoding the text sequence of the speaker into TTS encoder output by the TTS encoder;
encoding reference characteristics based on the reference audio of the speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the output of the TTS encoder;
upsampling the spliced TTS encoder output by a duration module to obtain an upsampled TTS encoder output;
inputting an upsampled TTS encoder output to the VC decoder for decoding into a prediction; and
calculating the error between the acoustic features of the speaker and the prediction result, back-propagating the gradient, and updating the network parameters of the VC decoder and the reference encoder until convergence.
3. The method of claim 2, further comprising, during said TTS pre-training phase, the steps of:
outputting a duration prediction value by inputting the text sequence of the speaker to a duration prediction network;
and calculating the error between the duration information and the duration prediction value, back-propagating the gradient, and updating the network parameters of the duration prediction network until convergence.
4. The method of claim 3, wherein, in the VC pre-training phase:
using the network parameters of the VC decoder and the reference encoder trained in the TTS pre-training stage to perform network initialization and fix, and randomly initializing the network parameters of the VC encoder;
encoding the acoustic features of the speaker by the VC encoder, and down-sampling the encoding result with the duration module to obtain the down-sampled VC encoder output;
encoding reference features based on the reference audio of the speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
upsampling the spliced VC encoder output by the duration module to obtain an upsampled VC encoder output;
inputting an upsampled VC encoder output to the VC decoder for decoding into another prediction result; and
calculating the error between the acoustic features of the speaker and the other prediction result, back-propagating the gradient, and updating the network parameters of the VC encoder until convergence.
5. The method of claim 4, wherein, in the VC training phase:
performing network initialization using the network parameters of the VC encoder trained in a VC pre-training phase;
encoding the acoustic features of the original speaker by the VC encoder, and down-sampling the encoding result with the duration module to obtain the down-sampled VC encoder output;
encoding reference characteristics of a reference audio based on a target speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
upsampling the spliced VC encoder output by the duration module to obtain an upsampled VC encoder output;
inputting the upsampled VC encoder output to the VC decoder for decoding into a predicted target result; and
calculating the error between the acoustic features of the target speaker and the predicted target result, back-propagating the gradient, and updating the network parameters of the VC encoder, the VC decoder and the reference encoder until convergence.
6. The method of claim 5, wherein the method further comprises a testing phase in which:
network initialization is carried out on the trained network parameters of the VC decoder, the VC encoder and the reference encoder;
encoding the acoustic features extracted from the test audio of the original speaker by the VC encoder, and down-sampling the encoding result with the duration module to obtain the down-sampled VC encoder output;
encoding reference characteristics of a reference audio based on a target speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
converting the test text of the original speaker into duration information through a trained duration prediction network;
upsampling the spliced VC encoder output by a duration module using the duration information to obtain an upsampled VC encoder output;
inputting the upsampled VC encoder output to the VC decoder for decoding into a predicted target result; and
inputting the predicted target result into a vocoder to obtain converted target test audio data.
7. The method of claim 6, wherein the speaker can include multiple speakers, and the speaker embedded information can be used to distinguish between different speakers.
8. The method of claim 1, wherein, according to the characteristics of the speaker's utterance, the speaker's speech includes three kinds of information: semantic information, speaker timbre information and prosodic information, wherein the VC encoder output can provide the semantic information, the speaker embedding information from the reference encoder can provide the speaker timbre information, and the duration module can provide the prosodic information.
9. A speech conversion system comprising a TTS encoder, a reference encoder, a VC encoder, and a VC decoder, wherein the speech conversion system is configured to:
in a TTS pre-training stage, determining initial network parameters of the VC decoder and the reference encoder by training the TTS encoder, the VC decoder, and the reference encoder using text data of a speaker;
in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder using acoustic features of the speaker to determine initial network parameters of the VC encoder; and
in a VC training stage, initializing the network parameters of the VC encoder, and training the VC encoder, the VC decoder, and the reference encoder using acoustic features of an original speaker to determine the final network parameters of the pre-trained VC encoder, reference encoder, and VC decoder.
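A compact way to view the three-stage schedule of claim 9 is as a table of which modules are trainable in each stage; the sketch below encodes that schedule together with a generic freeze/unfreeze helper. Module names, the dictionary layout, and the mention of target-speaker features (used as the training target per claim 5) are illustrative, not claimed implementation details.

```python
import torch.nn as nn

# Trainable modules and training data per stage (illustrative names).
STAGES = {
    "tts_pretrain": {"trainable": ["tts_encoder", "vc_decoder", "reference_encoder"],
                     "data": "text data of the speaker"},
    "vc_pretrain":  {"trainable": ["vc_encoder"],
                     "data": "acoustic features of the speaker"},
    "vc_train":     {"trainable": ["vc_encoder", "vc_decoder", "reference_encoder"],
                     "data": "acoustic features of the original (and target) speaker"},
}

def set_stage(modules: dict[str, nn.Module], stage: str) -> None:
    """Freeze every module, then unfreeze the ones listed for the given stage."""
    trainable = set(STAGES[stage]["trainable"])
    for name, module in modules.items():
        for p in module.parameters():
            p.requires_grad_(name in trainable)
```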
10. The speech conversion system of claim 9, wherein a duration module is configured to up-sample a text-length sequence of context vectors to a sequence with the same length as the acoustic feature sequence, or to down-sample the acoustic feature sequence to the length of the text sequence; and
the reference encoder is configured to extract speaker embedding information from reference audio data of speakers so as to distinguish between different speakers.
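A toy numeric example of the length bookkeeping that claim 10 assigns to the duration module, assuming up-sampling by frame repetition and down-sampling by segment averaging (one common realization; the claim itself does not fix the exact operator):

```python
import torch

durations = torch.tensor([3, 1, 2])          # three text tokens covering 6 acoustic frames in total

# Up-sampling: a text-length sequence of context vectors -> a frame-length sequence.
context = torch.randn(3, 8)                                     # (N=3 tokens, D=8)
frames = torch.repeat_interleave(context, durations, dim=0)     # (6, 8), matches the acoustic length
assert frames.size(0) == int(durations.sum())

# Down-sampling: a frame-length acoustic feature sequence -> a text-length sequence.
acoustic = torch.randn(6, 8)
segments = torch.split(acoustic, durations.tolist(), dim=0)
tokens = torch.stack([s.mean(dim=0) for s in segments])         # back to (3, 8)
assert tokens.size(0) == durations.numel()
```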
CN202011460130.5A 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus Active CN112530403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011460130.5A CN112530403B (en) 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011460130.5A CN112530403B (en) 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus

Publications (2)

Publication Number Publication Date
CN112530403A true CN112530403A (en) 2021-03-19
CN112530403B CN112530403B (en) 2022-08-26

Family

ID=74999231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011460130.5A Active CN112530403B (en) 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus

Country Status (1)

Country Link
CN (1) CN112530403B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020027619A1 (en) * 2018-08-02 2020-02-06 네오사피엔스 주식회사 Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN111883149A (en) * 2020-07-30 2020-11-03 四川长虹电器股份有限公司 Voice conversion method and device with emotion and rhythm
CN112037754A (en) * 2020-09-09 2020-12-04 广州华多网络科技有限公司 Method for generating speech synthesis training data and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄国捷 et al. (Huang Guojie et al.): "Enhanced Variational Autoencoder for Non-Parallel Corpus Voice Conversion" (《增强变分自编码器做非平行语料语音转换》), 《信号处理》 (Journal of Signal Processing) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345452A (en) * 2021-04-27 2021-09-03 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model
CN113345452B (en) * 2021-04-27 2024-04-26 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model
CN113450759A (en) * 2021-06-22 2021-09-28 北京百度网讯科技有限公司 Voice generation method, device, electronic equipment and storage medium
CN113436609A (en) * 2021-07-06 2021-09-24 南京硅语智能科技有限公司 Voice conversion model and training method thereof, voice conversion method and system
CN113436609B (en) * 2021-07-06 2023-03-10 南京硅语智能科技有限公司 Voice conversion model, training method thereof, voice conversion method and system
CN113781996A (en) * 2021-08-20 2021-12-10 北京淇瑀信息科技有限公司 Speech synthesis model training method and device and electronic equipment
CN113781996B (en) * 2021-08-20 2023-06-27 北京淇瑀信息科技有限公司 Voice synthesis model training method and device and electronic equipment
CN115910002A (en) * 2023-01-06 2023-04-04 之江实验室 Audio generation method, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112530403B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN112530403B (en) Voice conversion method and system based on semi-parallel corpus
US11587569B2 (en) Generating and using text-to-speech data for speech recognition models
CN109147758B (en) Speaker voice conversion method and device
CN113439301B (en) Method and system for machine learning
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
JP6989951B2 (en) Speech chain device, computer program and DNN speech recognition / synthesis mutual learning method
Renduchintala et al. Multi-modal data augmentation for end-to-end ASR
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
KR20230156121A (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
Luong et al. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN111508470A (en) Training method and device of speech synthesis model
CN112002302B (en) Speech synthesis method and device
CN113077783A (en) Method and device for amplifying Chinese speech corpus, electronic equipment and storage medium
Du et al. VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
Gong et al. ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
Li et al. Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
Heymans et al. Efficient acoustic feature transformation in mismatched environments using a Guided-GAN
CN117636842B (en) Voice synthesis system and method based on prosody emotion migration
CN112802462B (en) Training method of sound conversion model, electronic equipment and storage medium
CN114333900B (en) Method for extracting BNF (BNF) characteristics end to end, network model, training method and training system
WO2023102932A1 (en) Audio conversion method, electronic device, program product, and storage medium
CN116913255A (en) Voice model training method, voice recognition device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant