CN112530403B - Voice conversion method and system based on semi-parallel corpus - Google Patents


Info

Publication number
CN112530403B
CN112530403B (application number CN202011460130.5A)
Authority
CN
China
Prior art keywords
encoder
speaker
decoder
tts
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011460130.5A
Other languages
Chinese (zh)
Other versions
CN112530403A (en)
Inventor
吴梦玥
徐志航
陈博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangming Daily
Shanghai Jiaotong University
Original Assignee
Guangming Daily
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangming Daily and Shanghai Jiaotong University
Priority to CN202011460130.5A
Publication of CN112530403A
Application granted
Publication of CN112530403B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure relates to a scheme for training a speech conversion model, comprising: in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training them on the text and acoustic feature data of a speaker; in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder on the acoustic features of a speaker to determine the initialization network parameters of the VC encoder; and in a VC training stage, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker and a target speaker to determine their final network parameters.

Description

Voice conversion method and system based on semi-parallel corpus
Technical Field
The present application relates to the field of speech conversion, and in particular, to a speech conversion method and system based on semi-parallel corpora.
Background
Voice conversion (VC) refers to changing the original speaker identity in an utterance to that of a specific target speaker by altering the timbre and pitch of the voice while leaving its semantic content unchanged. Voice conversion technology is widely used in speech signal processing and has very broad application prospects in personalized speech synthesis, pronunciation assistance, speech enhancement, multimedia entertainment and other fields. With the maturation of deep neural networks, voice conversion has fully entered the neural network era, and conversion performance has improved significantly.
Depending on the available training data, voice conversion can be divided into parallel-corpus-based and non-parallel-corpus-based conversion: in the parallel case, the training corpora of the original speaker and the target speaker share the same text content, whereas the non-parallel case has no such requirement.
The speech conversion technology based on parallel corpora is divided into two types:
1. Parallel corpora of different lengths are aligned to the same length through dynamic time warping, and a conversion network is then trained with fixed-length sequence models such as DNNs and LSTMs.
2. A sequence-to-sequence conversion method is used, in which the model learns the alignment between the original feature sequence and the target feature sequence through an attention mechanism, thereby modeling variable-length sequences.
There are three different lines of speech conversion technology based on non-parallel corpora:
1. Phonetic posteriorgram method (Phonetic PosteriorGrams, PPGs)
The core idea of this method is to use a speaker-independent feature as an intermediate representation between the original and target acoustic features. The intermediate features can be extracted from the voice of any original speaker by a speaker-independent feature extractor, and voice conversion then only requires training a mapping model from the speaker-independent features to the acoustic features of the target speaker. The most intuitive speaker-independent feature is a text feature, so this method uses the phonetic posteriorgram of each frame as the intermediate feature and uses an automatic speech recognition (ASR) system as the feature extractor.
2. Adversarial training method
The adversarial training method mainly refers to a series of works represented by the cycle-consistent generative adversarial network (CycleGAN). A CycleGAN-based voice conversion method was proposed in 2017. It is based on dual learning and comprises two generative models that are dual to each other; connecting the two dual models in series yields two cycle-reconstruction paths, while a discriminator constrains the reconstructed intermediate results, enabling unsupervised training. In the testing stage, only one generator among the four models is needed as the conversion model, and the conversion process is essentially the same as that of a standard voice conversion method.
3. Variational autoencoder method
A variational autoencoder (VAE) consists of two models, an encoder and a decoder: the encoder converts the input acoustic features into speaker-independent latent variables, and the decoder restores the latent variables to the encoder's input. VAE-based voice conversion rests on an information-extraction assumption: each frame of acoustic features contains both speaker information and speaker-independent information, and the encoder should extract as much speaker-independent information as possible from each acoustic feature vector; the KL-divergence constraint in the VAE is, in essence, an attempt to remove speaker information from the acoustic features. A minimal sketch of this idea is given below.
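The following sketch only illustrates the frame-level VAE idea described above; the feature and embedding dimensions, the module names and the re-injection of a speaker embedding at the decoder are assumptions for illustration, not the model of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameVAE(nn.Module):
    """Frame-level VAE: encoder -> assumed speaker-independent latent, decoder -> reconstruction."""
    def __init__(self, feat_dim=80, spk_dim=64, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        # a speaker embedding is concatenated back in so the decoder can restore timbre
        self.dec = nn.Sequential(nn.Linear(latent_dim + spk_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, x, spk_emb):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)        # reparameterization
        recon = self.dec(torch.cat([z, spk_emb], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # pushes z toward the prior
        return recon, F.mse_loss(recon, x) + kl
```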
However, each of the above voice conversion techniques has its own drawbacks, as follows:
From a task perspective, sequence-to-sequence conversion generally requires more training data; collecting parallel corpora is costly and difficult, so it is hard to gather a large amount of parallel data, which is impractical in real use. In addition, attention-based conversion methods in the parallel-corpus family are prone to semantic errors due to the instability of the attention mechanism.
Among the non-parallel-corpus methods, the phonetic posteriorgram method and the variational autoencoder method are based on decoupling and information extraction, so information leakage easily occurs and the converted timbre is not similar to the target. The adversarial-training-based method suffers from unstable GAN training because of the particularity of the model structure.
These disadvantages may be caused by various reasons, for example:
The semantic errors of the attention-based conversion method in the parallel-corpus family are mainly caused by the instability of attention itself.
Among the non-parallel-corpus methods, the timbre converted by the phonetic posteriorgram method and the variational autoencoder method is not similar to the target because strict decoupling cannot be guaranteed during information extraction: there is no way to ensure that the extracted speaker-independent information contains none of the original speaker's information, so the timbre transferred to the target speaker is biased. The adversarial-training-based method cannot guarantee convergence because of the particularity of the GAN network, so training is very unstable and the method is difficult to adapt to all data types.
There is therefore a need for a robust speech conversion technique.
Disclosure of Invention
The application relates to a speech conversion technology based on semi-parallel corpora.
According to an aspect of the present application, there is provided a method for training a speech conversion model, comprising: in a TTS (speech synthesis) pre-training stage, determining initialization network parameters of a VC decoder and a reference encoder by training a TTS encoder, the VC decoder and the reference encoder on the text and acoustic feature data of a speaker; in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder on the acoustic features of a speaker to determine the initialization network parameters of the VC encoder; and in a VC training stage, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker and a target speaker to determine their final network parameters.
According to another aspect of the present application, there is provided a speech conversion system comprising a TTS encoder, a reference encoder, a VC encoder and a VC decoder, wherein the speech conversion system is configured to: in a TTS pre-training stage, determine initialization network parameters of the VC decoder and the reference encoder by training the TTS encoder, the VC decoder and the reference encoder using speaker text data; in a VC pre-training stage, initialize and fix the network parameters of the VC decoder and the reference encoder, and train the VC encoder using the acoustic features of a speaker to determine the initialization network parameters of the VC encoder; and in a VC training stage, initialize the network parameters of the VC encoder, and train the pre-trained VC encoder, reference encoder and VC decoder using the acoustic features of an original speaker to determine their final network parameters.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
FIG. 1 is an example block diagram of a speech conversion platform according to one embodiment of the present application.
FIG. 2 is an example flow of a method for training a speech conversion model according to one embodiment of the present application.
Detailed Description
As mentioned above, both the existing parallel corpus based speech conversion techniques and the non-parallel corpus based speech conversion techniques have their own drawbacks.
To address the data problem of parallel-corpus-based speech conversion, the most direct approach is to record a large amount of parallel corpus at high cost in order to meet the demands of training a sequence-to-sequence model. For non-parallel-based speech conversion, on the other hand, the most common approach is to fine-tune adversarial-training or information-bottleneck models through extensive network-parameter tuning. However, either solution requires complex additional steps to ensure accurate and stable speech conversion. This typically means the user must bear higher hardware cost, heavier training processes and longer latency to obtain the desired conversion results. Yet voice conversion itself is a technology with high real-time requirements, so the above solutions cannot truly meet its needs.
Therefore, the present application provides a speech conversion technique based on semi-parallel corpora, a scenario much closer to real conditions. Moreover, the technique is also sequence-to-sequence, so it can fully draw on current mainstream TTS technology and initialize the VC model parameters with a well-performing TTS model structure, greatly reducing the model's demand for parallel data. Furthermore, the training on parallel data in the final stage is in effect a small-data adaptation method, which substantially reduces the required amount of data and makes training converge very quickly.
In general, the robust semi-parallel corpus-based speech conversion technique proposed by the present application mainly includes the following three stages:
in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training them on the text and acoustic feature data of a speaker;
in a VC pre-training stage, initializing and fixing network parameters of the VC decoder and the reference encoder, and training the VC encoder by using acoustic characteristics of a speaker to determine the initialized network parameters of the VC encoder; and
in a VC training stage, initializing network parameters of the VC encoder, and training the VC encoder, the VC decoder and the reference encoder by using acoustic characteristics of an original speaker and a target speaker to determine final network parameters of the VC encoder, the VC decoder and the reference encoder which are pre-trained.
In this way, the decoder of the speech synthesis model is pre-trained with non-parallel training data, the pre-trained decoder is then used to further train the acoustic-feature encoder, and finally a small amount of parallel corpus is used to migrate from this initialization to the speech conversion task for adaptive learning, which solves the data-volume problem of training a sequence-to-sequence model.
Meanwhile, by using an independent duration model to control the output features after voice conversion, the present application alleviates, to a certain extent, the semantic-error problem of attention-based conversion methods and improves conversion accuracy. In particular, from the perspective of non-parallel-corpus voice conversion, real durations are introduced during training to aid information compression and extraction, rather than a KL divergence or a fixed-length down-sampling method, which ensures the stability of information extraction and the quality of the finally converted sound.
With the above overview in mind, an exemplary block diagram of a speech conversion platform according to one embodiment of the present application is described below with reference to FIG. 1.
As shown in the figure, the voice conversion platform mainly includes data sources 101(1), 101(2), …, 101(n) for providing various data, and a voice conversion system 110. The speech conversion system 110 obtains the desired data from the various data sources via, for example, a wireless or wired connection, depending on operational requirements. The wireless connection may include a network, such as the Internet, a WLAN, a cellular network, Bluetooth, NFC, Wi-Fi, and the like. The wired connection may include a cable, a USB connection, a USB Type-C connection, and the like. Of course, if necessary, various data may also be input into the voice conversion system 110 through, for example, a removable storage medium (e.g., a floppy disk, a hard disk, an optical disc, a USB flash drive), and the like.
For example, in an initial stage, when the speech conversion model in the speech conversion system 110 needs to be trained, TTS pre-training may be performed on the speech conversion model by first obtaining text of one or more speakers (e.g., a speech script of the speaker) from the data source 101 (1); the audio data of the original speaker can then be retrieved from another data source 101(2) for input to a speech conversion model for VC (pre) training. While in the use phase, real-time speech data of the original speaker that needs to be converted (e.g., real-time speech data obtained from a microphone) may be obtained from the data source 101(n) to perform speech conversion to the speech data of the target speaker. Thus, depending on the task, the speech conversion system may interface with the corresponding data source to obtain the desired input data.
Next, as shown, the speech conversion system 110 may include an input module 111, a speech conversion model, and an output module 117. The voice conversion model mainly comprises: a TTS encoder 112, a VC encoder 113, a duration module 114, a VC decoder 115, and a reference encoder 116. These parts are in data communication with each other via wired or wireless connections.
The input module 111 is primarily configured to receive various data required by the speech conversion model from a data source.
The TTS encoder 112 is primarily configured to encode output context information from the received text sequence of the original speaker during the TTS training phase, the context information providing semantic information.
VC encoder 113 is primarily configured to encode, during a VC (pre) training phase, output context information based on received acoustic features of the original speaker (the acoustic features being extracted from the audio data of the original speaker), the context information providing semantic information.
The duration module 114 is primarily configured to up-sample a text-length sequence of context vectors to the same length as the acoustic feature length or down-sample the acoustic feature sequence to the length of the text sequence to reflect prosodic information of the speaker. The duration module 114 may include an ASR model that extracts the true duration from the speech data to provide prosody information during the training phase and a duration prediction network that uses the predicted duration to provide prosody information during the testing phase.
VC decoder 115 is primarily configured to decode context information from the encoder back into acoustic features.
The reference encoder 116 is primarily configured to extract the speaker's timbre information from the speaker's reference audio data to help distinguish between different speakers.
The output module 117 is mainly configured to output the audio data of the target speaker converted by the speech conversion model to the user, for example, by providing the audio data to a speaker, a loudspeaker, a headphone, or the like to directly play the audio data.
It should be appreciated that in order to provide good voice conversion services, the voice conversion model must be trained and tested before formal voice conversion is performed. Only a properly trained speech conversion model can provide satisfactory conversion results. The scheme of the application aims at improving the training scheme of the voice conversion model.
With the system architecture of the speech conversion system of the present application in mind, an example flow of a method for training a speech conversion model in accordance with one embodiment of the present application is described below in conjunction with FIG. 2.
Before introducing the training scheme, some basic concepts in voice conversion are introduced to facilitate understanding by the skilled person.
Basic concept
An utterance contains various types of information; based on the characteristics of human speech and pronunciation, it can be assumed that an utterance contains at least three types of information: semantic information (content), speaker timbre information (timbre), and prosodic information (rhythm). In general, a triplet (c, t, r) may be used, where c, t and r respectively represent semantic information, timbre information and prosodic information. For distinction, they may also carry the subscripts "src" or "trg", where "src" indicates that the information comes from the original speaker and "trg" indicates that it comes from the target speaker. The speech conversion task takes the original speaker's speech (c_src, t_src, r_src) as input, strips the speaker timbre information t_src from the triplet and replaces it with the target speaker's timbre information t_trg while retaining the semantic information c_src, thus constructing a new triplet (c_src, t_trg, r_avg) to implement the voice conversion process. For the prosodic information, there is generally no strict requirement that the prosody of the original speaker or of the target speaker must be used. Here, following common practice, a certain average prosody r_avg can be assumed for the experiments; the average prosody may be a statistical average of the prosodic information of multiple people. Under such requirements, the most important demand on a good voice conversion system is: on the premise of preserving the original speaker's semantics c_src, remove the original speaker's timbre information t_src as much as possible, and restore high-quality acoustic features by adding the target speaker's timbre information t_trg. This substitution can be written compactly as follows.
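The display below merely restates the triplet substitution in the notation just introduced; writing r_avg as an average over N speakers is one possible reading of the "statistical average" mentioned above.

```latex
\[
(c_{\mathrm{src}},\; t_{\mathrm{src}},\; r_{\mathrm{src}})
\;\xrightarrow{\;\mathrm{VC}\;}\;
(c_{\mathrm{src}},\; t_{\mathrm{trg}},\; r_{\mathrm{avg}}),
\qquad
r_{\mathrm{avg}} \approx \frac{1}{N}\sum_{i=1}^{N} r_{i}.
\]
```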
Based on this basic idea of voice conversion, the present application designs a new training method for a voice conversion model based on semi-parallel corpora. In short, the sequence-to-sequence voice conversion model is pre-trained with a large amount of non-parallel corpus, and the pre-trained model is then adaptively trained with a small amount of parallel corpus, so that a high-quality voice conversion model can be obtained quickly.
First, as illustrated in FIG. 1, the speech conversion model can generally be divided into four parts: an encoder, a reference encoder, a duration module and a decoder. This structure can describe both the speech synthesis framework and the voice conversion framework. Within this structure, to distinguish the two different tasks, TTS and VC, the encoder and decoder can be further named by task according to what they model: a TTS encoder 112 that encodes text into context information, a VC encoder 113 that encodes speech features into context information, and a VC decoder 115 that decodes the context information back into speech features. Note that no TTS decoder is mentioned here, since a text-based autoencoder is of little significance for synthesis or voice conversion.
As shown in fig. 2, the training method of the speech conversion model may be divided into three stages, namely, a TTS pre-training stage, a VC pre-training stage, and a VC training stage.
First, in the TTS pre-training stage, suppose that for a certain training speaker spk the input module receives the speaker's text data from the corresponding data source, and the TTS encoder encodes the received text into an encoder hidden representation H_tts containing text context information, which provides the semantic information c_spk. Meanwhile, the reference encoder outputs the speaker's timbre information t_spk encoded from the reference features, and the duration information obtained from the duration model provides the prosodic information r_spk; through the triplet (c_spk, t_spk, r_spk) the acoustic features of speaker spk can be reconstructed, thereby training the TTS encoder and VC decoder networks. Thus, TTS pre-training yields a conventional multi-speaker TTS model based on an independent duration model, and it also ensures that the pre-trained VC decoder can synthesize correspondingly high-quality audio when given a correct hidden representation H.
Second, in the VC pre-training stage, the network parameters of the VC decoder are initialized and fixed. Because the voice conversion model has already been TTS-pre-trained with a large amount of non-parallel corpus (such as text) before the VC decoder is initialized, the initialization of the VC decoder in the VC pre-training stage is no longer random but lies within a relatively well-optimized range, which greatly reduces the demand for parallel data in subsequent model training.
The VC encoder is then trained in an autoencoder fashion. Similarly to the previous step, the VC encoder receives the acoustic features O_spk of speaker spk from the data source via the input module as input, and encodes them into the encoder hidden representation H_vc, which provides the semantic information c_spk. The output of the reference encoder provides the speaker's timbre information t_spk, and the duration information obtained from the duration module provides the prosodic information r_spk; through the triplet (c_spk, t_spk, r_spk) the acoustic features of speaker spk can be reconstructed. Fixing the VC decoder forces the VC encoder to encode the input features into a hidden representation with the same output distribution as the TTS encoder, i.e., to keep the distributions of H_tts and H_vc consistent. Keeping the distributions consistent has two benefits: first, it ensures that H_vc contains enough semantic information to reconstruct high-quality acoustic features; second, since H_tts is produced from pure text input and therefore cannot contain the speaker's timbre information, it constrains H_vc not to extract the speaker's timbre information, reducing the effect of speaker-information leakage.
In another embodiment, such a training process is still possible without fixing the VC decoder. However, this cannot guarantee that the output H_vc of the VC encoder contains no timbre information t_spk, so speaker information may leak from the VC encoder, which is detrimental to the VC training of the third step.
Third, in the VC training stage, on the basis of the VC encoder and VC decoder initializations obtained from the previous two pre-training steps, the pre-trained VC decoder, VC encoder and reference encoder are quickly trained again with a small amount of parallel corpus to obtain the corresponding voice conversion model.
This pre-training-based method introduces more prior knowledge and greatly reduces the data-volume requirement of the sequence-to-sequence speech conversion model, so that a good speech conversion model can be trained more easily and quickly.
For the specific choice of the encoder and decoder models involved in the above scheme, one of the most popular TTS model structures can be used: a Transformer-based encoder and a Transformer-based decoder. The Transformer is a network structure that replaces traditional sequence-modeling networks (such as the LSTM) and can fully extract the context information of a sequence through self-attention and multi-head attention. It should be understood that the Transformer-based encoder and decoder are only one concrete example and the model is not limited to the Transformer; many other models are equally applicable to the speech conversion scheme of the present application.
Meanwhile, to increase the amount of data for model pre-training, in some embodiments training data of multiple speakers may be introduced, with speaker embedding information (i.e., the speakers' timbre information) used to distinguish different speakers, so that the encoder can fully learn speaker-independent information as an important input. The scheme uses an attention-based method to extract phoneme-level speaker embedding information: for the input reference audio, the correspondence between the encoder output and the reference audio is computed, phoneme-level speaker embedding information is extracted, the extracted speaker embedding is concatenated onto the encoder output, and the result is then input into the decoder. For the reference features input to the reference encoder, the training phase uses several sentences randomly selected from all sentences of the current speaker (the number being determined by the available GPU memory); in the testing phase, as many sentences of the target speaker as possible are input to achieve a better timbre extraction effect. A sketch of this attention-based extraction is given below.
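The following sketch illustrates phoneme-level speaker-embedding extraction by attention pooling over reference-audio frames; the dimensions, head count and module names are illustrative assumptions rather than the exact network of this application.

```python
import torch
import torch.nn as nn

class PhonemeLevelReferenceEncoder(nn.Module):
    """Attention pooling: one speaker embedding per phoneme-level encoder output."""
    def __init__(self, feat_dim=80, enc_dim=256, emb_dim=256):
        super().__init__()
        self.ref_proj = nn.Linear(feat_dim, emb_dim)    # keys/values from reference frames
        self.query_proj = nn.Linear(enc_dim, emb_dim)   # queries from encoder outputs
        self.attn = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)

    def forward(self, enc_out, ref_feats):
        # enc_out:   (B, T_text, enc_dim)  phoneme-level encoder outputs
        # ref_feats: (B, T_ref, feat_dim)  frames of one or more reference utterances
        q = self.query_proj(enc_out)
        kv = self.ref_proj(ref_feats)
        spk_emb, _ = self.attn(q, kv, kv)   # (B, T_text, emb_dim): one embedding per phoneme
        return spk_emb                      # concatenated onto enc_out before decoding

# usage: H_cat = torch.cat([enc_out, ref_enc(enc_out, ref_feats)], dim=-1)
```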
In addition to providing an improvement of pre-training prior to formally training the speech conversion model, the present application also provides an improved duration model.
As described above, although the prosody information r in the information hypothesis is not strictly limited at the time of voice conversion, the prosody information r actually has some influence on the voice conversion system. Incomplete or inaccurate extraction of prosodic information from the original speaker may result in semantic errors in the converted speech. And in some specific situations, it may also be desirable to convert the prosodic information of the speaker during the speech conversion process so that the converted audio has the speaking rhythm of the target speaker.
In order to solve the series of problems, the application introduces a separate duration model in the speech conversion model. The duration model and prosodic information are highly correlated, which models the length of the acoustic feature corresponding to a word in the current context. Therefore, we extract prosody information from the input speech through an independent duration model and input new prosody information to re-synthesize audio upon reconstruction or conversion.
Generally speaking, in the training phase the duration information of the training data can be obtained through the pre-trained ASR model in the duration module, while in the testing phase the durations cannot be obtained directly; therefore, as in TTS, a duration prediction network implemented with a bidirectional LSTM is trained at the same time to provide the duration information, as sketched below.
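A minimal sketch of such a duration prediction network follows; the layer sizes and the choice to predict durations directly as frame counts are assumptions.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Bidirectional LSTM predicting one duration (in frames) per phoneme-level vector."""
    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)

    def forward(self, phone_seq):
        # phone_seq: (B, T_text, in_dim) phoneme-level encoder outputs
        out, _ = self.lstm(phone_seq)
        return self.proj(out).squeeze(-1)   # (B, T_text) predicted frame counts

# training target: durations u_spk extracted by the pre-trained ASR model,
# e.g. loss = nn.MSELoss()(predictor(phone_seq), u_spk.float())
```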
The duration module is formed by applying the duration model to the current sequence-to-sequence model.
The duration model has two important functions: up-sampling and down-sampling. Upsampling refers to expanding a text sequence into a sequence of feature lengths (which may also be referred to as "duration expansion"), while downsampling refers to downsampling a feature sequence into a sequence of text lengths.
During up-sampling, the text-length context vector sequence is repeatedly expanded into a sequence of the same length as the acoustic features, and information such as positional encoding is added. This is a very common structure in end-to-end speech synthesis; the up-sampling brings good stability to the TTS task, helps mitigate the synthesis instability caused by the small amount of parallel data in the VC task, and improves the naturalness of the finally converted audio.
During down-sampling, the acoustic feature sequence is segmented according to the durations, and each variable-length feature segment is converted into a fixed-length vector by a pooling operation, forming a feature sequence of the same length as the text. Both operations are sketched below.
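The sketch below shows the two operations on a single utterance; mean pooling for the down-sampling step is an assumption (the text only specifies "a pooling operation"), and batching is omitted for clarity.

```python
import torch

def upsample(phone_vecs, durations):
    # phone_vecs: (T_text, D) context vectors, durations: (T_text,) integer frame counts
    return torch.repeat_interleave(phone_vecs, durations, dim=0)      # (sum(durations), D)

def downsample(frame_vecs, durations):
    # frame_vecs: (T_frames, D) with T_frames == durations.sum()
    segments = torch.split(frame_vecs, durations.tolist(), dim=0)     # duration-aligned segments
    return torch.stack([seg.mean(dim=0) for seg in segments])         # (T_text, D)

# example
vecs = torch.randn(3, 8)            # 3 phonemes, 8-dim context vectors
durs = torch.tensor([2, 4, 3])      # frames per phoneme
frames = upsample(vecs, durs)       # shape (9, 8)
phones = downsample(frames, durs)   # shape (3, 8)
```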
Similarly, the roles of the duration model in the various stages are described below in conjunction with the three stages of training the speech conversion model described above.
The first step: in the TTS pre-training stage, the text and the corresponding durations naturally separate out the prosodic information, so the duration model only needs to up-sample the TTS encoder output H_tts (i.e., duration expansion).
The second step: in the VC pre-training stage, the duration model first down-samples the VC encoder output H_vc to the text length, which likewise separates out the prosodic information, and then up-samples it.
The third step: in the VC training phase, the duration model performs the same operation as the second step.
In some embodiments, if prosodic conversion is also considered in the speech conversion, the duration prediction network of the target speaker may be trained to obtain duration information that conforms to the prosodic rhythm of the target speaker for upsampling.
Having described an exemplary flow of a method for training a speech conversion model, for ease of understanding, a specific flow of the method is described below in connection with a specific application example.
First, two data sets are defined: a small parallel corpus data set D_VC = {D_src, D_trg} and a multi-speaker non-parallel corpus data set D_TTS = {D_spk1, D_spk2, …}. All data of a speaker spk can be expressed as a set of data pairs (text_spk, audio_spk). The audio data is represented as O after acoustic feature extraction. src and trg denote the original speaker and the target speaker that are ultimately to be converted, so the parallel corpus data set after acoustic feature extraction can be represented as {O_src, O_trg}. To train and test the duration module, the duration information, denoted u_spk, may be extracted using a pre-trained ASR model. Finally, the data set of each speaker can be represented as D_spk = {(text_spk, O_spk, u_spk)}.
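As a concrete illustration of this convention (the field names and file layout are assumptions), the data sets might be organized as follows:

```python
# One utterance of speaker "spk": text, extracted acoustic features O, and durations u.
utt = {"text": "...", "feats": "O_spk_0001.npy", "durations": "u_spk_0001.npy"}

D_TTS = {           # large multi-speaker non-parallel corpus
    "spk1": [utt],  # list of (text, O, u) items per speaker
    "spk2": [],
}
D_VC = {            # small parallel corpus: src/trg utterances share the same text
    "src": [],
    "trg": [],
}
```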
As mentioned above, the main idea of the whole training process is to use a large amount of easily available non-parallel corpus data D_TTS in the pre-training stage to learn initialization network parameters (or a parameter range) as initial parameters of the speech conversion model, and then to use the small, limited parallel corpus data set D_VC to adaptively adjust the speech conversion model into the final model. Compared with the traditional scheme of training a voice conversion model directly from randomly initialized parameters, the pre-trained model greatly reduces the demand for parallel data in the subsequent formal training. The training of the whole model is divided into the following three steps.
1. TTS pre-training stage: the D_TTS data set is used to train a multi-speaker TTS model, completing the pre-training of the VC decoder. TTS pre-training specifically comprises the following operations:
a) Randomly initialize the network parameters of the TTS encoder, the VC decoder and the reference encoder.
b) Preprocess the text sequence text_spk of speaker spk in the D_TTS data set and input it into the TTS encoder, which encodes it into the TTS encoder output H_tts.
c) Input reference features of reference audio of speaker spk into the reference encoder to obtain speaker embedding information (e.g., timbre information in the form of a representation vector), and concatenate it along the feature dimension (also called variable-matrix splicing) with the TTS encoder output H_tts to obtain H_tts'.
d) Up-sample the concatenated TTS encoder output H_tts' through the duration module to obtain the up-sampled TTS encoder output H_extend.
e) Input the up-sampled H_extend into the VC decoder network for decoding to obtain the predicted result Ô_spk.
f) Calculate the error between the acoustic features O_spk of the speaker in the data pairs and the predicted result Ô_spk using the mean square error (MSE) loss criterion, back-propagate the gradient, and update the network parameters of the TTS encoder, the VC decoder and the reference encoder with an Adam optimizer under a training strategy with an initial learning rate of 1e-3 and the Noam learning-rate decay criterion, until the error value converges. After TTS pre-training is completed, the TTS encoder is discarded, and the trained model parameters of the VC decoder and the reference encoder are retained as initialization network parameters.
g) Input text_spk into the duration prediction network to output the duration prediction û_spk.
h) Calculate the error between the duration information u_spk and the duration prediction û_spk, back-propagate the gradient, and update the network parameters of the duration prediction network until convergence. A minimal sketch of one such pre-training step is given after these operations.
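The sketch below condenses steps a)-h) into a single training step; the module objects and batch layout are assumed interfaces, and the optimizer settings follow the text (Adam, initial learning rate 1e-3, combined with a Noam-style scheduler).

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def tts_pretrain_step(batch, tts_encoder, reference_encoder, duration_module,
                      vc_decoder, duration_predictor, optimizer):
    text_spk, O_spk, u_spk, ref_feats = batch
    H_tts = tts_encoder(text_spk)                            # b) encode text
    spk_emb = reference_encoder(H_tts, ref_feats)            # c) phoneme-level speaker embedding
    H_tts_cat = torch.cat([H_tts, spk_emb], dim=-1)          # c) dimension concatenation
    H_extend = duration_module.upsample(H_tts_cat, u_spk)    # d) duration expansion
    O_hat = vc_decoder(H_extend)                             # e) decode to acoustic features
    loss = mse(O_hat, O_spk)                                 # f) reconstruction error
    u_hat = duration_predictor(H_tts)                        # g) duration prediction
    loss = loss + mse(u_hat, u_spk.float())                  # h) duration error
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# optimizer over the TTS encoder, VC decoder, reference encoder and duration predictor, e.g.
# params = (list(tts_encoder.parameters()) + list(vc_decoder.parameters())
#           + list(reference_encoder.parameters()) + list(duration_predictor.parameters()))
# optimizer = torch.optim.Adam(params, lr=1e-3)   # plus a Noam-style learning-rate schedule
```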
2. VC pre-training stage: network initialization is performed (and the parameters are fixed) using the VC decoder and reference encoder parameters trained in the first step, and the acoustic features O_spk of all audio data in the D_TTS data set are then used to pre-train the VC encoder. VC pre-training specifically comprises the following operations:
a) Perform network initialization using the VC decoder and reference encoder parameters trained in the first step, and randomly initialize the network parameters of the VC encoder.
b) Input the acoustic features O_spk of speaker spk into the VC encoder for encoding, and down-sample the encoding result with the duration module to obtain the down-sampled VC encoder output H_vc1.
c) Input reference features of reference audio of speaker spk into the reference encoder to obtain speaker embedding information, and concatenate it along the feature dimension with the down-sampled VC encoder output to obtain the concatenated VC encoder output H_vc1'.
d) Up-sample the concatenated VC encoder output H_vc1' through the duration module to obtain the up-sampled VC encoder output H_extend1.
e) Input the up-sampled H_extend1 into the VC decoder network for decoding to obtain the predicted result Ô_spk.
f) Calculate the error between the acoustic features O_spk of the speaker and the predicted result Ô_spk using the mean square error (MSE) loss criterion, back-propagate the gradient, keep the network parameters of the VC decoder and the reference encoder fixed, and update the network parameters of the VC encoder with an Adam optimizer under a training strategy with an initial learning rate of 1e-3 and the Noam learning-rate decay criterion, until the error value converges. A minimal sketch of this stage follows.
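The sketch below condenses steps a)-f) of this stage; the frozen modules and the optimizer restricted to the VC encoder follow the text, while the module interfaces remain the same assumptions as in the previous sketch.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def vc_pretrain_step(batch, vc_encoder, reference_encoder, duration_module,
                     vc_decoder, optimizer):
    O_spk, u_spk, ref_feats = batch
    H_vc1 = duration_module.downsample(vc_encoder(O_spk), u_spk)   # b) encode and down-sample
    spk_emb = reference_encoder(H_vc1, ref_feats)                  # c) speaker embedding
    H_vc1_cat = torch.cat([H_vc1, spk_emb], dim=-1)                # c) concatenation
    H_extend1 = duration_module.upsample(H_vc1_cat, u_spk)         # d) up-sample
    O_hat = vc_decoder(H_extend1)                                  # e) decode
    loss = mse(O_hat, O_spk)                                       # f) reconstruction error
    optimizer.zero_grad(); loss.backward(); optimizer.step()       #    only the VC encoder is updated
    return loss.item()

# freeze the pre-trained parts and optimize only the VC encoder:
# for p in list(vc_decoder.parameters()) + list(reference_encoder.parameters()):
#     p.requires_grad = False
# optimizer = torch.optim.Adam(vc_encoder.parameters(), lr=1e-3)   # plus Noam-style decay
```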
3. VC training stage: network initialization is performed using the VC encoder parameters trained in the second step, and all acoustic features {O_src, O_trg} extracted from the audio data in the D_VC data set are used to adaptively train all model parameters. VC training specifically comprises the following operations:
a) Initialize with the network parameters of the VC encoder trained in the second step.
b) Input the acoustic features O_src of the original speaker into the VC encoder for encoding, and down-sample the encoding result with the duration module to obtain the down-sampled VC encoder output H_vc2.
c) Input reference features of reference audio of the target speaker trg into the reference encoder to obtain speaker embedding information, and concatenate it along the feature dimension with the down-sampled VC encoder output H_vc2 to obtain the concatenated output H_vc2'.
d) Up-sample the concatenated VC encoder output H_vc2' through the duration module to obtain the up-sampled VC encoder output H_extend2.
e) Input the up-sampled VC encoder output H_extend2 into the VC decoder network for decoding to obtain the predicted target result Ô_trg.
f) Calculate the error between the acoustic features O_trg of the target speaker and the predicted target result Ô_trg using the mean square error (MSE) loss criterion, back-propagate the gradient, and update all network parameters of the VC encoder, the VC decoder and the reference encoder with an Adam optimizer under a training strategy with a fixed learning rate of 1e-4, until convergence. A minimal sketch of this adaptation stage follows.
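The sketch below condenses steps a)-f) of the adaptation stage; using the target speaker's durations for the up-sampling step (so that the prediction can be compared with O_trg frame by frame) is an assumption not spelled out in the text, as are the module interfaces.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def vc_train_step(batch, vc_encoder, reference_encoder, duration_module,
                  vc_decoder, optimizer):
    O_src, O_trg, u_src, u_trg, ref_feats_trg = batch                # parallel pair; u_* from the ASR model
    H_vc2 = duration_module.downsample(vc_encoder(O_src), u_src)     # b) down-sample to text length
    spk_emb = reference_encoder(H_vc2, ref_feats_trg)                # c) target-speaker embedding
    H_vc2_cat = torch.cat([H_vc2, spk_emb], dim=-1)                  # c) concatenation
    H_extend2 = duration_module.upsample(H_vc2_cat, u_trg)           # d) up-sample (target durations assumed)
    O_hat_trg = vc_decoder(H_extend2)                                # e) decode
    loss = mse(O_hat_trg, O_trg)                                     # f) error against target features
    optimizer.zero_grad(); loss.backward(); optimizer.step()         #    all three modules are updated
    return loss.item()

# all parameters trainable, fixed learning rate as in the text:
# params = (list(vc_encoder.parameters()) + list(vc_decoder.parameters())
#           + list(reference_encoder.parameters()))
# optimizer = torch.optim.Adam(params, lr=1e-4)
```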
The method of training a speech conversion model according to the present application ends.
After training is completed, the trained speech conversion model typically goes through a testing phase before being used in a speech conversion task, to further verify and improve its conversion performance.
The flow in the test phase can also be similarly expressed as: inputting a test audio of an original speaker, performing VC coding and decoding, and outputting a target conversion audio, wherein the specific operations are as follows:
a) Perform network initialization with the parameters of the VC decoder, the VC encoder and the reference encoder trained according to the training method above.
b) Input the test acoustic features O_srctest extracted from the test audio of the original speaker into the VC encoder for encoding, and down-sample the result with the duration module to obtain the down-sampled VC encoder output H_vctest.
c) Input reference features of as much reference audio of the target speaker trg as possible into the reference encoder to obtain speaker embedding information, and concatenate it with the down-sampled VC encoder output H_vctest to obtain the concatenated VC encoder output H_vctest'.
d) Input the test text text_srctest of the original speaker into the trained duration prediction network to obtain the predicted duration information û_srctest.
e) Using the predicted duration information û_srctest, up-sample the concatenated encoder output H_vctest' through the duration module to obtain the up-sampled VC encoder output H_extendtest.
f) Input the up-sampled H_extendtest into the VC decoder network for decoding to obtain the predicted result Ô_trgtest.
g) Input Ô_trgtest into the vocoder to obtain the converted target test audio data. A minimal sketch of this conversion flow follows.
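The sketch below strings steps a)-g) together as a single conversion function; the vocoder, the alignment used to down-sample the test features, and the duration predictor's input format are assumptions, with the same caveats as the earlier sketches.

```python
import torch

@torch.no_grad()
def convert(O_srctest, text_srctest, ref_feats_trg, vc_encoder, reference_encoder,
            duration_module, vc_decoder, duration_predictor, vocoder):
    u_src = duration_module.align(O_srctest, text_srctest)               # b) frame-to-phoneme durations (assumed helper)
    H_vctest = duration_module.downsample(vc_encoder(O_srctest), u_src)  # b) encode and down-sample
    spk_emb = reference_encoder(H_vctest, ref_feats_trg)                 # c) as much target reference audio as possible
    H_vctest_cat = torch.cat([H_vctest, spk_emb], dim=-1)                # c) concatenation
    u_hat = duration_predictor(text_srctest).round().clamp(min=1).long() # d) durations predicted from the test text
    H_extendtest = duration_module.upsample(H_vctest_cat, u_hat)         # e) up-sample with predicted durations
    O_hat_trg = vc_decoder(H_extendtest)                                 # f) predicted target features
    return vocoder(O_hat_trg)                                            # g) converted waveform
```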
In general, the training scheme of the present application based on semi-parallel corpora does not place high data-volume demands on the original and target speakers, and a directly trained TTS model can even help generate parallel corpora. The core idea of the scheme is that, by pre-training the TTS and VC self-encoding networks, the amount of parallel training data required between the original speaker and the target speaker can be greatly reduced, which substantially lowers the data requirement of the sequence-to-sequence speech conversion model while still producing high-quality, highly reliable professional-grade audio.
Although the above-described embodiments use an independent duration model between the encoder and the decoder, it will be appreciated that the attention mechanism more commonly used in sequence-to-sequence tasks could also be used, with only a slightly inferior conversion effect. By comparison, because of the particularities of TTS/VC and similar time-sequence conversion tasks, replacing the usual attention mechanism with the hard alignment of the independent duration model helps the voice conversion model synthesize more robust speech and provides a better user experience.
It should be noted that the solution of the present application is mainly an improvement of the training procedure of the speech conversion model, not an innovation in the encoding and decoding techniques of the encoder and decoder themselves. Therefore, the data processing methods in the above procedures, such as feature extraction, encoding, decoding, concatenation, gradient back-propagation and convergence, can all be found in existing voice conversion technology and are not described in detail here. Likewise, the mean square error (MSE) loss criterion, the fixed learning rate of 1e-4, the Adam optimizer and the like described in the above embodiments are merely example algorithms that may be used to implement the described data processing; other algorithms may also be used.
The foregoing describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous. Moreover, it will be understood by those skilled in the relevant art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (7)

1. A method for speech conversion using a speech conversion model, comprising:
1) training the speech conversion model, comprising:
in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training the TTS encoder, the VC decoder and the reference encoder by using the text and acoustic feature data of a speaker;
in a VC pre-training stage, initializing and fixing network parameters of the VC decoder and the reference encoder, and training the VC encoder by using the acoustic characteristics of a speaker to determine the initialized network parameters of the VC encoder; and
in a VC training stage, initializing network parameters of the VC encoder, and training the VC encoder, the VC decoder and the reference encoder by using acoustic characteristics of an original speaker and a target speaker to determine final network parameters of the VC encoder, the VC decoder and the reference encoder which are pre-trained;
2) converting original speaker information into target speaker information by using a trained voice conversion model;
wherein, in the TTS pre-training phase:
randomly initializing network parameters of a TTS encoder, a VC decoder and a reference encoder;
encoding the text sequence of the speaker into TTS encoder output by the TTS encoder;
encoding reference characteristics based on the reference audio of the speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the output of the TTS encoder;
upsampling the spliced TTS encoder output by a duration module to obtain an upsampled TTS encoder output;
inputting the upsampled TTS encoder output to the VC decoder for decoding into a prediction; and
calculating the error between the acoustic characteristic of the speaker and the prediction result, back-propagating the gradient, and updating the network parameters of the VC decoder and the reference encoder until convergence;
wherein, in the VC pre-training phase:
performing network initialization with the network parameters of the VC decoder and the reference encoder trained in the TTS pre-training stage and fixing them, and randomly initializing the network parameters of the VC encoder;
encoding the acoustic characteristics of the speaker through the VC encoder, and down-sampling the encoding result by using the duration module to obtain the down-sampled VC encoder output;
encoding reference features based on the reference audio of the speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
upsampling the spliced VC encoder output by the duration module to obtain an upsampled VC encoder output;
inputting an upsampled VC encoder output to the VC decoder for decoding into another prediction result; and
calculating the error between the acoustic characteristic of the speaker and the other prediction result, back-propagating the gradient, and updating the network parameters of the VC encoder until convergence;
wherein, in the VC training phase:
performing network initialization using the network parameters of the VC encoder trained in a VC pre-training phase;
encoding the acoustic characteristics of the original speaker through the VC encoder, and down-sampling the encoding result by using the duration module to obtain the down-sampled VC encoder output;
encoding reference characteristics of a reference audio based on a target speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
upsampling the spliced VC encoder output by the duration module to obtain an upsampled VC encoder output;
inputting the upsampled VC encoder output to the VC decoder for decoding into a predicted target result; and
calculating the error between the acoustic characteristic of the target speaker and the predicted target result, back-propagating the gradient, and updating the network parameters of the VC encoder, the VC decoder and the reference encoder until convergence.
2. The method of claim 1, further comprising, during said TTS pre-training phase, the steps of:
outputting a duration prediction value by inputting the text sequence of the speaker to a duration prediction network;
and calculating the error between the duration information and the duration prediction value, reversely transmitting the gradient, and updating the network parameters of the duration prediction model until convergence.
3. The method of claim 1, wherein the method further comprises a testing phase in which:
network initialization is carried out on the trained network parameters of the VC decoder, the VC encoder and the reference encoder;
encoding the acoustic features extracted from the test audio of the original speaker through the VC encoder, and down-sampling the encoding result by using the duration module to obtain the down-sampled VC encoder output;
encoding reference characteristics of a reference audio based on a target speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
converting the test text of the original speaker into duration information through a trained duration prediction network;
upsampling the spliced VC encoder output by a duration module using the duration information to obtain an upsampled VC encoder output;
inputting the upsampled VC encoder output to the VC decoder for decoding into a predicted target result; and
and inputting the predicted target result into a vocoder to obtain converted target test audio data.
4. The method of claim 3, wherein the speaker comprises a plurality of speakers, and the speaker embedded information is used to distinguish between different speakers.
5. The method of claim 1, wherein, according to the characteristics of the speaker's utterance, the speaker's voice includes three kinds of information: semantic information, speaker timbre information and prosodic information, wherein the VC encoder output of the VC encoder provides the semantic information, the speaker embedded information of the reference encoder provides the speaker timbre information, and the duration module provides the prosodic information.
6. A speech conversion system comprising:
an input module configured to receive various data required for a voice conversion model from a data source;
a speech conversion model comprising a TTS encoder, a reference encoder, a VC encoder, and a VC decoder, wherein the speech conversion model is configured to:
in a TTS pre-training phase, determining initialized network parameters of the VC decoder and the reference encoder by training the TTS encoder, the VC decoder and the reference encoder using text data of a speaker;
in a VC pre-training phase, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder using acoustic features of a speaker to determine the initialized network parameters of the VC encoder; and
in a VC training phase, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder using acoustic features of an original speaker to determine the final network parameters of the VC encoder, the reference encoder and the VC decoder;
wherein, in the TTS pre-training phase:
randomly initializing the network parameters of the TTS encoder, the VC decoder and the reference encoder;
encoding the text sequence of the speaker into a TTS encoder output with the TTS encoder;
encoding reference features of the reference audio of the speaker into speaker embedded information with the reference encoder, and splicing the speaker embedded information with the TTS encoder output;
up-sampling the spliced TTS encoder output with a duration module to obtain an up-sampled TTS encoder output;
inputting the up-sampled TTS encoder output into the VC decoder for decoding into a prediction result; and
calculating the error between the acoustic features of the speaker and the prediction result, back-propagating the gradient, and updating the network parameters of the VC decoder and the reference encoder until convergence;
wherein, in the VC pre-training phase:
initializing and fixing the network parameters of the VC decoder and the reference encoder with the parameters trained in the TTS pre-training phase, and randomly initializing the network parameters of the VC encoder;
encoding the acoustic features of the speaker with the VC encoder, and down-sampling the encoding result with a duration module to obtain a down-sampled VC encoder output;
encoding reference features of the reference audio of the speaker into speaker embedded information with the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
up-sampling the spliced VC encoder output with the duration module to obtain an up-sampled VC encoder output;
inputting the up-sampled VC encoder output into the VC decoder for decoding into another prediction result; and
calculating the error between the acoustic features of the speaker and the other prediction result, back-propagating the gradient, and updating the network parameters of the VC encoder until convergence;
wherein, in the VC training phase:
performing network initialization with the network parameters of the VC encoder trained in the VC pre-training phase;
encoding the acoustic features of the original speaker with the VC encoder, and down-sampling the encoding result with a duration module to obtain a down-sampled VC encoder output;
encoding reference features of a reference audio of a target speaker into speaker embedded information with the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
up-sampling the spliced VC encoder output with the duration module to obtain an up-sampled VC encoder output;
inputting the up-sampled VC encoder output into the VC decoder for decoding into a predicted target result; and
calculating the error between the acoustic features of the target speaker and the predicted target result, back-propagating the gradient, and updating the network parameters of the VC encoder, the VC decoder and the reference encoder until convergence; and
an output module configured to output the target speaker information converted by the voice conversion model.
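The three-phase schedule implemented by the system of claim 6 can be summarised in a short orchestration sketch. The per-phase step functions, the optimiser choice and the learning rates are assumptions; only the initialise / freeze / fine-tune logic of the claim is illustrated.

```python
import torch

def run_three_phases(tts_encoder, vc_encoder, vc_decoder, ref_encoder,
                     tts_batches, vc_pretrain_batches, vc_batches,
                     tts_step, vc_pretrain_step, vc_train_step):
    # Phase 1 - TTS pre-training: jointly update TTS encoder, VC decoder and reference encoder.
    opt = torch.optim.Adam([*tts_encoder.parameters(), *vc_decoder.parameters(),
                            *ref_encoder.parameters()], lr=1e-4)
    for batch in tts_batches:
        tts_step(tts_encoder, vc_decoder, ref_encoder, opt, batch)

    # Phase 2 - VC pre-training: freeze decoder and reference encoder, train only the VC encoder.
    vc_decoder.requires_grad_(False)
    ref_encoder.requires_grad_(False)
    opt = torch.optim.Adam(vc_encoder.parameters(), lr=1e-4)
    for batch in vc_pretrain_batches:
        vc_pretrain_step(vc_encoder, vc_decoder, ref_encoder, opt, batch)

    # Phase 3 - VC training: unfreeze and fine-tune encoder, decoder and reference encoder jointly.
    vc_decoder.requires_grad_(True)
    ref_encoder.requires_grad_(True)
    opt = torch.optim.Adam([*vc_encoder.parameters(), *vc_decoder.parameters(),
                            *ref_encoder.parameters()], lr=1e-5)
    for batch in vc_batches:
        vc_train_step(vc_encoder, vc_decoder, ref_encoder, opt, batch)
```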
7. The speech conversion system of claim 6, wherein the duration module is configured to up-sample a text-length sequence of context vectors to a sequence having the same length as the acoustic feature sequence, or to down-sample the acoustic feature sequence to the length of the text sequence; and
the reference encoder is configured to extract speaker embedded information from reference audio data of speakers so as to distinguish between different speakers.
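A minimal reference-encoder sketch in the spirit of claim 7 is given below: it maps a reference mel spectrogram to a fixed-size speaker embedding (the duration module's up- and down-sampling behaviour was already sketched in the helpers of the first code block). The convolution/GRU layout and the dimensions are assumptions borrowed from common reference-encoder designs, not the patented structure.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(256, emb_dim, batch_first=True)

    def forward(self, ref_mel):                   # (B, T, n_mels) reference spectrogram
        h = self.conv(ref_mel.transpose(1, 2))    # (B, 256, T') after strided convolutions
        _, state = self.rnn(h.transpose(1, 2))    # final GRU state summarises the utterance
        return state.squeeze(0)                   # (B, emb_dim) speaker embedding
```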
CN202011460130.5A 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus Active CN112530403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011460130.5A CN112530403B (en) 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus

Publications (2)

Publication Number Publication Date
CN112530403A CN112530403A (en) 2021-03-19
CN112530403B (en) 2022-08-26

Family

ID=74999231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011460130.5A Active CN112530403B (en) 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus

Country Status (1)

Country Link
CN (1) CN112530403B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345452B (en) * 2021-04-27 2024-04-26 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model
CN113450759A (en) * 2021-06-22 2021-09-28 北京百度网讯科技有限公司 Voice generation method, device, electronic equipment and storage medium
CN113436609B (en) * 2021-07-06 2023-03-10 南京硅语智能科技有限公司 Voice conversion model, training method thereof, voice conversion method and system
CN113781996B (en) * 2021-08-20 2023-06-27 北京淇瑀信息科技有限公司 Voice synthesis model training method and device and electronic equipment
CN115910002B (en) * 2023-01-06 2023-05-16 之江实验室 Audio generation method, storage medium and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020027619A1 (en) * 2018-08-02 2020-02-06 네오사피엔스 주식회사 Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN111883149A (en) * 2020-07-30 2020-11-03 四川长虹电器股份有限公司 Voice conversion method and device with emotion and rhythm
CN112037754A (en) * 2020-09-09 2020-12-04 广州华多网络科技有限公司 Method for generating speech synthesis training data and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Enhanced Variational Autoencoder for Non-Parallel Corpus Voice Conversion" (《增强变分自编码器做非平行语料语音转换》); Huang Guojie et al.; Journal of Signal Processing (《信号处理》); 31 October 2018; Vol. 34, No. 10; pp. 1246-1251 *

Also Published As

Publication number Publication date
CN112530403A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN112530403B (en) Voice conversion method and system based on semi-parallel corpus
US11587569B2 (en) Generating and using text-to-speech data for speech recognition models
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
CN109147758B (en) Speaker voice conversion method and device
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
JP6989951B2 (en) Speech chain device, computer program and DNN speech recognition / synthesis mutual learning method
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
KR20230003056A (en) Speech recognition using non-speech text and speech synthesis
JP7152791B2 (en) Crosslingual speech conversion system and method
KR20230156121A (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
Luong et al. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech
CN111508470A (en) Training method and device of speech synthesis model
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN112530400A (en) Method, system, device and medium for generating voice based on text of deep learning
CN112002302B (en) Speech synthesis method and device
CN112185340B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
KR20220134347A (en) Speech synthesis method and apparatus based on multiple speaker training dataset
Zhang et al. Learning singing from speech
JP6542823B2 (en) Acoustic model learning device, speech synthesizer, method thereof and program
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
Gong et al. ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
Heymans et al. Efficient acoustic feature transformation in mismatched environments using a Guided-GAN
CN114333900B (en) Method for extracting BNF (BNF) characteristics end to end, network model, training method and training system
WO2023102932A1 (en) Audio conversion method, electronic device, program product, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant