CN112530403A - Voice conversion method and system based on semi-parallel corpus - Google Patents

Voice conversion method and system based on semi-parallel corpus

Info

Publication number
CN112530403A
CN112530403A (application CN202011460130.5A; granted as CN112530403B)
Authority
CN
China
Prior art keywords
encoder
speaker
decoder
training
tts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011460130.5A
Other languages
Chinese (zh)
Other versions
CN112530403B (en)
Inventor
吴梦玥
徐志航
陈博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangming Daily
Shanghai Jiaotong University
Original Assignee
Guangming Daily
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangming Daily, Shanghai Jiaotong University filed Critical Guangming Daily
Priority to CN202011460130.5A
Publication of CN112530403A
Application granted
Publication of CN112530403B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N20/00: Machine learning
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L13/00: Speech synthesis; Text to speech systems
                    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
                        • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
                    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
                        • G10L13/10: Prosody rules derived from text; Stress or intonation
                • G10L15/00: Speech recognition
                    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063: Training
                • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/003: Changing voice quality, e.g. pitch or formants
                        • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
                            • G10L21/013: Adapting to target pitch

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure relates to a scheme for training a speech conversion model, comprising: in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training them on the text and acoustic feature data of a speaker; in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder on the acoustic features of a speaker to determine its initialization network parameters; and in a VC training stage, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker and a target speaker to determine their final network parameters.

Description

Voice conversion method and system based on semi-parallel corpus
Technical Field
The present application relates to the field of voice conversion, and in particular, to a method and system for voice conversion based on semi-parallel corpora.
Background
Voice Conversion (VC) refers to changing the original speaker identity in an utterance to that of a specific target speaker by altering the timbre and pitch of the voice, without changing the semantic content of the speech. Voice conversion technology is widely used in speech signal processing and has very broad application prospects in personalized speech synthesis, pronunciation assistance, speech enhancement, multimedia entertainment and similar fields. With the maturity of deep neural networks, voice conversion has fully entered the neural-network era, and conversion performance has improved markedly.
Depending on the available training data, voice conversion can be divided into parallel-corpus and non-parallel-corpus approaches. In the parallel-corpus setting, the training corpora of the original speaker and the target speaker contain utterances with the same text content; the non-parallel setting has no such requirement.
Voice conversion techniques based on parallel corpora fall into two types:
1. Parallel corpora of different lengths are first aligned to the same length by dynamic time warping, and a conversion network is then trained with fixed-length sequence models such as DNNs or LSTMs.
2. A sequence-to-sequence conversion method is used, in which the model learns the correspondence between the original and target feature sequences through an attention mechanism, thereby handling variable-length modeling.
There are three different lines of speech conversion technology based on non-parallel corpora:
1. phoneme posterior probability graph method (Phonetic PosteriorGrams, PPGs)
The core idea of this approach is to use a speaker independent feature as an intermediate feature to mediate between the original and target acoustic features. The intermediate features can be extracted from the voice of any original speaker through the extractor of the speaker independent features, and then the voice conversion can be realized only by training a mapping model from the speaker independent features to the acoustic features of the target speaker. The most intuitive speaker independent feature is a text feature, so the text uses the phoneme posterior probability map corresponding to each frame as an intermediate feature, and uses an Automatic Speech Recognition (ASR) system as an extractor of the feature.
2. Adversarial training method
The adversarial training method mainly refers to a line of work represented by the Cycle-Consistent Generative Adversarial Network (CycleGAN). A CycleGAN-based voice conversion method was proposed in 2017. It is based on dual learning and contains two mutually dual generative models; connecting the two dual models in series yields two cycle-reconstruction paths, while discriminators constrain the reconstructed intermediate results, enabling unsupervised training. In the test stage, only one of the four models, a single generator, is needed as the conversion model, and the conversion process does not differ essentially from a standard voice conversion method.
3. Variational autoencoder method
The Variational Autoencoder (VAE) consists of an encoder and a decoder: the encoder converts the input acoustic features into speaker-independent latent variables, and the decoder restores those latent variables to the encoder input. VAE-based voice conversion rests on an information-extraction assumption: each frame of acoustic features contains both speaker information and speaker-independent information, and the encoder should extract as much speaker-independent information as possible from each frame; the KL-divergence constraint in the VAE is, in essence, an attempt to remove speaker information from the acoustic features.
However, each of the voice conversion techniques described above has its own drawbacks:
From the task perspective, sequence-to-sequence conversion methods generally require more training data; the cost and difficulty of collecting parallel corpora are high, a large amount of parallel data is hard to obtain, and this is impractical in real use. In addition, the attention-based conversion method in the parallel-corpus family is prone to semantic errors because of the instability of the attention mechanism.
Among the non-parallel-corpus methods, the phonetic posteriorgram method and the variational autoencoder method are built on the ideas of decoupling and information extraction, so information leakage easily occurs and the converted timbre is not similar to the target. For the adversarial training methods, GAN training is unstable due to the particularities of the model structure.
These disadvantages arise for different reasons, for example:
Semantic errors in the attention-based conversion method of the parallel-corpus family are mainly caused by the instability of attention itself.
Among the non-parallel-corpus methods, the converted timbre of the phonetic posteriorgram and variational autoencoder methods is not similar to the target because strict decoupling cannot be guaranteed during information extraction: there is no way to ensure that the extracted speaker-independent information contains none of the original speaker's information, so the timbre transferred to the target speaker deviates. The adversarial training methods cannot guarantee that the model will converge, because of the particularities of the GAN, so training is very unstable and hard to adapt to all data types.
There is therefore a need for a robust voice conversion technique.
Disclosure of Invention
The application relates to a speech conversion technology based on semi-parallel corpora.
According to an aspect of the present application, there is provided a method for training a speech conversion model, comprising: in a TTS (speech synthesis) pre-training stage, determining initialization network parameters of a VC decoder and a reference encoder by training a TTS encoder, the VC decoder and the reference encoder on the text and acoustic feature data of a speaker; in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder on the acoustic features of a speaker to determine its initialization network parameters; and in a VC training stage, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker and a target speaker to determine their final network parameters.
According to another aspect of the present application, there is provided a speech conversion system comprising a TTS encoder, a reference encoder, a VC encoder and a VC decoder, wherein the speech conversion system is configured to: in a TTS pre-training stage, determine initialization network parameters of the VC decoder and the reference encoder by training the TTS encoder, the VC decoder and the reference encoder using speaker text data; in a VC pre-training stage, initialize and fix the network parameters of the VC decoder and the reference encoder, and train the VC encoder on the acoustic features of a speaker to determine its initialization network parameters; and in a VC training stage, initialize the network parameters of the VC encoder, and train the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker to determine their final network parameters.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
FIG. 1 is an example block diagram of a speech conversion platform according to one embodiment of the present application.
FIG. 2 is an example flow of a method for training a speech conversion model according to one embodiment of the present application.
Detailed Description
As mentioned above, both the existing parallel corpus based speech conversion techniques and the non-parallel corpus based speech conversion techniques have their own drawbacks.
To address the data problem in parallel-corpus voice conversion, the most direct remedy is to record large amounts of parallel corpora at high cost to meet the requirements of sequence-to-sequence training. For non-parallel voice conversion, on the other hand, the most common approach is to carefully tune adversarial-training or information-bottleneck models through extensive adjustment of network parameters. Either solution requires complex additional steps to ensure accurate and stable conversion, which typically means higher hardware costs, heavier training, and more latency before the desired result is obtained. Since voice conversion itself demands high real-time performance, these solutions cannot truly satisfy its requirements.
The present application therefore proposes a voice conversion technique based on semi-parallel corpora, a scenario closer to real conditions. The technique is also sequence-to-sequence, so it can fully draw on current mainstream TTS technology and use a strong TTS model structure to initialize the VC model parameters, greatly reducing the model's need for parallel data. Moreover, training on parallel data in the final stage is in fact a small-data adaptation step, which further reduces the required data volume, and training converges very quickly.
In general, the robust semi-parallel corpus-based speech conversion technique proposed by the present application mainly includes the following three stages:
in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training them on the text and acoustic feature data of a speaker;
in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder on the acoustic features of a speaker to determine its initialization network parameters; and
in a VC training stage, initializing the network parameters of the VC encoder, and training the pre-trained VC encoder, VC decoder and reference encoder on the acoustic features of an original speaker and a target speaker to determine their final network parameters.
In this way, the decoder of the speech synthesis model is pre-trained on non-parallel training data, the pre-trained decoder is used to further train the acoustic-feature encoder, and finally a small amount of parallel corpus is used to migrate the initialized model to the voice conversion task through adaptive learning, thereby alleviating the data-volume problem of sequence-to-sequence training.
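For orientation, the following is a minimal sketch of how the three stages could be scheduled in PyTorch. The module and loader names (tts_encoder, vc_encoder, vc_decoder, ref_encoder, the data loaders and step functions) are hypothetical and not taken from the patent text; the learning rates follow the values given later in the description.
```python
import torch

def run_stage(update_modules, frozen_modules, loader, step_fn, lr):
    """Generic training stage: freeze some modules, update the rest with Adam."""
    for m in update_modules:
        m.requires_grad_(True)                       # modules being trained in this stage
    for m in frozen_modules:
        m.requires_grad_(False)                      # keep pre-trained parts fixed
    params = [p for m in update_modules for p in m.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr)
    for batch in loader:
        loss = step_fn(batch)                        # forward pass + MSE reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1 (TTS pre-training on non-parallel text/audio):
#   run_stage([tts_encoder, vc_decoder, ref_encoder], [], tts_loader, tts_step, lr=1e-3)
# Stage 2 (VC pre-training, acoustic auto-encoding, decoder and reference encoder fixed):
#   run_stage([vc_encoder], [vc_decoder, ref_encoder], audio_loader, vc_step, lr=1e-3)
# Stage 3 (VC training on a small parallel corpus, all parameters updated):
#   run_stage([vc_encoder, vc_decoder, ref_encoder], [], parallel_loader, vc_step, lr=1e-4)
```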
Meanwhile, by using an independent duration model to control the output features after conversion, the present application alleviates, to a certain extent, the semantic-error problem of attention-based conversion and improves conversion accuracy. In particular, from the perspective of non-parallel voice conversion, ground-truth durations are introduced during training to help compress and extract information, rather than a KL-divergence or fixed-length downsampling method, which ensures the stability of information extraction and the quality of the converted speech.
With the above overview in mind, an exemplary block diagram of a speech conversion platform according to one embodiment of the present application is described below with reference to FIG. 1.
As shown in the figure, the voice conversion platform mainly comprises data sources 101(1), 101(2), ..., 101(n) that provide various data, and a voice conversion system 110. The speech conversion system 110 obtains the required data from the data sources over, for example, a wireless or wired connection, depending on operational needs. Wireless connections may include networks such as the Internet, WLAN, cellular networks, Bluetooth, NFC, Wi-Fi, and the like. Wired connections may include cables, USB connections, USB-C connections, and the like. If necessary, data may also be supplied to the voice conversion system 110 through a removable storage medium (e.g., a floppy disk, hard disk, optical disc, or USB flash drive).
For example, in an initial stage, when the speech conversion model in the speech conversion system 110 needs to be trained, TTS pre-training may be performed by first obtaining the text of one or more speakers (e.g., a speaker's speech script) from data source 101(1); audio data of the original speaker can then be obtained from another data source 101(2) and fed to the speech conversion model for VC (pre-)training. In the use phase, real-time speech of the original speaker to be converted (e.g., captured by a microphone) may be obtained from data source 101(n) and converted into speech of the target speaker. Thus, depending on the task, the speech conversion system interfaces with the corresponding data source to obtain the required input data.
Next, as shown, the speech conversion system 110 may include an input module 111, a speech conversion model, and an output module 117. The voice conversion model mainly comprises: TTS encoder 112, VC encoder 113, duration module 114, VC decoder 115, and reference encoder 116. These portions are in data communication with each other via wired or wireless connections.
The input module 111 is primarily configured to receive various data required by the speech conversion model from a data source.
The TTS encoder 112 is primarily configured, during the TTS training phase, to encode the received text sequence of the original speaker into output context information that provides semantic information.
The VC encoder 113 is primarily configured, during the VC (pre-)training phase, to encode the received acoustic features of the original speaker (extracted from the original speaker's audio data) into output context information that provides semantic information.
The duration module 114 is primarily configured to up-sample a text-length sequence of context vectors to the length of the acoustic feature sequence, or to down-sample the acoustic feature sequence to the length of the text sequence, so as to reflect the speaker's prosodic information. The duration module 114 may include an ASR model that extracts ground-truth durations from the speech data to provide prosody information during training, and a duration prediction network that provides predicted durations during testing.
VC decoder 115 is primarily configured to decode context information from the encoder back into acoustic features.
The reference encoder 116 is primarily configured to extract the speaker's timbre information from the speaker's reference audio data, which helps distinguish between different speakers.
The output module 117 is mainly configured to output the audio data of the target speaker converted by the speech conversion model to the user, for example, by providing the audio data to a speaker, a loudspeaker, a headphone, or the like to directly play the audio data.
It should be appreciated that in order to provide good voice conversion services, the voice conversion model must be trained and tested before formal voice conversion is performed. Only a properly trained speech conversion model can provide satisfactory conversion results. The scheme of the application aims at improving the training scheme of the voice conversion model.
With the system architecture of the speech conversion system of the present application in mind, an example flow of a method for training a speech conversion model in accordance with one embodiment of the present application is described below in conjunction with FIG. 2.
Before describing the training scheme, some basic concepts of voice conversion are introduced to aid understanding.
Basic concept
An utterance contains several types of information. Based on the characteristics of speech and human pronunciation, it can be assumed that speech contains at least three types of information: semantic information (content), speaker timbre information (timbre), and prosodic information (rhythm). In general, these can be written as a triplet (c, t, r), where c, t and r denote the semantic, timbre and prosodic information respectively. For ease of distinction, they may carry the subscript "src" to indicate that the information comes from the original speaker or "trg" to indicate that it comes from the target speaker. The voice conversion task takes the input speech of the original speaker, (c_src, t_src, r_src), strips the speaker timbre t_src from the triplet and replaces it with the target speaker's timbre t_trg while preserving the semantic information c_src, thereby constructing a new triplet (c_src, t_trg, r_avg) that realizes the conversion. There is generally no strict requirement that the prosody of the original speaker or of the target speaker must be used; here, following common practice, a certain average prosody r_avg is assumed for the experiments, which may be a statistical average of the prosodic information of many speakers. Under these requirements, the most important properties of a good voice conversion system are: while preserving the original speaker's semantics c_src, remove the original speaker's timbre t_src as thoroughly as possible, and restore high-quality acoustic features by adding the target speaker's timbre t_trg.
Based on this view of voice conversion, the present application designs a new training method for a voice conversion model based on semi-parallel corpora. In summary, a sequence-to-sequence voice conversion model is pre-trained on a large amount of non-parallel corpus, and the pre-trained model is then adaptively trained with a small amount of parallel corpus, so that a high-quality voice conversion model can be obtained quickly.
First, as illustrated in FIG. 1, the speech conversion model can generally be divided into four parts: an encoder, a reference encoder, a duration module and a decoder. This structure can describe both a speech synthesis framework and a voice conversion framework. To distinguish TTS from VC within this structure, the encoder and decoder are further named by task according to the features they model: a TTS encoder 112 that encodes text into context information, a VC encoder 113 that encodes speech features into context information, and a VC decoder 115 that decodes context information back into speech features. Note that no TTS decoder is mentioned here, since a text autoencoder is of little use for synthesis or voice conversion.
As shown in fig. 2, the training method of the voice conversion model can be divided into three stages, namely, a TTS pre-training stage, a VC pre-training stage, and a VC training stage.
First, in the TTS pre-training stage, suppose that for a certain training speaker spk the input module receives that speaker's text data from the corresponding data source, and the TTS encoder encodes the received text into an encoder hidden representation H_tts that contains the textual context and provides the semantic information c_spk. Meanwhile, the model receives the speaker's timbre information t_spk, which the reference encoder produces from the reference features, plus the prosodic information r_spk provided by the duration information obtained from the duration module. From the triplet (c_spk, t_spk, r_spk) the acoustic features of speaker spk are reconstructed, thereby training the TTS encoder and VC decoder networks. Through TTS pre-training, a conventional multi-speaker TTS model based on an independent duration model is obtained, and it is also ensured that the pre-trained VC decoder can synthesize high-quality audio when given a correct hidden representation H.
Second, in the VC pre-training stage, the network parameters of the VC decoder are initialized and fixed. Because the model has already been TTS pre-trained with a large amount of non-parallel corpus (text) before the VC decoder is initialized, the VC decoder's initialization in this stage is no longer random but lies in a comparatively optimized region, which greatly reduces the need for parallel data in subsequent training.
The VC encoder is then trained in an autoencoder fashion. Similarly to the previous step, the VC encoder receives the acoustic features O_spk of speaker spk from the data source via the input module and encodes them into an encoder hidden representation H_vc that provides the semantic information c_spk. Together with the speaker timbre t_spk provided by the reference encoder output and the prosodic information r_spk provided by the duration information from the duration module, the triplet (c_spk, t_spk, r_spk) is used to reconstruct the acoustic features of speaker spk. Fixing the VC decoder forces the VC encoder to encode its input into a hidden representation with the same output distribution as the TTS encoder, i.e., to keep the distributions of H_tts and H_vc consistent. Keeping the distributions consistent has two benefits: it ensures that H_vc contains enough semantic information to reconstruct high-quality acoustic features, and, because H_tts is produced from plain text input and therefore cannot contain speaker timbre information, it constrains H_vc not to capture the speaker's timbre, reducing the leakage of speaker information.
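As a concrete illustration of the fixing described above, the sketch below loads stage-1 parameters and freezes the VC decoder and reference encoder so that only the VC encoder receives gradients; the file names and module variables are assumptions, not part of the patent.
```python
import torch

def setup_vc_pretraining(vc_encoder, vc_decoder, ref_encoder,
                         decoder_ckpt="vc_decoder_tts_pretrain.pt",
                         ref_ckpt="ref_encoder_tts_pretrain.pt",
                         lr=1e-3):
    """Initialize from TTS pre-training and freeze decoder/reference encoder."""
    vc_decoder.load_state_dict(torch.load(decoder_ckpt))
    ref_encoder.load_state_dict(torch.load(ref_ckpt))
    for module in (vc_decoder, ref_encoder):
        for p in module.parameters():
            p.requires_grad = False            # fixed during VC pre-training
    return torch.optim.Adam(vc_encoder.parameters(), lr=lr)
```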
In another embodiment, this training process is still possible without fixing the VC decoder. However, there is then no guarantee that the VC encoder output H_vc contains no timbre information t_spk, so speaker information may leak from the VC encoder, which is detrimental to the VC training of the third step.
Third, in the VC training stage, on the basis of the initialized VC encoder and VC decoder obtained from the two pre-training steps, the pre-trained VC decoder, VC encoder and reference encoder are quickly re-trained with a small amount of parallel corpus to obtain the final voice conversion model.
The pre-training-based method introduces more prior knowledge, and greatly reduces the data volume requirement of the sequence-to-sequence speech conversion model, so that the good speech conversion model can be obtained by training more easily and quickly.
For the concrete choice of encoder and decoder in the above scheme, one of the most popular TTS model structures can be used: a Transformer-based encoder and decoder. The Transformer is a network structure that replaces traditional sequence models (such as the LSTM) and can fully extract the context of a sequence through self-attention and multi-head attention. It should be understood that the Transformer-based encoder and decoder are only one example of a specific model; the model is not limited to the Transformer, and many other models are equally applicable to the voice conversion scheme of the present application.
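As an illustration only, a Transformer-based context encoder of the kind mentioned here might look as follows in PyTorch; the dimensions and layer counts are assumptions rather than values from the patent.
```python
import torch
from torch import nn

class ContextEncoder(nn.Module):
    """Encode a text-embedding or acoustic-feature sequence into context vectors H."""
    def __init__(self, input_dim=80, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                  # x: (batch, length, input_dim)
        return self.encoder(self.proj(x))  # context representation, (batch, length, d_model)
```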
Meanwhile, to increase the amount of data for model pre-training, some embodiments introduce training data from multiple speakers; the speaker embedding information (i.e., speaker timbre information) distinguishes different speakers and is an important input that lets the encoder fully learn speaker-independent information. The scheme uses an attention-based method to extract phoneme-level speaker embeddings: for the input reference audio, the correspondence between the encoder output and the reference audio is computed, phoneme-level speaker embeddings are extracted, spliced onto the encoder output, and then fed to the decoder. For the reference features input to the reference encoder, the training phase uses several sentences randomly selected from all sentences of the current speaker (the number determined by GPU memory); in the testing phase, as many sentences of the target speaker as possible are input to achieve a better timbre-extraction effect.
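The phoneme-level, attention-based extraction of speaker embeddings could be sketched as below; the use of a single multi-head attention layer and the tensor shapes are assumptions for illustration.
```python
import torch
from torch import nn

class PhonemeLevelReferenceEncoder(nn.Module):
    """Attend from encoder outputs to reference-audio frames and splice the result."""
    def __init__(self, ref_dim=80, d_model=256, n_heads=4):
        super().__init__()
        self.ref_proj = nn.Linear(ref_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, encoder_out, ref_feats):
        # encoder_out: (B, L_tokens, d_model) queries; ref_feats: (B, L_ref, ref_dim) keys/values
        ref = self.ref_proj(ref_feats)
        spk_emb, _ = self.attn(encoder_out, ref, ref)      # phoneme-level speaker embedding
        return torch.cat([encoder_out, spk_emb], dim=-1)   # splice onto the encoder output
```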
In addition to providing an improvement of pre-training prior to formally training the speech conversion model, the present application also provides an improved duration model.
As described above, although the prosodic information r is not strictly constrained in the information hypothesis, it does affect the voice conversion system. Incomplete or inaccurate extraction of the original speaker's prosody may cause semantic errors in the converted speech. In some scenarios it may also be desirable to convert the speaker's prosody during conversion so that the converted audio follows the target speaker's speaking rhythm.
To address these problems, the present application introduces a separate duration model into the speech conversion model. The duration model is highly correlated with prosody: it models the length of the acoustic features corresponding to a token in the current context. Prosodic information is therefore extracted from the input speech through an independent duration model, and new prosodic information is supplied to re-synthesize audio during reconstruction or conversion.
Generally, in the training phase the duration information of the training data is obtained from the pre-trained ASR model in the duration module; in the testing phase the durations cannot be obtained directly, so, as in TTS, a duration prediction network implemented with a bidirectional LSTM is trained at the same time to provide duration information.
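A bidirectional-LSTM duration predictor of the kind described here might be sketched as follows; the hidden size and the choice to predict one scalar per token are assumptions.
```python
import torch
from torch import nn

class DurationPredictor(nn.Module):
    """Predict the number of acoustic frames for each text token."""
    def __init__(self, d_model=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(d_model, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, encoder_out):          # encoder_out: (batch, L_text, d_model)
        h, _ = self.lstm(encoder_out)
        return self.out(h).squeeze(-1)       # predicted durations, (batch, L_text)
```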
The duration module is formed by applying the duration model to the current sequence-to-sequence model.
The duration model has two important functions: up-sampling and down-sampling. Upsampling refers to expanding a text sequence into a sequence of feature lengths (which may also be referred to as "duration expansion"), while downsampling refers to downsampling a feature sequence into a sequence of text lengths.
During up-sampling, the text-length sequence of context vectors is expanded by repetition into a sequence of the same length as the acoustic features, and information such as positional encoding is added. This is a very common structure in end-to-end speech synthesis; the up-sampling brings good stability in the TTS task, helps alleviate the synthesis instability caused by the very small amount of parallel data in the VC task, and improves the naturalness of the converted audio.
During down-sampling, the acoustic feature sequence is segmented according to the durations, and each variable-length feature segment is converted into a fixed-length vector by a pooling operation, forming a feature sequence of the same length as the text.
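The two operations can be sketched as follows for a single utterance; average pooling is used here as one possible pooling choice, which is an assumption.
```python
import torch

def upsample_by_duration(h_text, durations):
    """Duration expansion: repeat each token vector durations[i] times.
    h_text: (L_text, D); durations: (L_text,) integer frame counts."""
    return torch.repeat_interleave(h_text, durations, dim=0)       # (sum(durations), D)

def downsample_by_duration(h_frames, durations):
    """Pool each variable-length frame segment back to one vector per token."""
    segments = torch.split(h_frames, durations.tolist(), dim=0)
    return torch.stack([seg.mean(dim=0) for seg in segments])      # (L_text, D)
```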
Similarly, the roles of the duration model in the various stages are described below in conjunction with the three stages of training the speech conversion model described above.
First step: in the TTS pre-training stage, the text and the corresponding durations naturally separate out the prosodic information, so the duration model only needs to up-sample the TTS encoder output H_tts (i.e., duration expansion).
Second step: in the VC pre-training stage, the duration model first down-samples the VC encoder output, which also separates out the prosodic information, i.e., the VC encoder output H_vc is down-sampled to the text length and then up-sampled.
Third step: in the VC training stage, the duration model performs the same operations as in the second step.
In some embodiments, if prosodic conversion is also considered in the speech conversion, the duration prediction network of the target speaker may be trained to obtain duration information that conforms to the prosodic rhythm of the target speaker for upsampling.
Having described an exemplary flow of a method for training a speech conversion model, for ease of understanding, a specific flow of the method is described below in connection with a specific application example.
First, two data sets are defined: a small parallel corpus data set D_VC = {D_src, D_trg} and a multi-speaker non-parallel corpus data set D_TTS = {D_spk1, D_spk2, ...}. All data of a speaker spk can be expressed as a set of data pairs {(text_spk, audio_spk)}. The audio data is denoted O after acoustic feature extraction. src and trg respectively denote the original speaker and the target speaker to be converted, so the parallel corpus data set after acoustic feature extraction can be written as {O_src, O_trg}. To support training and testing of the duration module, the duration information, denoted u_spk, may be extracted with a pre-trained ASR model, so that the data set of each speaker can finally be written as D_spk = {(text_spk, O_spk, u_spk)}.
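The data convention above can be mirrored directly in code; the field names and types below are illustrative assumptions only.
```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class Utterance:
    """One element of D_spk = {(text_spk, O_spk, u_spk)}."""
    text: List[str]        # phoneme or character sequence text_spk
    feats: np.ndarray      # acoustic features O_spk, shape (frames, feat_dim)
    durations: np.ndarray  # per-token frame counts u_spk from a pre-trained ASR aligner

d_tts: Dict[str, List[Utterance]] = {}                       # multi-speaker non-parallel set D_TTS
d_vc: Dict[str, List[Utterance]] = {"src": [], "trg": []}    # small parallel set D_VC
```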
As mentioned above, the main idea of the whole training process is to use a large amount of easily available non-parallel corpus data D_TTS in the pre-training stage to learn initialization network parameters (or a parameter region) as the initial parameters of the speech conversion model, and then to adaptively adjust the model with the small, limited parallel corpus data set D_VC to obtain the final voice conversion model. Compared with the traditional scheme of training a voice conversion model directly from random parameters, the pre-trained model greatly reduces the need for parallel data during the subsequent formal training. Training of the whole model is divided into the following three steps.
1. TTS pre-training stage: the D_TTS data set is used to train a multi-speaker TTS model and thereby complete the pre-training of the VC decoder. TTS pre-training specifically comprises the following operations:
a) Randomly initialize the network parameters of the TTS encoder, the VC decoder and the reference encoder.
b) The text sequence text_spk of speaker spk in the D_TTS data set is processed by the text front-end and input into the TTS encoder, which encodes it to obtain the TTS encoder output H_tts.
c) The reference features of reference audio of speaker spk are input into the reference encoder to obtain the speaker embedding information (e.g., timbre information in the form of a representation variable), which is spliced along the feature dimension with the TTS encoder output H_tts (variable-matrix concatenation) to obtain H_tts'.
d) The spliced TTS encoder output H_tts' is up-sampled by the duration module to obtain the up-sampled TTS encoder output H_extend.
e) The up-sampled H_extend is input into the VC decoder network and decoded to obtain the prediction Ô_spk.
f) The error between the acoustic features O_spk of the speaker in the data pairs and the prediction Ô_spk is computed with the mean squared error (MSE) loss; the gradient is back-propagated, and the network parameters of the TTS encoder, the VC decoder and the reference encoder are updated with an Adam optimizer, an initial learning rate of 1e-3 and the Noam learning-rate decay schedule, until the error converges. After TTS pre-training is completed, the TTS encoder is discarded, and the trained model parameters of the VC decoder and the reference encoder are kept as initialization network parameters.
g) text_spk is input to the duration prediction network to output the duration prediction û_spk.
h) The error between the duration information u_spk and the duration prediction û_spk is computed, the gradient is back-propagated, and the network parameters of the duration prediction network are updated until convergence.
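One TTS pre-training step matching operations b) through f), together with the Noam learning-rate decay, might be implemented roughly as below; the module interfaces and the exact Noam scaling are assumptions.
```python
import torch
from torch import nn

def noam_factor(step, d_model=256, warmup=4000):
    """Noam decay factor, to be combined with the initial learning rate of 1e-3."""
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)

def tts_pretrain_step(batch, tts_encoder, ref_encoder, duration_module, vc_decoder, optimizer):
    h_tts = tts_encoder(batch["text"])                            # b) encode text to H_tts
    h_cat = ref_encoder(h_tts, batch["ref_feats"])                # c) splice speaker embedding
    h_ext = duration_module.upsample(h_cat, batch["durations"])   # d) duration expansion
    o_hat = vc_decoder(h_ext)                                     # e) decode to acoustic features
    loss = nn.functional.mse_loss(o_hat, batch["feats"])          # f) MSE against O_spk
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(params, lr=1e-3)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_factor)
```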
2. VC pre-training stage: the network is initialized and fixed with the VC decoder parameters trained in the first step, and then all audio data O_spk in the D_TTS data set are used. VC pre-training specifically comprises the following operations:
a) The network is initialized with the VC decoder and reference encoder parameters trained in the first step, and the network parameters of the VC encoder are randomly initialized.
b) The acoustic features O_spk of speaker spk are input into the VC encoder for encoding, and the encoding result is down-sampled by the duration module to obtain the down-sampled VC encoder output H_vc1.
c) The reference features of reference audio of speaker spk are input into the reference encoder to obtain the speaker embedding information, which is spliced along the feature dimension with the down-sampled VC encoder output to obtain the spliced VC encoder output H_vc1'.
d) The spliced VC encoder output H_vc1' is up-sampled by the duration module to obtain the up-sampled VC encoder output H_extend1.
e) The up-sampled H_extend1 is input into the VC decoder network and decoded to obtain the prediction Ô_spk.
f) The error between the acoustic features O_spk of the speaker and the prediction Ô_spk is computed with the mean squared error (MSE) loss; the gradient is back-propagated, and, with an Adam optimizer, an initial learning rate of 1e-3 and the Noam learning-rate decay schedule, the VC decoder and reference encoder are kept fixed while the network parameters of the VC encoder are updated until the error converges.
3. VC training stage: the network is initialized with the VC encoder parameters from the second training step, and all acoustic features {O_src, O_trg} extracted from all audio data in the D_VC data set are used to adaptively train all model parameters. VC training specifically comprises the following operations:
a) The network is initialized with the VC encoder parameters trained in the second step.
b) The acoustic features O_src of the original speaker are input into the VC encoder for encoding, and the encoding result is down-sampled by the duration module to obtain the down-sampled VC encoder output H_vc2.
c) The reference features of reference audio of the target speaker trg are input into the reference encoder to obtain the speaker embedding information, which is spliced along the feature dimension with the down-sampled VC encoder output H_vc2 to obtain the spliced H_vc2'.
d) The spliced VC encoder output H_vc2' is up-sampled by the duration module to obtain the up-sampled VC encoder output H_extend2.
e) The up-sampled VC encoder output H_extend2 is input into the VC decoder network and decoded to obtain the predicted target result Ô_trg.
f) The error between the acoustic features O_trg of the target speaker and the predicted target result Ô_trg is computed with the mean squared error (MSE) loss; the gradient is back-propagated, and, with a fixed learning rate of 1e-4 and an Adam optimizer, all network parameters of the VC encoder, the VC decoder and the reference encoder are updated until convergence.
The method of training a speech conversion model according to the present application ends.
After training is completed, the trained speech conversion model typically goes through a testing phase before being used in a voice conversion task, so as to verify and further improve its conversion performance.
The test-phase flow can be expressed similarly: the test audio of an original speaker is input, VC encoding and decoding are performed, and the target converted audio is output. The specific operations are as follows:
a) The network is initialized with the parameters of the VC decoder, the VC encoder and the reference encoder trained according to the training method above.
b) The test acoustic features O_srctest extracted from the test audio of the original speaker are input into the VC encoder for encoding, and the result is down-sampled by the duration module to obtain the down-sampled VC encoder output H_vctest.
c) The reference features of as much reference audio of the target speaker trg as possible are input into the reference encoder to obtain the speaker embedding information, which is spliced along the feature dimension with the down-sampled VC encoder output H_vctest to obtain the spliced VC encoder output H_vctest'.
d) The trained duration prediction network takes the test text text_srctest of the original speaker as input to obtain the predicted duration information û_srctest.
e) Using the predicted duration information û_srctest, the spliced encoder output H_vctest' is up-sampled by the duration module to obtain the up-sampled VC encoder output H_extendtest.
f) The up-sampled H_extendtest is input into the VC decoder network and decoded to obtain the prediction Ô_trgtest.
g) Ô_trgtest is input into the vocoder to obtain the converted target test audio data.
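Steps a) through g) of the test phase could be strung together roughly as below. The module interfaces, and in particular the align() helper used to obtain source-side durations for down-sampling, are hypothetical and only illustrate the order of operations.
```python
import torch

@torch.no_grad()
def convert(o_src_test, ref_feats_trg, text_src, vc_encoder, ref_encoder,
            duration_module, duration_predictor, vc_decoder, vocoder):
    h = vc_encoder(o_src_test)                                   # b) encode source acoustics
    u_src = duration_module.align(o_src_test, text_src)          # source durations (hypothetical helper)
    h = duration_module.downsample(h, u_src)                     # b) down-sample to text length
    h = ref_encoder(h, ref_feats_trg)                            # c) splice target-speaker embedding
    u_hat = duration_predictor(h).round().clamp(min=1).long()    # d) predicted durations
    h = duration_module.upsample(h, u_hat)                       # e) duration expansion
    o_hat = vc_decoder(h)                                        # f) predicted target features
    return vocoder(o_hat)                                        # g) waveform of the converted speech
```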
In general, the training scheme of the present application for a voice conversion model based on semi-parallel corpora places no high demands on the data volume of the original and target speakers, and the directly trained TTS model can even help generate parallel corpora. The core idea is that, through pre-training of the TTS and VC autoencoding networks, the amount of parallel training data needed between the original and target speakers can be reduced dramatically, greatly lowering the data requirement of the sequence-to-sequence conversion model while still producing high-quality, highly reliable professional audio.
Although the embodiments described above use a separate duration model between the encoder and decoder, it is certainly possible to use the attention mechanism that is more common in sequence-to-sequence tasks, at the cost of slightly worse conversion. By comparison, because of the particular nature of TTS/VC and similar time-aligned conversion tasks, replacing the usual attention mechanism with the hard alignment of an independent duration model helps the voice conversion model synthesize more robust speech and gives a better user experience.
It should be noted that the scheme of the present application is mainly an improvement of the training procedure of the speech conversion model, not an innovation in the encoding and decoding techniques of the encoder and decoder themselves. The data processing methods in the above procedures, such as feature extraction, encoding, decoding, splicing, gradient back-propagation and convergence, therefore have counterparts in existing voice conversion technology and are not described in detail here. Likewise, the mean squared error (MSE) loss, the fixed learning rate of 1e-4, the Adam optimizer and so on described in the above embodiments are merely example algorithms that may be used to implement the described data processing; other algorithms may also be used.
The foregoing describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous. Moreover, those skilled in the relevant art will recognize that the embodiments can be practiced with various modifications in form and detail without departing from the spirit and scope of the present disclosure, as defined by the appended claims. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (10)

1. A method for training a speech conversion model, comprising:
in a TTS pre-training stage, determining initialization network parameters of a TTS encoder, a VC decoder and a reference encoder by training the TTS encoder, the VC decoder and the reference encoder by using the text and acoustic feature data of a speaker;
in a VC pre-training stage, initializing and fixing network parameters of the VC decoder and the reference encoder, and training the VC encoder by using acoustic characteristics of a speaker to determine the initialized network parameters of the VC encoder; and
in a VC training stage, initializing network parameters of the VC encoder, and training the VC encoder, the VC decoder and the reference encoder by using acoustic characteristics of an original speaker and a target speaker to determine final network parameters of the VC encoder, the VC decoder and the reference encoder which are pre-trained.
2. The method of claim 1, wherein, during the TTS pre-training phase:
randomly initializing network parameters of a TTS encoder, a VC decoder and a reference encoder;
encoding the text sequence of the speaker into TTS encoder output by the TTS encoder;
encoding reference characteristics based on the reference audio of the speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the output of the TTS encoder;
upsampling the spliced TTS encoder output by a duration module to obtain an upsampled TTS encoder output;
inputting an upsampled TTS encoder output to the VC decoder for decoding into a prediction; and
calculating the error between the acoustic features of the speaker and the prediction result, back-propagating the gradient, and updating the network parameters of the VC decoder and the reference encoder until convergence.
3. The method of claim 2, further comprising, during said TTS pre-training phase, the steps of:
outputting a duration prediction value by inputting the text sequence of the speaker to a duration prediction network;
and calculating the error between the duration information and the duration prediction value, back-propagating the gradient, and updating the network parameters of the duration prediction network until convergence.
4. The method of claim 3, wherein, in the VC pre-training phase:
using the network parameters of the VC decoder and the reference encoder trained in the TTS pre-training stage to perform network initialization and fix, and randomly initializing the network parameters of the VC encoder;
encoding the acoustic features of the speaker by the VC encoder, and down-sampling the encoding result with the duration module to obtain the down-sampled VC encoder output;
encoding reference features based on the reference audio of the speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
upsampling the spliced VC encoder output by the duration module to obtain an upsampled VC encoder output;
inputting an upsampled VC encoder output to the VC decoder for decoding into another prediction result; and
calculating the error between the acoustic features of the speaker and the other prediction result, back-propagating the gradient, and updating the network parameters of the VC encoder until convergence.
5. The method of claim 4, wherein, in the VC training phase:
performing network initialization using the network parameters of the VC encoder trained in a VC pre-training phase;
encoding the acoustic features of the original speaker by the VC encoder, and down-sampling the encoding result with the duration module to obtain the down-sampled VC encoder output;
encoding reference characteristics of a reference audio based on a target speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
upsampling the spliced VC encoder output by the duration module to obtain an upsampled VC encoder output;
inputting the upsampled VC encoder output to the VC decoder for decoding into a predicted target result; and
calculating the error between the acoustic features of the target speaker and the predicted target result, back-propagating the gradient, and updating the network parameters of the VC encoder, the VC decoder and the reference encoder until convergence.
6. The method of claim 5, wherein the method further comprises a testing phase in which:
network initialization is carried out on the trained network parameters of the VC decoder, the VC encoder and the reference encoder;
encoding the acoustic features extracted from the test audio of the original speaker by the VC encoder, and down-sampling the encoding result with the duration module to obtain the down-sampled VC encoder output;
encoding reference characteristics of a reference audio based on a target speaker into speaker embedded information through the reference encoder, and splicing the speaker embedded information with the down-sampled VC encoder output;
converting the test text of the original speaker into duration information through a trained duration prediction network;
upsampling the spliced VC encoder output by a duration module using the duration information to obtain an upsampled VC encoder output;
inputting the upsampled VC encoder output to the VC decoder for decoding into a predicted target result; and
inputting the predicted target result into a vocoder to obtain converted target test audio data.
7. The method of claim 6, wherein the speaker can include multiple speakers, and the speaker embedded information can be used to distinguish between different speakers.
8. The method of claim 1, wherein, according to the characteristics of the speaker's utterance, the speaker's speech includes three kinds of information: semantic information, speaker timbre information and prosodic information, wherein the VC encoder output can provide the semantic information, the speaker embedding information from the reference encoder can provide the speaker timbre information, and the duration module can provide the prosodic information.
9. A speech conversion system comprising a TTS encoder, a reference encoder, a VC encoder, and a VC decoder, wherein the speech conversion system is configured to:
in a TTS pre-training stage, determining initial network parameters of the VC decoder and the reference encoder by training the TTS encoder, the VC decoder, and the reference encoder using text data of a speaker;
in a VC pre-training stage, initializing and fixing the network parameters of the VC decoder and the reference encoder, and training the VC encoder using acoustic features of the speaker to determine initial network parameters of the VC encoder; and
in a VC training stage, initializing the network parameters of the VC encoder, and training the VC encoder, the VC decoder, and the reference encoder using acoustic features of an original speaker to determine the final network parameters of the pre-trained VC encoder, reference encoder, and VC decoder.
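A compact way to view the three-stage schedule of claim 9 is as a table of which modules are trainable in each stage; the sketch below encodes that schedule together with a generic freeze/unfreeze helper. Module names, the dictionary layout, and the mention of target-speaker features (used as the training target per claim 5) are illustrative, not claimed implementation details.

```python
import torch.nn as nn

# Trainable modules and training data per stage (illustrative names).
STAGES = {
    "tts_pretrain": {"trainable": ["tts_encoder", "vc_decoder", "reference_encoder"],
                     "data": "text data of the speaker"},
    "vc_pretrain":  {"trainable": ["vc_encoder"],
                     "data": "acoustic features of the speaker"},
    "vc_train":     {"trainable": ["vc_encoder", "vc_decoder", "reference_encoder"],
                     "data": "acoustic features of the original (and target) speaker"},
}

def set_stage(modules: dict[str, nn.Module], stage: str) -> None:
    """Freeze every module, then unfreeze the ones listed for the given stage."""
    trainable = set(STAGES[stage]["trainable"])
    for name, module in modules.items():
        for p in module.parameters():
            p.requires_grad_(name in trainable)
```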
10. The speech conversion system of claim 9, wherein a duration module is configured to up-sample a text-length sequence of context vectors to a sequence with the same length as the acoustic feature sequence, or to down-sample the acoustic feature sequence to the length of the text sequence; and
the reference encoder is configured to extract speaker embedding information from reference audio data of speakers so as to distinguish between different speakers.
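A toy numeric example of the length bookkeeping that claim 10 assigns to the duration module, assuming up-sampling by frame repetition and down-sampling by segment averaging (one common realization; the claim itself does not fix the exact operator):

```python
import torch

durations = torch.tensor([3, 1, 2])          # three text tokens covering 6 acoustic frames in total

# Up-sampling: a text-length sequence of context vectors -> a frame-length sequence.
context = torch.randn(3, 8)                                     # (N=3 tokens, D=8)
frames = torch.repeat_interleave(context, durations, dim=0)     # (6, 8), matches the acoustic length
assert frames.size(0) == int(durations.sum())

# Down-sampling: a frame-length acoustic feature sequence -> a text-length sequence.
acoustic = torch.randn(6, 8)
segments = torch.split(acoustic, durations.tolist(), dim=0)
tokens = torch.stack([s.mean(dim=0) for s in segments])         # back to (3, 8)
assert tokens.size(0) == durations.numel()
```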
CN202011460130.5A 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus Active CN112530403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011460130.5A CN112530403B (en) 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011460130.5A CN112530403B (en) 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus

Publications (2)

Publication Number Publication Date
CN112530403A true CN112530403A (en) 2021-03-19
CN112530403B CN112530403B (en) 2022-08-26

Family

ID=74999231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011460130.5A Active CN112530403B (en) 2020-12-11 2020-12-11 Voice conversion method and system based on semi-parallel corpus

Country Status (1)

Country Link
CN (1) CN112530403B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020027619A1 (en) * 2018-08-02 2020-02-06 네오사피엔스 주식회사 Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN111883149A (en) * 2020-07-30 2020-11-03 四川长虹电器股份有限公司 Voice conversion method and device with emotion and rhythm
CN112037754A (en) * 2020-09-09 2020-12-04 广州华多网络科技有限公司 Method for generating speech synthesis training data and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄国捷 et al. (Huang Guojie et al.): "Enhanced Variational Autoencoder for Non-Parallel Corpus Voice Conversion" (《增强变分自编码器做非平行语料语音转换》), 《信号处理》 (Journal of Signal Processing) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345452A (en) * 2021-04-27 2021-09-03 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model
CN113345452B (en) * 2021-04-27 2024-04-26 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model
CN113450759A (en) * 2021-06-22 2021-09-28 北京百度网讯科技有限公司 Voice generation method, device, electronic equipment and storage medium
CN113436609A (en) * 2021-07-06 2021-09-24 南京硅语智能科技有限公司 Voice conversion model and training method thereof, voice conversion method and system
CN113436609B (en) * 2021-07-06 2023-03-10 南京硅语智能科技有限公司 Voice conversion model, training method thereof, voice conversion method and system
CN113781996A (en) * 2021-08-20 2021-12-10 北京淇瑀信息科技有限公司 Speech synthesis model training method and device and electronic equipment
CN113781996B (en) * 2021-08-20 2023-06-27 北京淇瑀信息科技有限公司 Voice synthesis model training method and device and electronic equipment
CN115910002A (en) * 2023-01-06 2023-04-04 之江实验室 Audio generation method, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112530403B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN112530403B (en) Voice conversion method and system based on semi-parallel corpus
US11587569B2 (en) Generating and using text-to-speech data for speech recognition models
CN109147758B (en) Speaker voice conversion method and device
CN113439301B (en) Method and system for machine learning
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
JP6989951B2 (en) Speech chain device, computer program and DNN speech recognition / synthesis mutual learning method
Renduchintala et al. Multi-modal data augmentation for end-to-end ASR
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
KR20230156121A (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
Luong et al. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN111508470A (en) Training method and device of speech synthesis model
CN112002302B (en) Speech synthesis method and device
CN113077783A (en) Method and device for amplifying Chinese speech corpus, electronic equipment and storage medium
Du et al. VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
Gong et al. ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
Li et al. Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
Heymans et al. Efficient acoustic feature transformation in mismatched environments using a Guided-GAN
CN117636842B (en) Voice synthesis system and method based on prosody emotion migration
CN112802462B (en) Training method of sound conversion model, electronic equipment and storage medium
CN114333900B (en) Method for extracting BNF (BNF) characteristics end to end, network model, training method and training system
WO2023102932A1 (en) Audio conversion method, electronic device, program product, and storage medium
CN116913255A (en) Voice model training method, voice recognition device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant