CN116229932A - Voice cloning method and system based on cross-domain consistency loss - Google Patents
- Publication number: CN116229932A
- Application number: CN202211570251.4A
- Authority
- CN
- China
- Prior art keywords: acoustic model, voice, module, posterior, corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L13/02 — Methods for producing synthetic speech; speech synthesisers
- G10L17/04 — Speaker identification or verification techniques: training, enrolment or model building
- G10L17/18 — Speaker identification or verification techniques: artificial neural networks; connectionist approaches
- G10L25/18 — Speech or voice analysis techniques: the extracted parameters being spectral information of each sub-band
- G10L25/30 — Speech or voice analysis techniques: characterised by the analysis technique using neural networks
- Y02T10/40 — Engine management systems
Abstract
The invention relates to a voice cloning method and system based on cross-domain consistency loss, belongs to the technical field of voice cloning, and solves the problem of the low sound quality of existing voice cloning. Audio of a cloned object is collected and its voiceprint features are extracted; the audio is passed into a pre-trained speech posterior graph model to obtain the cloned object's speech posterior graph. A source acoustic model is trained based on a corpus and the speech posterior graph model, and an adaptive acoustic model is obtained by transfer learning of the source acoustic model using the cloned object's speech posterior graph and a cross-domain consistency loss. A robust vocoder is trained based on the corpus and the voiceprint features of the corpus samples. At synthesis time, the phonemes and prosody of the text to be synthesized are passed into the adaptive acoustic model to obtain the mel spectrum to be synthesized, which is passed into the robust vocoder for speech synthesis; the synthesized audio of the target cloned object is output according to the voiceprint features of the selected target cloned object. High-quality voice cloning is thereby achieved.
Description
Technical Field
The invention relates to the technical field of voice cloning, in particular to a voice cloning method and system based on cross-domain consistency loss.
Background
Voice cloning refers to reproducing the voice of a cloned object using only a small amount of that object's audio. In general, voice cloning techniques can generate, from any input text, target audio that resembles the cloned object's pronunciation. The closer the generated audio is to the cloned object's original voice, and the higher its naturalness and intelligibility, the better the cloning effect. With the rapid development of deep learning, speech synthesis systems based on deep neural networks can achieve results comparable to human speech when supported by a large amount of high-quality corpus data, and voice cloning technology has therefore attracted increasing attention.
Existing voice cloning systems all require paired text-audio data, which limits the application scenarios of voice cloning. When no text corresponding to the speech is available, an Automatic Speech Recognition (ASR) system is typically used to transcribe the audio first. However, speech recognition systems may misrecognize individual characters, which greatly degrades the effectiveness of voice cloning.
In the prior art, the acoustic model is usually built from a large amount of high-quality corpus data. When transfer learning is performed on a small amount of cloned-object audio, the entire acoustic model is adapted, so the number of adaptive parameters is too large, which is unfavorable for commercial application. Moreover, because the loss function used in training does not match the final evaluation score, the decline of the loss function cannot be used to judge whether the acoustic model is fully trained during transfer learning, and the model is prone to under-fitting and over-fitting.
In the prior art, the vocoder also needs to be trained on a large amount of corpus data covering speakers of different ages, genders, timbres, and languages; the more diverse, the better. However, collecting such a diverse corpus is time-consuming and laborious in practice, and the vocoder still cannot fully fit speakers who do not appear in the corpus, which keeps the sound quality synthesized by the voice cloning system relatively low.
Disclosure of Invention
In view of the above analysis, embodiments of the invention aim to provide a voice cloning method and system based on cross-domain consistency loss, to solve the problems of low sound quality, caused by the inability of existing robust vocoders to fully fit the characteristics of speakers absent from the corpus, and of overly complicated voice cloning steps.
In one aspect, the embodiment of the invention provides a voice cloning method based on cross-domain consistency loss, which comprises the following steps:
collecting the audio of a cloned object, acquiring voiceprint characteristics of the cloned object from the audio, and transmitting the audio of the cloned object into a pre-trained voice posterior map model to acquire a cloned object voice posterior map;
training an acoustic model to obtain a source acoustic model based on a corpus and a voice posterior graph model, and transferring the learning source acoustic model according to the cloned object voice posterior graph and cross-domain consistency loss to obtain a self-adaptive acoustic model;
training a vocoder based on the corpus and voiceprint characteristics of the corpus sample to obtain a robust vocoder;
obtaining phonemes and prosody of a text to be synthesized, transmitting the phonemes and prosody into a self-adaptive acoustic model to obtain a Mel frequency spectrum to be synthesized, transmitting the Mel frequency spectrum to be synthesized into a robust vocoder to perform voice synthesis, and outputting synthesized audio of a target clone object according to voiceprint characteristics of the selected target clone object.
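The four steps above can be sketched end to end as follows. This is an illustrative outline only: `extract_voiceprint`, `audio_to_ppg`, `acoustic_model`, and `vocoder` are hypothetical stand-ins for the trained components, and all shapes are placeholders, not the actual models of the invention.

```python
import numpy as np

# Hypothetical stubs standing in for the trained components; a real system
# would load a PPG model, the adaptive acoustic model, and the robust vocoder.
def extract_voiceprint(audio):
    return np.full(4, audio.mean())          # placeholder voiceprint embedding

def audio_to_ppg(audio):
    return np.tile(audio[:10], (2, 1))       # placeholder speech posterior graph

def acoustic_model(phonemes, prosody):
    return np.zeros((len(phonemes), 80))     # placeholder mel spectrum (frames x 80)

def vocoder(mel, voiceprint):
    return np.zeros(mel.shape[0] * 256)      # placeholder waveform (256 samples/frame)

def clone_voice(target_audio, phonemes, prosody):
    """End-to-end flow of the claimed method: voiceprint and posterior graph
    from the target audio, mel spectrum from the adaptive acoustic model,
    waveform from the robust vocoder conditioned on the voiceprint."""
    voiceprint = extract_voiceprint(target_audio)
    _ppg = audio_to_ppg(target_audio)        # used for transfer learning, not synthesis
    mel = acoustic_model(phonemes, prosody)
    return vocoder(mel, voiceprint)
```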
Based on a further improvement of the above method, the acoustic model includes a phoneme and prosody encoder module, a speech posterior graph encoder module, a duration prediction module, a pitch prediction module, a volume prediction module, and a decoder module. The phoneme and prosody encoder module, the speech posterior graph encoder module, and the decoder module each consist of several Transformer feed-forward layers; the duration, pitch, and volume prediction modules each consist of several convolutional layers.
Based on a further improvement of the above method, conditional layer normalization is used in the feed-forward layer of each Transformer in the decoder module. The conditional layer normalization comprises a speaker embedding layer and two linear layers, and is defined as follows:

CLN(x) = γ × (x − μ) / σ + β

where x is the vector before normalization, μ is the mean of x, σ is the standard deviation of x, and γ and β are the learnable scale and the learnable offset in the conditional layer normalization, produced from the speaker embedding by the two linear layers.
Based on the further improvement of the method, training an acoustic model to obtain a source acoustic model based on a corpus and a voice posterior graph model comprises the following steps:
transmitting the corpus sample into a pre-trained speech posterior graph model to obtain the speech posterior graph of the corpus sample;
the phonetic posterior diagram of the corpus sample is transmitted to a phonetic posterior diagram encoder module, and the phonemes and prosody of the corpus sample are transmitted to a phonemes and prosody encoder module;
passing the output of the phoneme and prosody encoder module, together with the speaker ID corresponding to the corpus sample, sequentially into the duration prediction module, the pitch prediction module, and the volume prediction module to obtain predicted values of the pronunciation duration, pitch, and volume of each phoneme of the corpus sample; these predictions and the speaker ID are then input into the decoder module to obtain the predicted mel spectrum of the corpus sample;
training to obtain a source acoustic model according to errors between the outputs of the phonetic posterior graph encoder module and the phoneme and prosody encoder module, errors between predicted values and true values of the duration, the pronunciation pitch and the pronunciation volume of each phoneme of the corpus sample, and errors between the predicted mel frequency spectrum and the true mel frequency spectrum.
Based on the further improvement of the method, according to the clone object voice posterior diagram and the cross-domain consistency loss transfer learning source acoustic model, the self-adaptive acoustic model is obtained, which comprises the following steps:
constructing a target acoustic model from the trained source acoustic model; in the target acoustic model, all parameters are frozen except the conditional layer normalization parameters in the feed-forward layer of each Transformer in the decoder module;
and passing the cloned object's speech posterior graph as learning samples into the trained source acoustic model and the target acoustic model, computing the distance between the feature spaces of the decoder modules of the two models as the cross-domain consistency loss, and training the target acoustic model with a loss function combining this consistency loss and the error between the predicted mel spectrum output by the target acoustic model and the real mel spectrum of the learning sample, to obtain the adaptive acoustic model.
Based on further improvement of the method, the cloning object voice posterior graph is taken as a learning sample to be transmitted into a source acoustic model and a target acoustic model, and the distance between the characteristic spaces of decoder modules in the source acoustic model and the target acoustic model is calculated, which comprises the following steps:
in each layer of the decoder module, each learning sample is taken from the current batch in turn; the cosine similarities between the current learning sample and the other learning samples of the same batch are computed in the feature space of the source acoustic model and in that of the target acoustic model; the two sets of similarities are converted into probability distributions through a Softmax layer; and the distance between the two probability distributions of the current learning sample at the current layer is obtained via the KL divergence;

the distances between the two probability distributions over all layers and all learning samples in the current batch are summed and averaged, giving the distance between the feature spaces of the decoder modules of the source and target acoustic models for the current batch.
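A minimal sketch of this batch-wise distance for a single decoder layer, assuming the per-layer features of both models are available as NumPy arrays of shape (batch, dim); the function names are illustrative, not from the patent:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_domain_consistency_loss(feat_src, feat_tgt):
    """For each sample in the batch: cosine similarities to the other samples
    in the source and target feature spaces, Softmax into two probability
    distributions, KL divergence between them; averaged over the batch."""
    def cos_sims(feats, i):
        a = feats[i]
        others = np.delete(feats, i, axis=0)
        return (others @ a) / (np.linalg.norm(others, axis=1) * np.linalg.norm(a) + 1e-8)

    batch_size = feat_src.shape[0]
    total = 0.0
    for i in range(batch_size):
        p = softmax(cos_sims(feat_src, i))   # distribution in the source feature space
        q = softmax(cos_sims(feat_tgt, i))   # distribution in the target feature space
        total += np.sum(p * np.log((p + 1e-8) / (q + 1e-8)))  # KL(p || q)
    return total / batch_size
```

In the full method this quantity would be computed per decoder layer and averaged over layers as well; identical source and target features give a loss of zero.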
Based on a further improvement of the method, the vocoder is based on the HiFi-GAN model and comprises a generator module and a discriminator module. The generator module comprises transposed convolution layers with different receptive fields, and a one-dimensional convolution operation is added at each layer to map the voiceprint features into a voiceprint feature vector of the same dimension as the mel-spectrum hidden features of the current layer.
Based on the further improvement of the method, based on the voice print characteristics of the corpus and the corpus sample, training the vocoder to obtain a robust vocoder comprises the following steps:
inputting the mel spectrum of the corpus sample into the vocoder, adding to each layer of the generator's mel-spectrum hidden features the voiceprint feature vector obtained by mapping the corpus sample's voiceprint features at that layer, and training with a loss function comprising the generator error, the discriminator error, the mel-spectrum error, and the feature matching error of each discriminator, to obtain the robust vocoder.
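The per-layer voiceprint conditioning can be sketched as follows. As a simplifying assumption, a plain linear projection stands in for the one-dimensional convolution that maps the voiceprint to the layer's hidden dimension; names and shapes are illustrative:

```python
import numpy as np

def add_voiceprint(hidden, voiceprint, w_map):
    """Map the speaker voiceprint to the hidden dimension of the current
    generator layer and add it to the mel-spectrum hidden features.
    hidden: (frames, d_layer); voiceprint: (s,); w_map: (s, d_layer).
    A linear projection is used here in place of the 1-D convolution."""
    vp_vec = voiceprint @ w_map   # voiceprint feature vector at this layer's dimension
    return hidden + vp_vec        # broadcast-added to every frame
```

The same operation is applied at every generator layer, each with its own mapping weights.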
Based on a further improvement of the method, when the mel spectrum to be synthesized is passed into the robust vocoder for speech synthesis and the synthesized audio of the target cloned object is output according to the selected target's voiceprint features, the voiceprint feature vector obtained by mapping the selected target's voiceprint features at the current layer is added to each layer of the generator's mel-spectrum hidden features, and the generator of the robust vocoder finally outputs the synthesized audio of the target cloned object.
In another aspect, an embodiment of the present invention provides a voice cloning system based on cross-domain consistency loss, including:
the audio processing module is used for acquiring the audio of the cloned object, acquiring voiceprint characteristics of the cloned object from the audio, transmitting the audio of the cloned object into the pre-trained voice posterior map model, and acquiring a voice posterior map of the cloned object;
the self-adaptive acoustic model training module is used for training the acoustic model to obtain a source acoustic model based on a corpus and a voice posterior graph model, and acquiring the self-adaptive acoustic model by transferring the learning source acoustic model according to the cloned object voice posterior graph and the cross-domain consistency loss;
the robust vocoder training module is used for training the vocoder to obtain a robust vocoder based on the corpus and the voiceprint characteristics of the corpus sample;
the voice audio synthesis module is used for acquiring phonemes and prosody of the text to be synthesized, transmitting the phonemes and prosody into the adaptive acoustic model to obtain a Mel frequency spectrum to be synthesized, transmitting the Mel frequency spectrum to be synthesized into the robust vocoder to perform voice synthesis, and outputting the synthesized audio of the target cloned object according to the voiceprint characteristics of the selected target cloned object.
Compared with the prior art, the invention has at least one of the following beneficial effects:
1. To address the existing voice cloning systems' need for paired text-audio data, the cloning process first obtains the cloned object's speech posterior graph using a pre-trained speech posterior graph model, then uses the speech posterior graph and the speech posterior graph encoder module to transfer-learn only the conditional layer normalization of the decoder in the source acoustic model. This keeps the number of adaptive parameters small and training fast, and the introduced cross-domain consistency loss makes the transfer learning more stable.
2. The adaptive vocoder is constructed by introducing voiceprint information; high-quality results can be obtained without transfer learning, efficiency is greatly improved, the voices of speakers not appearing in the corpus can be fitted, and commercial application is facilitated.
In the invention, the technical schemes can be mutually combined to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a flow chart of a method for voice cloning based on cross-domain consistency loss in embodiment 1 of the present invention;
FIG. 2 is a diagram showing an example of text-to-audio labeling in embodiment 1 of the present invention;
FIG. 3 is a schematic view of the acoustic model structure in embodiment 1 of the present invention;
FIG. 4 is a diagram of a normalization framework for a conditional layer in a decoder according to embodiment 1 of the present invention;
fig. 5 is a schematic diagram of a vocoder according to embodiment 1 of the present invention.
Detailed Description
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.
Example 1
The invention discloses a voice cloning method based on cross-domain consistency loss, which is shown in fig. 1 and comprises the following steps:
s11: collecting the audio of a cloned object, acquiring voiceprint characteristics of the cloned object from the audio, transmitting the audio of the cloned object into a pre-trained voice posterior map model, and acquiring a voice posterior map of the cloned object.
In this embodiment, only a small amount of the cloned object's audio needs to be collected, ensuring that it contains no noise or reverberation. Illustratively, 20 audio clips are obtained from the cloned object's public lectures, podcasts, or recordings. Preferably, the signal-to-noise ratio of each clip is not lower than 35 dB, and each clip lasts 5 to 15 seconds.
Further, the ECAPA-TDNN model is used to obtain the voiceprint features of each audio clip of the cloned object, and the mean of these per-clip voiceprint features is then computed and used as the voiceprint feature of the cloned object.
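The averaging step reduces to a mean over the per-clip embeddings. A minimal sketch, assuming the embeddings (e.g. from an ECAPA-TDNN model) are already extracted as fixed-length vectors:

```python
import numpy as np

def average_voiceprint(embeddings):
    """Average per-utterance voiceprint embeddings into a single speaker
    voiceprint. `embeddings` is array-like of shape (n_clips, dim)."""
    e = np.asarray(embeddings, dtype=float)
    return e.mean(axis=0)
```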
The speech posterior graph model adopts a phonetic posteriorgram (PPG) model; after the model is pre-trained on a large amount of audio from the corpus, the cloned object's audio is passed in to obtain the cloned object's speech posterior graph.
Compared with the prior art, only a small amount of the cloned object's audio needs to be obtained; no text needs to be extracted from the audio and no manual text labeling is required. This avoids the poor cloning results caused by inaccurate text recognition and improves data processing efficiency.
S12: training an acoustic model to obtain a source acoustic model based on the corpus and the speech posterior graph model, and transferring the learning source acoustic model according to the cloned object speech posterior graph and the cross-domain consistency loss to obtain the self-adaptive acoustic model.
It should be noted that the corpus is a database containing a large amount of paired <text, audio> data, the corresponding speaker IDs, and audio annotation information. Each text entry comprises three parts separated by '|': a text number, the text content, and the pinyin corresponding to the text. The text content carries three-level prosody marks, and the pinyin carries five-level tone marks.
Specifically, the three-level prosody marks are #1, #2, and #3, where #1 denotes a prosodic word boundary, #2 denotes a prosodic phrase boundary, and #3 denotes an intonational phrase boundary. The five-level pinyin tone marks use the Arabic numerals 1, 2, 3, 4, and 5, where 1 to 4 denote the first to fourth tones of Mandarin pinyin and 5 denotes the neutral tone.
Illustratively, a text in the corpus is in the form of:
000001|Hello #2 we #1 will use #1 SF Express #1 to mail it #3 for you #1.|nin2 hao3 wo3 men5 jiang1 yong4 shun4 feng1 kuai4 di4 wei4 nin2 jin4 xing2 you2 ji4
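The three-field format can be parsed mechanically. A minimal sketch, using a simplified hypothetical entry rather than the corpus line above; the field separator and mark conventions follow the description:

```python
import re

def parse_corpus_line(line):
    """Parse one corpus text entry of the form
    'id|text with #1/#2/#3 prosody marks|pinyin with tone digits 1-5'."""
    utt_id, text, pinyin = line.split("|")
    prosody_marks = re.findall(r"#[123]", text)      # three-level prosody marks
    syllables = re.findall(r"[a-z]+[1-5]", pinyin)   # pinyin syllables with tones
    return utt_id, prosody_marks, syllables
```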
According to the text, the speaker's audio is recorded in a quiet environment. Preferably, the audio is free of noise and reverberation, the signal-to-noise ratio of each recording is not lower than 35 dB, the recordings use the uncompressed WAV format, and the audio sampling rate is 41 kHz.
Finally, the text and audio are labeled with the audio annotation tool Praat: the text corresponding to each audio segment is annotated at both the pinyin level and the phoneme level. The pinyin-level labels correspond to the three-level prosody marks in the text, and the phoneme-level labels are obtained by mapping them to the phoneme level through the pinyin-to-phoneme mapping. From the phoneme-level labels, the phonemes and phoneme-level prosody can be obtained.
Illustratively, FIG. 2 shows the annotation of the text "Hello #2 we #1 will use #1 SF Express #1 to mail it #3 for you #1" and the corresponding audio. In FIG. 2, line 1 is the time-domain waveform of the audio, line 2 is its frequency-domain representation, line 3 is the pinyin-level annotation, and line 4 is the phoneme-level annotation.
It should be noted that the pinyin-to-phoneme mapping is constructed in advance, as follows: a phone set is first built according to the pronunciation characteristics of the speech, each phone in the set being a smallest unit, i.e., the pronunciations of the phones are mutually distinct; the correspondence between pinyin and phonemes is then built using the pronunciation characteristics of pinyin.
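Such a mapping typically decomposes each pinyin syllable into an initial and a final from the phone set. A minimal sketch with a hypothetical mapping fragment (the actual phone set of the invention is not specified):

```python
# Hypothetical fragment of a pinyin-to-phoneme mapping: each syllable maps to
# its initial and final, both drawn from the pre-constructed phone set.
PINYIN_TO_PHONEMES = {
    "nin": ["n", "in"],
    "hao": ["h", "ao"],
    "jiang": ["j", "iang"],
}

def pinyin_to_phonemes(syllables):
    """Expand a sequence of pinyin syllables into the phoneme sequence."""
    return [ph for s in syllables for ph in PINYIN_TO_PHONEMES[s]]
```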
As shown in FIG. 3, the acoustic model converts linguistic information (phonemes and phoneme-level prosody) into acoustic information (the mel spectrum). It includes a phoneme and prosody encoder module, a speech posterior graph encoder module, a duration prediction module, a pitch prediction module, a volume prediction module, and a decoder module. The phoneme and prosody encoder module, the speech posterior graph encoder module, and the decoder module each consist of several Transformer feed-forward layers, while the duration, pitch, and volume prediction modules each consist of several convolutional layers.
The phoneme and prosody encoder module takes phonemes and prosody as input, the speech posterior graph encoder module takes speech posterior graphs as input, and both modules output hidden vectors carrying context information.
Preferably, the phoneme and prosody encoder module and the speech posterior graph encoder module each consist of 4 Transformer feed-forward layers, the decoder module consists of 6 Transformer feed-forward layers, and the duration, pitch, and volume prediction modules each consist of 2 convolutional layers.
Further, conditional layer normalization is used in the feed-forward layer of each Transformer in the decoder module, as shown in FIG. 4; it comprises a speaker embedding layer and two linear layers, and is defined as follows:

CLN(x) = γ × (x − μ) / σ + β

where x is the vector before normalization, μ is the mean of x, σ is the standard deviation of x, and γ and β are the learnable scale and the learnable offset produced from the speaker embedding by the two linear layers.
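A minimal sketch of this conditional layer normalization, assuming the two linear layers are represented by plain weight matrices applied to the speaker embedding (shapes are illustrative):

```python
import numpy as np

def conditional_layer_norm(x, speaker_emb, w_gamma, w_beta, eps=1e-5):
    """Normalize x by its own mean and standard deviation, then scale and
    shift with gamma and beta produced from the speaker embedding through
    two linear layers. x: (frames, d); speaker_emb: (s,);
    w_gamma, w_beta: (s, d)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    gamma = speaker_emb @ w_gamma   # learnable scale, conditioned on the speaker
    beta = speaker_emb @ w_beta     # learnable offset, conditioned on the speaker
    return gamma * (x - mu) / (sigma + eps) + beta
```

Because γ and β depend on the speaker embedding, adapting only these parameters is what allows the transfer learning described later to touch a small fraction of the model.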
Training an acoustic model to obtain a source acoustic model based on a corpus and a speech posterior graph model, comprising:
(1) and transmitting the corpus sample in the corpus into a pre-trained voice posterior graph model to obtain a voice posterior graph of the corpus sample.
(2) And transmitting the phonetic posterior diagram of the corpus sample to a phonetic posterior diagram encoder module, and transmitting the phonemes and prosody of the corpus sample to a phonemic and prosody encoder module.
(3) The output of the phoneme and prosody encoder module, together with the speaker ID corresponding to the corpus sample, is passed sequentially into the duration prediction module, the pitch prediction module, and the volume prediction module to obtain predicted values of the pronunciation duration, pitch, and volume of each phoneme of the corpus sample; these predictions and the speaker ID are then input into the decoder module to obtain the predicted mel spectrum of the corpus sample.
When the output of the phoneme and prosody encoder module is not null, it is passed, together with the speaker ID corresponding to the corpus sample, sequentially into the duration prediction module, the pitch prediction module, and the volume prediction module; otherwise, the output of the speech posterior graph encoder module is passed in instead.
(4) Training to obtain the source acoustic model according to the errors between the outputs of the speech posterior graph encoder module and the phoneme and prosody encoder module, the errors between the predicted and true values of the pronunciation duration, pronunciation pitch and pronunciation volume of each phoneme of the corpus sample, and the error between the predicted mel spectrum and the real mel spectrum.
It should be noted that adding the error between the outputs of the speech posterior graph encoder module and the phoneme and prosody encoder module to the loss function of the source acoustic model constrains the two encoders to a common feature space, in preparation for the subsequent voice cloning.
Specifically, the optimization objective of the source acoustic model is expressed by the following formula:

A_s = argmin_A ( λ_mel·L_mel + λ_d·L_d + λ_p·L_p + λ_e·L_e + λ_ppg·L_ppg )

wherein A is the source acoustic model during training and A_s represents the trained source acoustic model; L_mel is the mean absolute error (MAE) between the predicted mel spectrum and the real mel spectrum; L_d, L_p and L_e are the mean square errors (MSE) between the predicted and true values of the phoneme pronunciation duration, pronunciation pitch and pronunciation volume, respectively; L_ppg is the mean square error between the outputs of the speech posterior graph encoder module and the phoneme and prosody encoder module; λ_mel, λ_d, λ_p, λ_e and λ_ppg are the coefficients of the respective error terms. Illustratively, the error coefficients are each set to 1.
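As a sketch, the five error terms and their weighted combination might be computed as follows. The function names are illustrative, not from the patent:

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error, used for the mel spectrum term."""
    return float(np.mean(np.abs(pred - true)))

def mse(pred, true):
    """Mean square error, used for the duration, pitch, volume and PPG terms."""
    return float(np.mean((pred - true) ** 2))

def source_model_loss(l_mel, l_d, l_p, l_e, l_ppg,
                      lam_mel=1.0, lam_d=1.0, lam_p=1.0, lam_e=1.0, lam_ppg=1.0):
    """Weighted sum of the five error terms; all coefficients default to 1,
    matching the illustrative setting in the text."""
    return (lam_mel * l_mel + lam_d * l_d + lam_p * l_p
            + lam_e * l_e + lam_ppg * l_ppg)
```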
It should be noted that, in this embodiment, the large number of high-quality corpus samples in the corpus are defined as the source domain, and the small amount of audio of the cloned object is defined as the target domain; the process of transfer learning converts the feature space of the acoustic model from the source domain to the target domain. However, because the target domain contains extremely little data, the loss function used during training does not match the final evaluation score: a smaller model loss does not necessarily mean smoother or more natural speech. Therefore, when building the adaptive acoustic model by transfer learning, whether training is complete cannot be judged from the decline of the loss function, and the model is prone to over-fitting or under-fitting. This embodiment introduces a cross-domain consistency loss, which acts during transfer learning to restrain the source acoustic model from over-fitting the target-domain data, thereby achieving stable training.
Specifically, the adaptive acoustic model is obtained by transfer-learning the source acoustic model according to the clone object speech posterior graph and the cross-domain consistency loss: starting from the trained source acoustic model, training continues on the clone object speech posterior graph, with the consistency loss between the source domain and the target domain considered during training. This comprises the following steps:
(1) A target acoustic model is constructed from the trained source acoustic model. In the target acoustic model, all parameters are frozen except the conditional layer normalization parameters in the feed-forward layer of each Transformer in the decoder module. That is, the model parameters of the trained source acoustic model serve as the initial model parameters of the target acoustic model, and only the conditional layer normalization parameters in the decoder module are updated during transfer learning of the target acoustic model.
(2) The clone object speech posterior graphs are used as learning samples and fed into the trained source acoustic model and the target acoustic model. The distance between the feature spaces of the decoder modules of the source acoustic model and the target acoustic model is calculated and taken as the cross-domain consistency loss; this loss, together with the loss between the predicted mel spectrum output by the target acoustic model and the real mel spectrum of the learning sample, forms the loss function used to train the target acoustic model and obtain the adaptive acoustic model.
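The parameter freezing in step (1) can be sketched as follows, assuming PyTorch-style parameters with a `requires_grad` flag and a hypothetical naming convention in which the decoder's conditional layer normalization parameters contain `decoder` and `cond_layer_norm` in their names:

```python
def freeze_for_transfer(named_params):
    """Freeze all parameters of the target acoustic model except the
    conditional layer normalization parameters in the decoder module.
    `named_params` is an iterable of (name, parameter) pairs."""
    for name, param in named_params:
        # only decoder CLN parameters stay trainable
        param.requires_grad = ("decoder" in name) and ("cond_layer_norm" in name)
```

After copying the trained source model's weights into the target model, calling `freeze_for_transfer(model.named_parameters())` would leave only the conditional layer normalization parameters trainable.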
In the transfer learning, only the clone object speech posterior graphs are fed in batches into the speech posterior graph encoder modules of the source acoustic model and the target acoustic model. The speech posterior graph encoder module outputs a hidden vector carrying context information, which is then input, together with the speaker ID, into the duration prediction module, the pitch prediction module and the volume prediction module to obtain the predicted values of the pronunciation duration, pronunciation pitch and pronunciation volume of each phoneme. The predicted values and the speaker ID are then passed into each layer of the decoder module, where the distance between the feature spaces of the decoder modules of the source acoustic model and the target acoustic model is calculated.
Specifically, the method comprises the following steps:
in each layer of the decoder module, each learning sample is sequentially taken out from the current batch, cosine similarity of the current learning sample and other learning samples in the same batch in a characteristic space of a source acoustic model and a characteristic space of a target acoustic model is calculated, probability distribution of the cosine similarity in the source acoustic model and the target acoustic model is calculated through a Softmax network layer, and distance between two probability distributions of the current learning sample in the current layer is obtained through KL divergence;
summarizing and averaging the distances between the two probability distributions of each layer over all the learning samples in the current batch, as the distance between the feature spaces of the decoder modules of the source acoustic model and the target acoustic model for the current batch, i.e., the cross-domain consistency loss. The formula is as follows:

L_cd(A_s, A_{s→t}) = E_{m_i ∼ P_m(m)} [ Σ_l D_KL( p_{A_{s→t}}^{l}(m_i) ‖ p_{A_s}^{l}(m_i) ) ]    (3)

wherein A_{s→t} represents the target acoustic model, A_s represents the trained source acoustic model, D_KL represents the KL divergence distance, l indexes the l-th conditional layer normalization in the acoustic model, m_i represents learning sample i in the current batch, P_m(m) represents the feature space, and E denotes the mathematical expectation, which performs the averaging in equation (3). p_{A_s}^{l}(m_i) represents the normalized probability distribution of learning sample i at the l-th conditional layer of the source acoustic model, and p_{A_{s→t}}^{l}(m_i) represents the normalized probability distribution of learning sample i at the l-th conditional layer of the target acoustic model, defined as follows:

p_{A_s}^{l}(m_i) = Softmax_j( sim(A_s^{l}(m_i), A_s^{l}(m_j)) ),  p_{A_{s→t}}^{l}(m_i) = Softmax_j( sim(A_{s→t}^{l}(m_i), A_{s→t}^{l}(m_j)) )
where sim(·) represents the cosine similarity function and j indexes a learning sample in the same batch as i. The Softmax network layer does not belong to the acoustic model; a separate Softmax network layer is used here to compute the probability distributions.
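The computation above can be sketched in NumPy as follows. The function name `cdc_loss` is illustrative, and the KL direction (target relative to source) is an assumption the text does not pin down:

```python
import numpy as np

def _softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def _cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cdc_loss(src_feats, tgt_feats):
    """Cross-domain consistency loss.
    src_feats / tgt_feats: one [batch, dim] array per conditional layer,
    holding the decoder features of the same batch of learning samples
    in the source and target acoustic models."""
    total, count = 0.0, 0
    for Fs, Ft in zip(src_feats, tgt_feats):
        n = Fs.shape[0]
        for i in range(n):
            others = [j for j in range(n) if j != i]
            # similarity of sample i to every other sample in the batch,
            # turned into a probability distribution by Softmax
            p = _softmax(np.array([_cosine(Fs[i], Fs[j]) for j in others]))
            q = _softmax(np.array([_cosine(Ft[i], Ft[j]) for j in others]))
            total += float(np.sum(q * np.log((q + 1e-12) / (p + 1e-12))))  # KL(q || p)
            count += 1
    return total / count  # average over all samples and layers
```

When the two decoders induce the same pairwise similarity structure, the loss is zero; it grows as the target model's feature space drifts away from the source's.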
The overall loss function for training the target acoustic model is L = λ_mel·L_mel + λ_cd·L_cd, wherein L_mel denotes the mean absolute error between the predicted mel spectrum output by the target acoustic model and the real mel spectrum of the learning sample, λ_cd is the weight coefficient of the cross-domain consistency loss, and λ_mel is the weight coefficient of the mel spectrum loss. Preferably, λ_cd is 200 and λ_mel is 45.
After the transfer learning is finished, the trained target acoustic model is used as a self-adaptive acoustic model for actual speech synthesis.
Compared with the prior art, this embodiment transfer-learns the source acoustic model using only a small amount of audio of the cloned object and updates only the conditional layer normalization of the decoder in the source acoustic model, so the number of adaptive parameters is small and training is fast; the introduced cross-domain consistency loss keeps the source acoustic model from over-fitting the target data, making the transfer learning more stable.
S13: and training the vocoder based on the corpus and voiceprint characteristics of the corpus samples to obtain a robust vocoder.
It should be noted that, as shown in fig. 5, the vocoder is based on a HiFi-GAN model, and includes a generator module and a discriminator module, where the generator module includes transposed convolution layers of different receptive fields, and adds a one-dimensional convolution operation to each layer. The discriminator module consists of a multi-period discriminator and a multi-scale discriminator. The transpose convolution layer is used for upsampling the mel frequency spectrum so as to obtain an audio waveform; the one-dimensional convolution operation is used for mapping the voiceprint feature into a voiceprint feature vector with the same dimension as the Mel spectrum hidden variable feature of the current layer; the discriminator module is used to determine whether the audio is synthesized or authentic, thereby causing the generator to generate more realistic audio.
Illustratively, the mel spectrum input to the generator module has dimensions [T, 80] and the initial voiceprint feature has dimensions [1, 256]. The one-dimensional convolution in the first layer maps the voiceprint feature to dimensions [1, 80]; the voiceprint feature is then repeated T times to obtain a voiceprint feature with dimensions [T, 80]; finally, the mel spectrum hidden variable feature of the first layer and the mapped voiceprint feature are added element-wise to obtain the output of the first layer, which is passed to the second layer of the generator. The above steps are repeated so that voiceprint features are added to each layer of the vocoder's generator.
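The per-layer injection described above might look like this. It is a NumPy sketch in which `W` stands in for the kernel-size-1 one-dimensional convolution, which reduces to a linear map:

```python
import numpy as np

def inject_voiceprint(mel_hidden, voiceprint, W):
    """Add a speaker voiceprint to one generator layer's hidden features.
    mel_hidden: [T, C] mel spectrum hidden variable features of this layer;
    voiceprint: [256] initial voiceprint feature;
    W: [256, C] weights of this layer's 1-D convolution (kernel size 1)."""
    vp = voiceprint @ W                           # map to [C], same dim as the layer
    vp = np.tile(vp, (mel_hidden.shape[0], 1))    # repeat T times -> [T, C]
    return mel_hidden + vp                        # element-wise add -> layer output
```

Each generator layer would call this with its own convolution weights, so the speaker identity is re-injected at every upsampling stage.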
Based on the corpus, the voiceprint features of the speakers corresponding to the corpus samples are obtained using an ECAPA-TDNN model. Training the vocoder based on the corpus and the voiceprint features of the corpus samples to obtain a robust vocoder comprises:
inputting the mel spectrum of the corpus sample into the vocoder, adding the voiceprint feature vector, obtained by mapping the voiceprint feature of the corpus sample at the current layer, into each layer of mel spectrum hidden variable features of the generator module in turn, and training with the error of the generator module, the error of the discriminator module, the mel spectrum error and the feature matching error of each discriminator module as the loss functions, to obtain the robust vocoder.
Specifically, the loss function of the generator module is:

L_G = E_q[ (D(G(q)) − 1)² ]

where G represents the generator module, D represents the discriminator module, and q represents the mel spectrum.
The loss function of the discriminator module is:

L_D = E_{(p,q)}[ (D(p) − 1)² + (D(G(q)))² ]

where p represents real audio.
For stability of vocoder training, this embodiment adds a mel spectrum loss function and a feature matching loss function of the discriminator modules, as follows:

L_mel = E_{(p,q)}[ ‖φ(p) − φ(G(q))‖₁ ],  L_FM = E_{(p,q)}[ Σ_{t=1}^{T} (1/N_t) ‖D_t(p) − D_t(G(q))‖₁ ]

wherein φ(·) is the function that computes the mel spectrum, T represents the number of discriminator modules, and D_t and N_t denote the features of the t-th discriminator module and the number of those features, respectively.
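Assuming the standard HiFi-GAN least-squares objectives match the formulas above, the four loss terms can be sketched as:

```python
import numpy as np

def generator_adv_loss(d_fake):
    """Least-squares adversarial loss for the generator: push D(G(q)) toward 1."""
    return float(np.mean((d_fake - 1.0) ** 2))

def discriminator_adv_loss(d_real, d_fake):
    """Least-squares loss for the discriminator: D(p) toward 1, D(G(q)) toward 0."""
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def mel_loss(mel_real, mel_fake):
    """L1 distance between the mel spectra of real and generated audio."""
    return float(np.mean(np.abs(mel_real - mel_fake)))

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between per-layer discriminator features, normalized by
    the number of features N_t of each discriminator t and averaged over
    the T discriminators."""
    loss = 0.0
    for fr_layers, ff_layers in zip(feats_real, feats_fake):
        per_disc = sum(np.mean(np.abs(fr - ff)) for fr, ff in zip(fr_layers, ff_layers))
        loss += per_disc / len(fr_layers)
    return float(loss / len(feats_real))
```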
Compared with the prior art, this embodiment introduces the speaker's voiceprint feature when training the vocoder, so a high-quality result can be obtained without any transfer learning, greatly improving efficiency.
S14: obtaining phonemes and prosody of a text to be synthesized, transmitting the phonemes and prosody into a self-adaptive acoustic model to obtain a Mel frequency spectrum to be synthesized, transmitting the Mel frequency spectrum to be synthesized into a robust vocoder to perform voice synthesis, and outputting synthesized audio of a target clone object according to voiceprint characteristics of the selected target clone object.
It should be noted that a prosody prediction model is constructed using two layers of long short-term memory (LSTM) neural networks; after training, the prosody of the text to be synthesized is obtained by feeding the text into this model.
Converting the text to be synthesized into pinyin, and acquiring the phonemes of the text to be synthesized according to the pre-constructed mapping relation between the pinyin and the phonemes.
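The pinyin-to-phoneme lookup can be as simple as a table. The entries below are a hypothetical fragment, not the patent's actual mapping:

```python
# Hypothetical fragment of the pre-constructed pinyin-to-phoneme table
# (initial + tonal final, a common convention in Mandarin TTS front ends).
PINYIN_TO_PHONEMES = {
    "ni3": ["n", "i3"],
    "hao3": ["h", "ao3"],
}

def pinyin_to_phonemes(syllables):
    """Expand a sequence of pinyin syllables into the phoneme sequence."""
    phonemes = []
    for syl in syllables:
        phonemes.extend(PINYIN_TO_PHONEMES[syl])
    return phonemes
```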
The phonemes and prosody of the text to be synthesized are fed into the adaptive acoustic model to obtain the mel spectrum of the selected target clone object; this mel spectrum to be synthesized is then fed into the robust vocoder for speech synthesis, with the voiceprint feature vector of the selected target clone object, mapped at the current layer, added into each layer of mel spectrum hidden variable features of the generator in the robust vocoder; finally, the generator in the robust vocoder outputs the synthesized audio of the target clone object.
Compared with the prior art, the voice cloning method based on cross-domain consistency loss addresses the limitation that existing voice cloning systems require paired text and audio data. In the cloning process, a pre-trained speech posterior graph model is first used to obtain the clone object speech posterior graph; the speech posterior graph and the speech posterior graph encoder module are then used to transfer-learn the conditional layer normalization of the decoder in the source acoustic model, so the number of adaptive parameters is small and training is fast, and the introduced cross-domain consistency loss makes the transfer learning more stable. During actual speech synthesis, the voice of a speaker not present in the corpus can be fitted simply by switching the speaker voiceprint feature to the voiceprint feature of the cloned object, which greatly improves speech synthesis efficiency and benefits commercial application.
Example 2
In another embodiment of the present invention, a voice cloning system based on cross-domain consistency loss is disclosed, so as to implement the voice cloning method based on cross-domain consistency loss in embodiment 1. The specific implementation of each module is described with reference to the corresponding description in embodiment 1. The system comprises:
the audio processing module is used for acquiring the audio of the cloned object, acquiring voiceprint characteristics of the cloned object from the audio, transmitting the audio of the cloned object into the pre-trained voice posterior map model, and acquiring a voice posterior map of the cloned object;
the self-adaptive acoustic model training module is used for training the acoustic model to obtain a source acoustic model based on a corpus and a voice posterior graph model, and acquiring the self-adaptive acoustic model by transferring the learning source acoustic model according to the cloned object voice posterior graph and the cross-domain consistency loss;
the robust vocoder training module is used for training the vocoder to obtain a robust vocoder based on the corpus and the voiceprint characteristics of the corpus sample;
the voice audio synthesis module is used for acquiring phonemes and prosody of the text to be synthesized, transmitting the phonemes and prosody into the adaptive acoustic model to obtain a Mel frequency spectrum to be synthesized, transmitting the Mel frequency spectrum to be synthesized into the robust vocoder to perform voice synthesis, and outputting the synthesized audio of the target cloned object according to the voiceprint characteristics of the selected target cloned object.
Preferably, the voice cloning system is deployed as an API service: the text to be synthesized is submitted by calling the API interface, and synthesized audio similar to the cloned object is returned. Model training and the API interface are implemented in Python.
The quantitative evaluation indexes of the voice cloning system are the Mean Opinion Score (MOS) and the Similarity Mean Opinion Score (SMOS). MOS evaluates the quality and fluency of the synthesized speech on a scale of 1-5; a higher score indicates smoother, more natural synthesized speech. SMOS mainly evaluates the similarity between the cloned speech and the original speech, also on a scale of 1-5; a higher score indicates that the cloned speech is more similar to the original. The voice cloning system constructed according to this embodiment achieves a MOS of 4.17 and a SMOS of 4.02.
Since the relevant parts of the voice cloning system based on cross-domain consistency loss and the voice cloning method based on cross-domain consistency loss can refer to each other, their description is not repeated here. The principle of the system embodiment is the same as that of the method embodiment, so the system embodiment also has the corresponding technical effects of the method embodiment.
Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be accomplished by way of a computer program to instruct associated hardware, where the program may be stored on a computer readable storage medium. Wherein the computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory, etc.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.
Claims (10)
1. A voice cloning method based on cross-domain consistency loss is characterized by comprising the following steps:
collecting the audio of a cloned object, acquiring voiceprint characteristics of the cloned object from the audio, and transmitting the audio of the cloned object into a pre-trained voice posterior map model to acquire a cloned object voice posterior map;
training an acoustic model to obtain a source acoustic model based on a corpus and a voice posterior graph model, and transferring the learning source acoustic model according to the cloned object voice posterior graph and cross-domain consistency loss to obtain a self-adaptive acoustic model;
training a vocoder based on the corpus and voiceprint characteristics of the corpus sample to obtain a robust vocoder;
obtaining phonemes and prosody of a text to be synthesized, transmitting the phonemes and prosody into a self-adaptive acoustic model to obtain a Mel frequency spectrum to be synthesized, transmitting the Mel frequency spectrum to be synthesized into a robust vocoder to perform voice synthesis, and outputting synthesized audio of a target clone object according to voiceprint characteristics of the selected target clone object.
2. The method of claim 1, wherein the acoustic model comprises a phoneme and prosody encoder module, a speech posterior graph encoder module, a duration prediction module, a pitch prediction module, a volume prediction module, and a decoder module; wherein the phoneme and prosody encoder module, the speech posterior graph encoder module and the decoder module are each composed of a plurality of feed-forward Transformer layers; and the duration prediction module, the pitch prediction module and the volume prediction module are each composed of a plurality of convolution layers.
3. The method of claim 2, wherein conditional layer normalization is used in the feed-forward layer of each Transformer in the decoder module, the conditional layer normalization comprising a speaker embedding layer and two linear layers, defined as follows:
where x is the vector before normalization, μ is the mean of x, σ is the standard deviation of x, γ represents the learnable scale in the conditional layer normalization, and β represents the learnable offset in the conditional layer normalization.
4. The method for cloning speech based on cross-domain consistency loss according to claim 3, wherein training the acoustic model to obtain the source acoustic model based on the corpus and the speech posterior graph model comprises:
transmitting the corpus sample in a pre-trained voice posterior graph model to obtain a voice posterior graph of the corpus sample;
the phonetic posterior diagram of the corpus sample is transmitted to a phonetic posterior diagram encoder module, and the phonemes and prosody of the corpus sample are transmitted to a phonemes and prosody encoder module;
inputting the output of the phoneme and prosody encoder module, together with the speaker ID corresponding to the corpus sample, sequentially into the duration prediction module, the pitch prediction module and the volume prediction module, to respectively obtain predicted values of the pronunciation duration, pronunciation pitch and pronunciation volume of each phoneme of the corpus sample, and then inputting the predicted values and the speaker ID into the decoder module to obtain the predicted mel spectrum of the corpus sample;
training to obtain a source acoustic model according to errors between the outputs of the phonetic posterior graph encoder module and the phoneme and prosody encoder module, errors between predicted values and true values of the duration, the pronunciation pitch and the pronunciation volume of each phoneme of the corpus sample, and errors between the predicted mel frequency spectrum and the true mel frequency spectrum.
5. The method for voice cloning based on cross-domain consistency loss according to claim 4, wherein the obtaining the adaptive acoustic model according to the cloning object voice posterior graph and the cross-domain consistency loss transfer learning source acoustic model comprises:
constructing a target acoustic model according to the trained source acoustic model; in the target acoustic model, all parameters are frozen except the conditional layer normalization parameters in the feed-forward layer of each Transformer in the decoder module;
and taking the cloned object voice posterior graph as a learning sample to be transmitted into a trained source acoustic model and a trained target acoustic model, calculating the distance between the characteristic spaces of decoder modules in the source acoustic model and the target acoustic model, taking the distance as cross-domain consistency loss, taking the loss of the predicted Mel spectrum output by the target acoustic model and the loss of the real Mel spectrum of the learning sample as a loss function, and training the target acoustic model to obtain the self-adaptive acoustic model.
6. The method for voice cloning based on cross-domain consistency loss according to claim 5, wherein the step of transferring the cloned object voice posterior graph as a learning sample into the source acoustic model and the target acoustic model, and calculating the distance between the feature spaces of the decoder modules in the source acoustic model and the target acoustic model, comprises:
in each layer of the decoder module, each learning sample is sequentially taken out from the current batch, cosine similarity of the current learning sample and other learning samples in the same batch in a characteristic space of a source acoustic model and a characteristic space of a target acoustic model is calculated, probability distribution of the cosine similarity in the source acoustic model and the target acoustic model is calculated through a Softmax network layer, and distance between two probability distributions of the current learning sample in the current layer is obtained through KL divergence;
the distances between the two probability distributions for each layer for all the learning samples in the current batch are summed and averaged as the distances for the feature space of the decoder modules in the current batch source acoustic model and the target acoustic model.
7. The method of claim 1, wherein the vocoder is based on a HiFi-GAN model, comprising a generator module and a discriminator module, wherein the generator module comprises transposed convolution layers of different receptive fields, and wherein a one-dimensional convolution operation is added to each layer for mapping voiceprint features into voiceprint feature vectors of the same dimension as the mel-spectrum hidden variable features of the current layer.
8. The method for voice cloning based on cross-domain consistency loss according to claim 1, wherein the training the vocoder to obtain the robust vocoder based on the voice print characteristics of the corpus and the corpus sample comprises:
inputting the Mel frequency spectrum of the corpus sample in the vocoder, sequentially adding the corresponding voiceprint feature vector of the voiceprint feature of the corpus sample mapped at the current layer into each layer of Mel frequency spectrum hidden variable features of the generator module, and training to obtain a robust vocoder by taking errors of the generator module, errors of the discriminator module, errors of the Mel frequency spectrum and feature matching errors of each discriminator module as loss functions.
9. The voice cloning method based on cross-domain consistency loss according to claim 1, wherein transmitting the mel spectrum to be synthesized into the robust vocoder for voice synthesis and outputting the synthesized audio of the target clone object according to the voiceprint features of the selected target clone object comprises: adding the voiceprint feature vector, obtained by mapping the voiceprint features of the selected target clone object at the current layer, into each layer of mel spectrum hidden variable features of the generator in the robust vocoder, the generator in the robust vocoder finally outputting the synthesized audio of the target clone object.
10. A cross-domain consistency loss-based speech cloning system, comprising:
the audio processing module is used for acquiring the audio of the cloned object, acquiring voiceprint characteristics of the cloned object from the audio, transmitting the audio of the cloned object into the pre-trained voice posterior map model, and acquiring a voice posterior map of the cloned object;
the self-adaptive acoustic model training module is used for training the acoustic model to obtain a source acoustic model based on a corpus and a voice posterior graph model, and acquiring the self-adaptive acoustic model by transferring the learning source acoustic model according to the cloned object voice posterior graph and the cross-domain consistency loss;
the robust vocoder training module is used for training the vocoder to obtain a robust vocoder based on the corpus and the voiceprint characteristics of the corpus sample;
the voice audio synthesis module is used for acquiring phonemes and prosody of the text to be synthesized, transmitting the phonemes and prosody into the adaptive acoustic model to obtain a Mel frequency spectrum to be synthesized, transmitting the Mel frequency spectrum to be synthesized into the robust vocoder to perform voice synthesis, and outputting the synthesized audio of the target cloned object according to the voiceprint characteristics of the selected target cloned object.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211570251.4A CN116229932A (en) | 2022-12-08 | 2022-12-08 | Voice cloning method and system based on cross-domain consistency loss |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211570251.4A CN116229932A (en) | 2022-12-08 | 2022-12-08 | Voice cloning method and system based on cross-domain consistency loss |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116229932A true CN116229932A (en) | 2023-06-06 |
Family
ID=86568655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211570251.4A Pending CN116229932A (en) | 2022-12-08 | 2022-12-08 | Voice cloning method and system based on cross-domain consistency loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116229932A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117649839A (en) * | 2024-01-29 | 2024-03-05 | 合肥工业大学 | Personalized speech synthesis method based on low-rank adaptation |
CN117649839B (en) * | 2024-01-29 | 2024-04-19 | 合肥工业大学 | Personalized speech synthesis method based on low-rank adaptation |
CN117711374A (en) * | 2024-02-01 | 2024-03-15 | 广东省连听科技有限公司 | Audio-visual consistent personalized voice synthesis system, synthesis method and training method |
CN117711374B (en) * | 2024-02-01 | 2024-05-10 | 广东省连听科技有限公司 | Audio-visual consistent personalized voice synthesis system, synthesis method and training method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11837216B2 (en) | Speech recognition using unspoken text and speech synthesis | |
US20200402497A1 (en) | Systems and Methods for Speech Generation | |
CN105741832B (en) | Spoken language evaluation method and system based on deep learning | |
WO2019214047A1 (en) | Method and apparatus for establishing voice print model, computer device, and storage medium | |
CN110600055B (en) | Singing voice separation method using melody extraction and voice synthesis technology | |
CN116229932A (en) | Voice cloning method and system based on cross-domain consistency loss | |
CN111508470B (en) | Training method and device for speech synthesis model | |
Ai et al. | A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis | |
CN105280177A (en) | Speech synthesis dictionary creation device, speech synthesizer, speech synthesis dictionary creation method | |
Dong | Application of artificial intelligence software based on semantic web technology in english learning and teaching | |
WO2021212954A1 (en) | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources | |
CN114023300A (en) | Chinese speech synthesis method based on diffusion probability model | |
CN113539268A (en) | End-to-end voice-to-text rare word optimization method | |
Kim | Singing voice analysis/synthesis | |
Liu et al. | AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning | |
CN117012230A (en) | Evaluation model for singing pronunciation and character biting | |
CN116798403A (en) | Speech synthesis model method capable of synthesizing multi-emotion audio | |
Nazir et al. | Deep learning end to end speech synthesis: A review | |
CN114333762B (en) | Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium | |
CN115359775A (en) | End-to-end tone and emotion migration Chinese voice cloning method | |
CN116092471A (en) | Multi-style personalized Tibetan language speech synthesis model oriented to low-resource condition | |
JP7357518B2 (en) | Speech synthesis device and program | |
Wang et al. | Research on correction method of spoken pronunciation accuracy of AI virtual English reading | |
Gao et al. | An investigation of the target approximation model for tone modeling and recognition in continuous Mandarin speech | |
CN116403562B (en) | Speech synthesis method and system based on semantic information automatic prediction pause |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||