CN115240630A - Method and system for converting Chinese text into personalized voice - Google Patents

Method and system for converting Chinese text into personalized voice

Info

Publication number
CN115240630A
CN115240630A (application number CN202210867600.2A)
Authority
CN
China
Prior art keywords
speaker
voice
converting
text
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210867600.2A
Other languages
Chinese (zh)
Inventor
许庆阳
滕俊
李国光
宋勇
袁宪锋
庞豹
李贻斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202210867600.2A
Publication of CN115240630A
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a method and a system for converting Chinese text into personalized voice, belonging to the technical field of speech synthesis. A trained speaker encoder extracts a fixed-length speaker feature embedding vector from a reference voice of a speaker to serve as the speaker's acoustic features; a multi-speaker speech synthesis model Syn converts the text to be converted into a Mel spectrogram corresponding to the speaker feature embedding vector; the Mel spectrum is then converted into the corresponding time-domain speech waveform, and the final audio is output. The invention integrates the implicit modeling of an adaptive condition module with the explicit modeling of the speaker encoder network GCNet and adopts an end-to-end feedback-constraint training mechanism to clone the voices of both seen and unseen speakers, significantly improving the naturalness and similarity of the synthesized speech.

Description

Method and system for converting Chinese text into personalized voice
Technical Field
The invention belongs to the technical field of voice synthesis, and particularly relates to a method and a system for converting a Chinese text into a personalized voice.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Speech synthesis, also known as Text-to-Speech (TTS), is a technique that aims to synthesize a speech waveform from given text. The technology is now widely used in everyday applications such as mobile phone assistants, audiobooks, in-car voice navigation, intelligent customer service and companion robots. Traditional speech synthesis techniques include concatenative methods and statistical parametric synthesis; their drawbacks include demanding requirements on the phoneme speech-segment search library, poor synthesized speech quality with an obvious machine-like sound, and a strong reliance on professional backgrounds in linguistics, acoustics and related fields. With the rapid development of deep learning, end-to-end speech synthesis achieves higher naturalness: a text sequence can be modeled directly to generate a spectrogram, which is then converted into a speech waveform, making the speech synthesis pipeline simple.
Autoregressive speech synthesis models produce speech of high quality, comparable to the human voice, but their synthesis efficiency is low and they cannot synthesize the speech corresponding to a text in real time. Non-autoregressive models are efficient, and both training and inference are faster than for autoregressive models, but the naturalness of the synthesized speech is not as good. In application scenarios such as human-computer interaction with a companion robot, real-time performance is the primary requirement, so a non-autoregressive model better meets the need.
There are mainly two classes of methods for converting text to personalized speech: adaptation-based methods and methods based on speaker feature embedding.
In adaptation-based methods, a large amount of high-quality speech data is used to train a model with strong generalization, and the network is then fine-tuned with a small amount of data from the target speakers to be synthesized.
Methods based on speaker feature embedding mainly rely on transfer learning through a speaker encoder, which extracts a speaker embedding vector of fixed dimension representing the speaker's identity. Such methods only require audio data containing the speaker's voice, without Chinese text labels, so data collection is simple and convenient; however, the naturalness and similarity of the synthesized speech are often inferior to those of an adaptive system.
The biggest challenge for current speech synthesis technology is how to efficiently obtain speech with high naturalness and high similarity to the target speaker from only a small amount of that speaker's voice data; existing methods remain inefficient and their results unsatisfactory.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a system for converting Chinese text into personalized voice, which integrate the implicit modeling of an adaptive condition module with the explicit modeling of the speaker encoder network GCNet and adopt an end-to-end feedback-constraint training mechanism, realizing voice cloning for both seen and unseen speakers and significantly improving the naturalness and similarity of the synthesized speech.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the invention provides a method for converting a Chinese text into a personalized voice in a first aspect;
a method for converting Chinese text into personalized voice comprises the following steps:
extracting a speaker characteristic embedding vector with a fixed length from a reference voice of a speaker by using a trained speaker encoder to be used as an acoustic characteristic of the speaker;
converting the text to be converted into a Mel spectrogram corresponding to the speaker characteristic embedded vector by using a multi-speaker speech synthesis model Syn;
and converting the Mel frequency spectrum into a corresponding time domain voice waveform, and outputting a final audio.
Furthermore, the training data set of the speaker encoder contains m × n sentences, where n is the number of speakers and m is the number of sentences per speaker.
Further, the loss function of the speaker encoder is:
L(e_{ij}) = -s_{ij,i} + log Σ_{k=1}^{n} exp(s_{ij,k}), 1 ≤ j ≤ m, 1 ≤ i ≤ n
where e_{ij} is the utterance embedding vector of the j-th utterance of the i-th speaker, and s_{ij,k} is the cosine similarity between the utterance embedding vector e_{ij} and the speaker feature embedding vector se_k.
Furthermore, the multi-speaker speech synthesis model Syn is composed of an encoder, a phoneme prosody predictor and a decoder;
in the training process of the model, a feedback constraint is set and the model is trained together with the speaker encoder, so that the multi-speaker speech synthesis model Syn can better learn the timbre and pitch information of the speaker.
Further, the feedback constraint specifically includes:
the synthesized Mel frequency spectrum output by the decoder is input into a speaker encoder, the distance between the speaker embedded vector of the extracted synthesized Mel frequency spectrum and the speaker embedded vector of the real audio is used as an optimization function, a mean square error function (MSE) is selected as a loss function of feedback constraint, and a multi-speaker speech synthesis model Syn is trained.
Furthermore, before the text to be converted is input into the multi-speaker speech synthesis model Syn, it is preprocessed by converting it into a phoneme sequence.
Furthermore, melGAN is used as a vocoder to convert the synthesized Mel frequency spectrum into a corresponding time domain voice waveform.
In a second aspect, the invention provides a system for converting Chinese text to personalized speech.
A system for converting Chinese text into personalized voice comprises a speaker coding module, a voice synthesis module and an audio conversion module;
a speaker encoding module configured to: extracting a speaker characteristic embedding vector with a fixed length from a reference voice of a speaker by using a trained speaker encoder to be used as an acoustic characteristic of the speaker;
a speech synthesis module configured to: converting the text to be converted into a Mel spectrogram corresponding to the speaker characteristic embedded vector by using a multi-speaker speech synthesis model Syn;
an audio conversion module configured to: and converting the Mel frequency spectrum into a corresponding time domain voice waveform, and outputting a final audio.
A third aspect of the present invention provides a computer readable storage medium having stored thereon a program which, when executed by a processor, performs the steps in a method for converting chinese text to personalized speech according to the first aspect of the present invention.
A fourth aspect of the present invention provides an electronic device, comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for converting chinese text to personalized speech according to the first aspect of the present invention when executing the program.
The above one or more technical solutions have the following beneficial effects:
the invention provides a Chinese voice clone model integrating a speaker self-adaptive condition module and an independently trained speaker encoder network GCNet, combines a feedback constraint training mechanism, utilizes less speaker voice data, efficiently acquires voices with higher speaker naturalness and similarity, and solves the problems of low efficiency and poor effect of the conventional method.
An adaptive condition module in the model can implicitly learn the identity information of the speaker and improve the quality of generated sound; GCNet can explicitly learn speaker characteristics from the feature space, and feedback constraints improve the similarity of speech generation for invisible speakers. The entire model can produce speech of higher quality and with higher similarity to speakers for speakers that have appeared during training, and for speakers that have never appeared during training.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2 is a diagram showing a structure of a model of a speaker encoder in the first embodiment;
FIG. 3 is a block diagram of a multi-talker speech synthesis model Syn in a first embodiment;
FIG. 4 is a block diagram of an adaptive condition module in the first embodiment;
FIG. 5 is a comparison of synthesized Mel spectra when the speaker embedding vector is connected at different positions in the first embodiment;
FIG. 6 shows speaker feature embedding vectors in the experiment of the first embodiment;
FIG. 7 is a speaker sentence gender projection in the first embodiment experiment;
FIG. 8 is a comparison between before and after the self-adaptive module is added to generate the Mel frequency spectrum in the experiment of the first embodiment;
FIG. 9 is a diagram of the real Mel spectrum and the synthesized Mel spectrum in the seen-dep corpus according to the first embodiment;
FIG. 10 shows the real Mel spectrum and the synthesized Mel spectrum in the unseen-dep corpus according to the first embodiment;
FIG. 11 shows speaker feature vector similarity in the experiment of the first embodiment;
fig. 12 is a system configuration diagram of the second embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular is intended to include the plural unless the context clearly dictates otherwise, and furthermore, it should be understood that the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the invention may be combined with each other without conflict.
The invention provides a Chinese speech clone model combining a self-adaptive speaker system and a speaker encoder, which is a non-autoregressive sequence-to-sequence generation model, integrates explicit timbre and speaker style modeling with implicit timbre and speaker style modeling, and adopts a feedback constraint training mechanism for improving the naturalness and similarity of synthesized speech.
Example one
The embodiment discloses a method for converting a Chinese text into a personalized voice, as shown in fig. 1, comprising the following steps:
s1, extracting a speaker characteristic embedding vector with a fixed length from a reference voice of a speaker by using a trained speaker encoder to serve as an acoustic characteristic of the speaker;
based on the advantages of convolution and GRU networks, a GCNet model is proposed as a speaker coder. The convolutional layer gives the model the ability of sensing information such as tone and the like of voice, meanwhile, parallel computation can be realized, the GRU network integrates the output of each time step in the sequence, and the context characteristics of the sequence are extracted, so that speakers can be better distinguished and convergence is faster.
The speaker's reference speech is preprocessed into the corresponding Mel spectrum; the Mel spectrum is input to the trained speaker encoder, which outputs a speaker feature embedding vector representing the speaker's acoustic features.
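For illustration, a minimal preprocessing sketch in Python, assuming librosa and the 80-band, 16 kHz configuration used later in this embodiment (the FFT size and hop length are assumptions, not values taken from the patent):

```python
import librosa
import numpy as np

def reference_to_mel(wav_path, sr=16000, n_mels=80, n_fft=1024, hop_length=256):
    """Convert a reference utterance into a log-Mel spectrogram of shape (frames, 80)."""
    wav, _ = librosa.load(wav_path, sr=sr)                   # load and resample to 16 kHz
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))   # log compression with a floor
    return log_mel.T                                         # (T, 80): one 80-dim frame per time step
```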
The model structure of the speaker encoder is shown in FIG. 2. The input 80-dimensional Mel spectrum first passes through 2 one-dimensional convolution (Conv1D) layers containing 256 units, as shown in equations (1)-(2):
f_1 = LN(ReLU(F_1(x)))    (1)
f_2 = LN(ReLU(F_2(f_1)))    (2)
where x is the input Mel spectrum sequence, F_1 and F_2 are the one-dimensional convolutional layers, LN denotes layer normalization, and f_2 is the final output of the two convolutional layers.
f_2 then passes through 3 GRU layers containing 128 units each; the hidden-state output of the third GRU layer at the last frame is fed through a 128-dimensional linear layer and L2-normalized to obtain the final 128-dimensional speaker utterance embedding vector, as shown in equations (3)-(4):
h = Recurrency(f_2)    (3)
e = L2Norm(ReLU(F_3(h)))    (4)
where Recurrency is the 3-layer recurrent GRU network, F_3 is a linear layer, h is the hidden-state output of the last GRU layer, L2Norm denotes L2 normalization, and e denotes the speaker utterance embedding vector.
The speaker encoder thus extracts a fixed 128-dimensional speaker utterance embedding vector from an input Mel spectrum sequence. The speaker encoder participates in the training and inference of the multi-speaker speech synthesis system, but its model parameters are no longer updated or optimized.
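A minimal PyTorch sketch of a GCNet-style encoder following equations (1)-(4): two 256-unit Conv1D layers with ReLU and layer normalization, three 128-unit GRU layers, a 128-dimensional linear layer and L2 normalization. Layer sizes follow the text; kernel sizes and other details are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class GCNet(nn.Module):
    """Speaker encoder: Mel spectrogram (B, T, 80) -> 128-dim utterance embedding."""
    def __init__(self, n_mels=80, conv_dim=256, gru_dim=128, emb_dim=128):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, conv_dim, kernel_size=5, padding=2)    # F1
        self.conv2 = nn.Conv1d(conv_dim, conv_dim, kernel_size=5, padding=2)  # F2
        self.ln1, self.ln2 = nn.LayerNorm(conv_dim), nn.LayerNorm(conv_dim)
        self.gru = nn.GRU(conv_dim, gru_dim, num_layers=3, batch_first=True)  # Recurrency
        self.proj = nn.Linear(gru_dim, emb_dim)                               # F3

    def forward(self, mel):                                    # mel: (B, T, 80)
        x = mel.transpose(1, 2)                                # (B, 80, T) for Conv1d
        f1 = self.ln1(F.relu(self.conv1(x)).transpose(1, 2))                        # eq. (1)
        f2 = self.ln2(F.relu(self.conv2(f1.transpose(1, 2))).transpose(1, 2))       # eq. (2)
        h, _ = self.gru(f2)                                    # eq. (3), h: (B, T, 128)
        e = F.relu(self.proj(h[:, -1, :]))                     # top-layer output of the last frame
        return F.normalize(e, p=2, dim=-1)                     # eq. (4): L2 normalization
```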
In the training phase of the speaker encoder, there are n speakers in the training set, each with m sentences. The Mel spectrum X_{ij} of the j-th sentence of the i-th speaker (1 ≤ i ≤ n, 1 ≤ j ≤ m) serves as the input feature sequence. The overall output of the speaker encoder network GCNet is denoted GC(X_{ij}; W_GC), where W_GC represents all parameters of the GCNet network (convolutional layers, GRU layers, linear layer, etc.). GC(X_{ij}; W_GC) is L2-normalized to obtain the 128-dimensional speaker utterance embedding vector e_{ij}, as shown in equation (5):
e_{ij} = L2Norm(GC(X_{ij}; W_GC))    (5)
e_{ij} denotes the utterance embedding vector of the j-th utterance of the i-th speaker. The speaker embedding vector se_i is the centroid of that speaker's utterance embedding vectors, as shown in equation (6):
se_i = (1/m) Σ_{j=1}^{m} e_{ij}    (6)
The exclusive feature vector se_i^{(-j)}, which leaves out the j-th utterance, is expressed as shown in equation (7):
se_i^{(-j)} = (1/(m-1)) Σ_{k=1,k≠j}^{m} e_{ik}    (7)
s_{ij,k} denotes the cosine similarity between each speaker utterance embedding vector e_{ij} and the speaker embedding vector se_k, with the exclusive centroid se_i^{(-j)} used when k = i, as shown in equation (8):
s_{ij,k} = cos(e_{ij}, se_i^{(-j)}) if k = i, and s_{ij,k} = cos(e_{ij}, se_k) if k ≠ i    (8)
The GCNet loss function is shown in equation (9); it aims to make each speaker utterance embedding vector as close as possible to the speaker embedding vector it belongs to, while keeping it as far as possible from the embedding vectors of the other speakers:
L(e_{ij}) = -s_{ij,i} + log Σ_{k=1}^{n} exp(s_{ij,k})    (9)
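A sketch of this training objective under the centroid definitions of equations (5)-(9); it mirrors a generalized end-to-end style of speaker-verification training, and details such as the absence of a learned scale or bias on the similarities are assumptions:

```python
import torch
import torch.nn.functional as F

def gcnet_loss(emb):
    """emb: (n, m, d) utterance embeddings e_ij for n speakers with m utterances each."""
    n, m, _ = emb.shape
    centroids = F.normalize(emb.mean(dim=1), dim=-1)              # se_i, eq. (6)
    excl = (emb.sum(dim=1, keepdim=True) - emb) / (m - 1)         # se_i^(-j), eq. (7)

    # cosine similarities s_ij,k against every speaker centroid, eq. (8)
    sim = torch.einsum('nmd,kd->nmk', F.normalize(emb, dim=-1), centroids)   # (n, m, n)
    own = F.cosine_similarity(emb, excl, dim=-1)                  # s_ij,i using the exclusive centroid
    mask = torch.eye(n, dtype=torch.bool, device=emb.device)[:, None, :]     # (n, 1, n)
    sim = torch.where(mask, own.unsqueeze(-1), sim)               # substitute the k == i entries

    # eq. (9): pull e_ij toward its own speaker centroid, push it away from the others
    return (-own + torch.logsumexp(sim, dim=-1)).mean()
```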
S2, converting the text to be converted into a Mel spectrogram corresponding to the speaker characteristic embedded vector by using a multi-speaker voice synthesis model Syn;
the model for synthesizing a multi-speaker speech is a sequence-to-sequence network of non-autoregressive multi-speaker speech synthesis, also called a spectrum generation network, and takes a text phoneme sequence and a reference speaker speech embedding vector as conditional inputs to convert the input phoneme sequence into a Mel spectrogram of a speaker characteristic corresponding to the input phoneme sequence.
Model structure
The structure of the multi-speaker speech synthesis model Syn, shown in FIG. 3, includes three parts: an encoder, a phoneme prosody predictor and a decoder. The text to be converted is preprocessed into a phoneme sequence before being input into Syn. The phoneme prosody predictor consists of an adaptive condition module, a GST module and a variance adaptor; the decoder comprises a Mel spectrum decoder and a post-processing network Post-Net.
The adaptive condition module is used for better modeling acoustic features with different fine granularities and comprises three modules: adaptive Module1, adaptive Module2 and Speaker Module;
Adaptive Module1 is used to extract utterance-level characteristics of the speaker, such as the emotion, mood and rhythm of the utterance;
Adaptive Module2 mainly extracts finer-grained acoustic features, such as the pitch and pronunciation duration of the phonemes in the speaker's utterance;
The Speaker Module extracts coarse-grained acoustic features of the speaker, such as the speaker's overall timbre.
The adaptive condition module is shown in FIG. 4, where Conv1D(m, n, p) denotes a one-dimensional convolution with kernel size m, stride n and padding p.
The phoneme sequence is encoded by the encoder module to obtain the output g_{ij}, which together with the speaker utterance embedding vector e_{ij} forms the input sequence of the adaptive condition module, denoted p_{ij}. p_{ij} first passes through Adaptive Module1, as shown in equations (10)-(11):
m_{ij} = LN(ReLU(Conv1D(p_{ij})))    (10)
p'_{ij} = Pool(LN(ReLU(Conv1D(m_{ij}))))    (11)
where Conv1D is a one-dimensional convolutional layer, LN is layer normalization, Pool is a max-pooling layer, and p'_{ij} is the output of Adaptive Module1.
The sum of p_{ij} and p'_{ij} is then used as the input of Adaptive Module2, as shown in equations (12)-(14):
o_{ij} = p_{ij} + p'_{ij}    (12)
O'_{ij} = LN(ReLU(Conv1D(o_{ij})))    (13)
p''_{ij} = Conv(LN(ReLU(Conv1D(O'_{ij}))))    (14)
where o_{ij} is the input of Adaptive Module2, Conv1D is a one-dimensional convolutional layer, Conv is a 1 × 1 convolutional layer, and p''_{ij} is the output of Adaptive Module2.
Finally, p''_{ij} is added to the Speaker Module output obtained from the Speaker ID (speaker number) to predict the Mel spectrum of the speaker-specific speech, as shown in equations (15)-(16):
em_i = Embedding(Speaker ID)    (15)
p'''_{ij} = em_i + p''_{ij} + o_{ij}    (16)
where Embedding is an embedding layer, em_i is the embedding vector representation of the Speaker ID, and p'''_{ij} is the output of the entire adaptive condition module.
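A PyTorch sketch of an adaptive condition module following equations (10)-(16); kernel sizes, the length-preserving pooling and the module dimensions are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvLNBlock(nn.Module):
    """Conv1D -> ReLU -> LayerNorm on a (B, T, C) sequence, as in eqs. (10), (11), (13), (14)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.ln = nn.LayerNorm(dim)

    def forward(self, x):                                     # x: (B, T, C)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.ln(F.relu(y))

class AdaptiveConditionModule(nn.Module):
    """Adaptive Module1/Module2 plus Speaker Module, following eqs. (10)-(16)."""
    def __init__(self, n_speakers, dim=256):
        super().__init__()
        self.block1a, self.block1b = ConvLNBlock(dim), ConvLNBlock(dim)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)   # length-preserving Pool (assumption)
        self.block2a, self.block2b = ConvLNBlock(dim), ConvLNBlock(dim)
        self.conv1x1 = nn.Conv1d(dim, dim, kernel_size=1)              # the 1 x 1 Conv in eq. (14)
        self.spk_emb = nn.Embedding(n_speakers, dim)                   # Speaker Module, eq. (15)

    def forward(self, p, speaker_id):                         # p = p_ij: (B, T, dim), speaker_id: (B,)
        m = self.block1a(p)                                                    # eq. (10)
        p1 = self.pool(self.block1b(m).transpose(1, 2)).transpose(1, 2)        # eq. (11)
        o = p + p1                                                             # eq. (12)
        o1 = self.block2a(o)                                                   # eq. (13)
        p2 = self.conv1x1(self.block2b(o1).transpose(1, 2)).transpose(1, 2)    # eq. (14)
        em = self.spk_emb(speaker_id).unsqueeze(1)                             # eq. (15): (B, 1, dim)
        return em + p2 + o                                                     # eq. (16)
```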
In addition to the above structure, the multi-speaker speech synthesis model Syn is provided with a feedback-constraint setting: the multi-speaker speech synthesis network is forced to be trained together with the speaker encoder, so that identity information such as the timbre and tone of the speaker can be better learned. After the post-processing network Post-Net, the output Mel spectrum is fed into the speaker encoder and its speaker feature embedding vector is extracted as feedback information, further forcing the multi-speaker speech synthesis model to fully learn the speaker's identity information and the distribution of voice characteristics of the same speaker. Throughout the training of the multi-speaker speech synthesis model Syn, the speaker encoder no longer updates its parameters; only the speech synthesis model parameters are optimized.
For the feedback constraint, the speaker encoder extracts speaker utterance embedding vectors from both the Mel spectrum of the real audio and the Mel spectrum produced by the speech synthesis model; the distance between the two embedding vectors is taken as one of the optimization objectives, and the mean squared error (MSE) is selected as the loss function of the feedback constraint. The input of the whole multi-speaker speech synthesis system Syn is thus the phoneme sequence t_{ij} obtained by text preprocessing together with its corresponding speaker utterance embedding vector e_{ij}.
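A minimal sketch of the feedback-constraint term, assuming the speaker encoder's parameters are frozen and the function names are placeholders:

```python
import torch
import torch.nn.functional as F

def feedback_constraint_loss(speaker_encoder, mel_pred, mel_real):
    """MSE between embeddings of the synthesized and real Mel spectra."""
    # speaker_encoder parameters are assumed frozen (requires_grad=False);
    # gradients reach Syn only through mel_pred.
    with torch.no_grad():
        e_real = speaker_encoder(mel_real)       # embedding of the ground-truth audio
    e_pred = speaker_encoder(mel_pred)           # embedding of the synthesized Mel spectrum
    return F.mse_loss(e_pred, e_real)
```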
Training phase
The specific process of the multiple speaker speech synthesis model Syn training is as follows:
(1) The text in the training set is preprocessed and converted into the phoneme sequence t_{ij}.
(2) The phoneme sequence t_{ij} is encoded by the encoder module to obtain g_{ij}, as shown in equation (17):
g_{ij} = Encoder(t_{ij})    (17)
where Encoder is the encoder module.
the steps of the encoder module are:
embedding phoneme words into the input phoneme sequence through an nn.
Performing position coding by using a trigonometric function in a transform, and increasing the relative position of each phoneme for word embedding;
the coder part in the transformer (except for converting the original fully-connected layer into a 2-layer one-dimensional convolutional layer (conv 1D)) is adopted as a phoneme coder, a phoneme sequence is converted into a phoneme hidden sequence, and information such as phoneme duration, energy, speaker characteristic embedded vectors and pitch is added into the hidden sequence by a subsequent phoneme prosody predictor.
(3) g_{ij} and the utterance embedding vector e_{ij} are passed to the phoneme prosody predictor, which predicts prosodic information (such as phoneme pronunciation duration, phoneme fundamental frequency, the explicit speaker embedding vector and phoneme energy) to obtain p̂_{ij}, as shown in equation (18):
p̂_{ij} = Prec(g_{ij}, e_{ij})    (18)
where Prec is the phoneme prosody predictor module;
(4) The sum of p̂_{ij} and the utterance embedding vector e_{ij} is decoded by the decoder module to obtain X̂_{ij}, as shown in equation (19):
X̂_{ij} = PostNet(MSD(p̂_{ij} + e_{ij}))    (19)
where MSD is the Mel spectrum decoder module, PostNet is the post-processing network, and X̂_{ij} is the Mel spectrum sequence predicted by Syn for the utterance;
the overall model loss function is shown in equation 20:
Figure BDA0003760055870000117
wherein X ij ,e ij ,t ij ,d ij ,f0 ij ,en ij A real mel-frequency spectrum sequence, an utterance embedding vector, a phoneme sequence, a phoneme pronunciation duration, a phoneme fundamental frequency and a phoneme energy which respectively correspond to the jth utterance of the ith speaker;
Figure BDA0003760055870000121
representing the predicted Mel frequency spectrum of the synthesis system
Figure BDA0003760055870000122
Utterance embedding vectors extracted by a speaker encoder GCNet;
Figure BDA0003760055870000123
representing the phoneme basic frequency, phoneme pronunciation duration and phoneme energy of the jth utterance of the predicted ith speaker; alpha, beta, epsilon is more than or equal to 0 and is a constant;
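A sketch of how the terms of equation (20) could be combined in code; the grouping of terms and the default weights mirror the reconstruction above and are assumptions:

```python
import torch.nn.functional as F

def syn_loss(out, target, alpha=1.0, beta=0.1, eps=0.1):
    """out/target: dicts holding mel, speaker embedding, duration, f0 and energy tensors."""
    return (F.mse_loss(out["mel"], target["mel"])                               # spectrum term
            + alpha * F.mse_loss(out["spk_emb"], target["spk_emb"])             # feedback constraint
            + beta * (F.mse_loss(out["duration"], target["duration"])
                      + F.mse_loss(out["f0"], target["f0"]))                    # prosody terms
            + eps * F.mse_loss(out["energy"], target["energy"]))
```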
prediction phase
After model training is completed, the reference audio is passed through the speaker encoder to obtain a 128-dimensional speaker embedding vector se_i representing the identity of the reference speaker, and the text to be converted is preprocessed into the phoneme sequence t_{ij}. The speaker embedding vector se_i and the phoneme sequence t_{ij} are fed into the trained multi-speaker speech synthesis model Syn to synthesize the Mel spectrum sequence X̂_{ij}, which serves as the predicted Mel spectrum sequence corresponding to the text, and the vocoder converts it into the speech waveform, as shown in equations (21)-(22):
X̂_{ij} = Syn(t_{ij}, se_i; W_Syn)    (21)
Ŷ_{ij} = V(X̂_{ij}; W_V)    (22)
where t_{ij} is the phoneme sequence of any input Chinese text; f̂0_{ij}, d̂_{ij} and ên_{ij} denote the predicted phoneme fundamental frequency, phoneme pronunciation duration and phoneme energy of the j-th utterance of the i-th speaker produced along the way; and Ŷ_{ij}, V, W_Syn and W_V denote the synthesized speech waveform, the vocoder, all parameters of the multi-speaker speech synthesis model and all parameters of the vocoder model, respectively.
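A sketch of the overall prediction path of equations (21)-(22); `text_to_phonemes` and the model interfaces are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def clone_voice(text, reference_mels, speaker_encoder, syn, vocoder, text_to_phonemes):
    """Synthesize `text` in the voice of the reference speaker (all models pre-trained)."""
    # se_i: average the utterance embeddings of the reference Mel spectrograms
    se = torch.stack([speaker_encoder(m.unsqueeze(0)).squeeze(0) for m in reference_mels]).mean(0)
    t = text_to_phonemes(text)                      # phoneme id sequence, shape (T,)
    mel_hat = syn(t.unsqueeze(0), se.unsqueeze(0))  # eq. (21): predicted Mel spectrum
    wav_hat = vocoder(mel_hat)                      # eq. (22): MelGAN-style vocoder
    return wav_hat.squeeze(0)
```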
And S3, converting the Mel frequency spectrum into a corresponding time domain voice waveform, and outputting a final audio.
To decode the synthesized Mel spectrum into an audio waveform, MelGAN is used as the vocoder to convert the Mel spectrum synthesized by the multi-speaker speech synthesis model Syn into the corresponding audio. MelGAN is a non-autoregressive feed-forward convolutional architecture that offers high synthesis speed without an obvious loss of audio quality, which benefits practical applications such as human-computer interaction. The input of the vocoder is the Mel spectrum sequence X_{ij}, and it outputs the corresponding audio Ŷ_{ij}.
During the training phase, the loss functions of the vocoder network V are shown in equations (23)-(24):
L_{D_k} = E_Y[max(0, 1 - D_k(Y_{ij}))] + E_{X,z}[max(0, 1 + D_k(G(X_{ij}, z)))]    (23)
L_G = Σ_k ( -E_{X,z}[D_k(G(X_{ij}, z))] + λ·L_FM(G, D_k) )    (24)
where Y_{ij} denotes the original audio, z denotes Gaussian noise, D_k is the k-th discriminator, G is the generator, and L_FM(G, D_k) is the feature matching function of MelGAN.
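A sketch of a hinge-style adversarial objective with feature matching, as in equations (23)-(24); the discriminator interface (each D_k returning a feature list and a score) and the noise input are assumptions:

```python
import torch.nn.functional as F

def melgan_losses(discriminators, generator, mel, real_wav, z, lambda_fm=10.0):
    """Hinge adversarial + feature-matching losses, eqs. (23)-(24); interfaces are assumed."""
    fake_wav = generator(mel, z)                               # G(X_ij, z)
    d_loss, g_loss = 0.0, 0.0
    for d in discriminators:                                   # multi-scale discriminators D_k
        real_feats, real_score = d(real_wav)                   # assumed: D_k returns (features, score)
        _, fake_score = d(fake_wav.detach())
        d_loss += F.relu(1.0 - real_score).mean() + F.relu(1.0 + fake_score).mean()   # eq. (23)

        fake_feats, fake_score_g = d(fake_wav)
        fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))    # L_FM(G, D_k)
        g_loss += -fake_score_g.mean() + lambda_fm * fm                               # eq. (24)
    return d_loss, g_loss
```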
In the prediction stage, the Mel spectrum sequence X̂_{ij} predicted for the text to be converted is input into the trained vocoder, which predicts the audio Ŷ_{ij} corresponding to the text.
Experiments
Experiments show that the method performs voice cloning for both seen and unseen speakers, and that voice similarity and naturalness are significantly improved.
(1) Chinese data set used for experiment
In the speaker encoder, the Chinese data set AISHELL-3 provided by Beijing Shell Shell Technology, the Chinese data set aidatatang_200zh provided by DataTang, and the corpus provided by Magic Data Technology Co., Ltd. are used as training corpora. AISHELL-3 contains 88035 recorded utterances (roughly 85 hours) from 218 speakers from different accent regions of China, recorded in a quiet indoor environment with a high-fidelity microphone (44.1 kHz, 16 bit). The aidatatang_200zh corpus contains 200 hours of speech recorded by 600 speakers on Android and iOS mobile phones in a quiet indoor environment (16 kHz, 16 bit). The MagicData corpus contains 755 hours of speech data recorded by 1080 speakers from all over China. In the speaker encoder experiment, the training splits of these data sets are used, audio that does not meet the experimental requirements (too short or too long) is removed, and 1633 speakers finally participate in training. All audio files are downsampled to 16000 Hz. The AISHELL-3 training split is used for the multi-speaker speech synthesis system. In the evaluation phase, the reference audio is drawn from the test and validation splits of the data sets described above.
(2) Speaker embedding vector mode
The speaker embedding vector is connected to the Mel-frequency-spectrum generating model in three ways: (1) connecting the speaker-embedded vector to an encoder output; (2) connecting it to the output of the decoder; (3) The speaker-embedded vector is connected to both the encoder and decoder outputs.
Experimental results show that the different embedding positions perform differently, and the best result is obtained by connecting the speaker embedding vector to both the encoder and decoder outputs. When the speaker embedding vector is added only to the encoder output at each step, the synthesized voice has poor quality, obvious background noise and reduced similarity to the real audio. When the speaker embedding is added only to the decoder output, the synthesized voice carries strong noise and its quality is unstable, probably because the speaker embedding vector representing the speaker's identity is not well matched to the language sequence output by the decoder; since this second connection mode essentially produces noise, its Mel spectra are not compared here. The other two modes generate Mel spectra for the same text, and the comparison in FIG. 5 shows that the third embedding mode reconstructs the high-frequency detail texture of the Mel spectrum better (see the red boxes). The final model therefore uses method (3), adding the speaker embedding vector at each time step of the encoder and decoder outputs.
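A sketch of connection mode (3): projecting the 128-dimensional speaker embedding and adding it to every time step of a hidden sequence, applied once to the encoder output and once to the decoder output; the projection layer is an assumption:

```python
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Add the (projected) speaker embedding to every frame of a hidden sequence."""
    def __init__(self, spk_dim=128, model_dim=256):
        super().__init__()
        self.proj = nn.Linear(spk_dim, model_dim)

    def forward(self, hidden, spk_emb):          # hidden: (B, T, model_dim), spk_emb: (B, spk_dim)
        return hidden + self.proj(spk_emb).unsqueeze(1)   # broadcast over all time steps

# In mode (3) this conditioning is applied twice: once to the encoder output and once to the decoder output.
```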
(3) Results of the evaluation
In the experiments, speech naturalness is evaluated by comparing details of the synthesized spectra. For speaker similarity, a speaker encoder pre-trained by a third party is adopted, speaker utterance embedding vectors are extracted from real and synthesized sentences respectively, and these embedding vectors are reduced to a two-dimensional space with the Uniform Manifold Approximation and Projection (UMAP) algorithm for visualization, so that the similarity of the synthesized voice can be observed more clearly.
Speaker encoder
To better understand the speaker embedding representation, the speaker embedding vectors produced by the speaker encoder network proposed in this embodiment are reduced to two dimensions with the UMAP algorithm for visualization. In this embodiment, 12 speakers (6 male and 6 female) were randomly selected from the test set, each speaking 20 utterances. As shown in FIG. 6, the speaker embedding representation obtained by GCNet distinguishes the utterances of different speakers well: utterances of the same speaker are clustered in the space, while utterances of different speakers are scattered. In addition, 10 of the twenty utterances of each speaker were randomly selected for UMAP projection, and the result shows that the proposed GCNet model also separates male and female speakers well, as shown in FIG. 7.
Naturalness of speech
In this section, a synthesis corpus is constructed for seen-speaker text-dependent evaluation (the speaker appears in the training set and the audio text is the same, seen-dep) and unseen-speaker text-dependent evaluation (the speaker does not appear in the training set and the audio text is the same, unseen-dep).
For a seen speaker, when synthesizing a new sentence, the speaker utterance embedding vectors of all of that speaker's utterances are averaged to form the final speaker embedding vector; for an unseen speaker, twenty utterances (about two minutes of audio) of the speaker are randomly selected, their speaker utterance embedding vectors are computed and averaged, and speech is then predicted from this average.
Experiments show that, when reconstructing the Mel spectrum of the same text (FIG. 8), the texture of the marked region is blurred before the adaptive condition module is added, resulting in poor synthesized sound quality; after the adaptive module and the feedback-constraint training scheme are added, the Mel spectrum details of the marked rectangular region are reconstructed better and the synthesized sound quality is higher.
In addition, two utterances are randomly selected from the corpus as real speech, and the Mel spectrum of the real speech and that of the speech synthesized by the final proposed model are plotted, as shown in FIGS. 9-10. The results show that the proposed model reconstructs the original sound well, recovering both the high-frequency and low-frequency parts of the Mel spectrum, and the naturalness of the synthesized speech is close to real human voice.
Speaker similarity
This embodiment constructs synthesis corpora for seen-speaker text-dependent (speaker in the training set, same audio text, seen-dep), seen-speaker text-independent (speaker in the training set, different audio text, seen-indep), unseen-speaker text-dependent (speaker not in the training set, same audio text, unseen-dep) and unseen-speaker text-independent (speaker not in the training set, different audio text, unseen-indep) evaluation. For an objective evaluation of speaker similarity, UMAP is used to visualize the utterances synthesized from the reference audio, with the results shown in FIG. 11. Objectively, for speakers seen in the training data the similarity of the synthesized speech to the reference is very close, and for few-shot synthesis from reference audio of speakers that do not appear in the corpus the similarity is also close.
The invention provides a Chinese voice cloning model that integrates a speaker adaptive condition module with an independently trained speaker encoder network GCNet, combined with a feedback-constraint training mechanism, to realize voice cloning for multiple Chinese speakers. Experiments show that the adaptive condition module implicitly learns the speaker's identity information and improves the generated sound quality, GCNet explicitly learns speaker characteristics from the feature space, and the feedback constraint improves the similarity of speech generated for unseen speakers. The whole model produces speech of higher quality and higher speaker similarity both for speakers that appeared during training and for speakers never seen during training.
Example two
The embodiment discloses a system for converting Chinese text into personalized voice;
as shown in fig. 12, a system for converting a chinese text to a personalized speech includes a speaker coding module, a speech synthesis module, and an audio conversion module;
a speaker coding module configured to: extracting a speaker characteristic embedding vector with a fixed length from a reference voice of a speaker by using a trained speaker encoder to be used as an acoustic characteristic of the speaker;
a speech synthesis module configured to: converting a text to be converted into a Mel spectrogram corresponding to a speaker characteristic embedding vector by using a multi-speaker voice synthesis model Syn;
an audio conversion module configured to: and converting the Mel frequency spectrum into a corresponding time domain voice waveform, and outputting a final audio.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in a method for chinese text to personalized speech conversion according to embodiment 1 of the present disclosure.
Example four
An object of the present embodiment is to provide an electronic device.
An electronic device, comprising a memory, a processor and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for converting a chinese text to a personalized speech according to embodiment 1 of the present disclosure.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A method for converting Chinese text into personalized voice is characterized by comprising the following steps:
extracting a speaker characteristic embedding vector with a fixed length from a reference voice of a speaker by using a trained speaker encoder to be used as an acoustic characteristic of the speaker;
converting the text to be converted into a Mel spectrogram corresponding to the speaker characteristic embedded vector by using a multi-speaker speech synthesis model Syn;
and converting the Mel frequency spectrum into a corresponding time domain voice waveform, and outputting a final audio.
2. The method of claim 1, wherein the training data set of the speaker encoder contains m × n sentences, where n is the number of speakers and m is the number of sentences per speaker.
3. The method of claim 1, wherein the loss function of the speaker encoder is:
L(e_{ij}) = -s_{ij,i} + log Σ_{k=1}^{n} exp(s_{ij,k})
where e_{ij} is the utterance embedding vector of the j-th utterance of the i-th speaker, and s_{ij,k} is the cosine similarity between the utterance embedding vector e_{ij} and the speaker feature embedding vector se_k.
4. The method of claim 1, wherein the multi-speaker speech synthesis model Syn comprises an encoder, a phoneme prosody predictor and a decoder;
in the training process of the model, a feedback constraint is set and the model is trained together with the speaker encoder, so that the multi-speaker speech synthesis model Syn can better learn the timbre and pitch information of the speaker.
5. The method for converting a chinese text to a personalized speech according to claim 1, wherein the feedback constraint specifically is:
the synthesized Mel frequency spectrum output by the decoder is input into a speaker encoder, the distance between the speaker embedded vector of the extracted synthesized Mel frequency spectrum and the speaker embedded vector of the real audio is used as an optimization function, a mean square error function (MSE) is selected as a loss function of feedback constraint, and a multi-speaker speech synthesis model Syn is trained.
6. The method as claimed in claim 1, wherein the pre-processing of text conversion into phoneme sequence is performed before the text to be converted is inputted into the multi-speaker speech synthesis model Syn.
7. The method of claim 1, wherein MelGAN is used as a vocoder to convert the synthesized Mel frequency spectrum into corresponding time domain speech waveform.
8. A system for converting Chinese text to personalized speech, characterized by: the device comprises a speaker coding module, a voice synthesis module and an audio conversion module;
a speaker encoding module configured to: extracting a speaker characteristic embedding vector with a fixed length from a reference voice of a speaker by using a trained speaker encoder to be used as an acoustic characteristic of the speaker;
a speech synthesis module configured to: converting the text to be converted into a Mel spectrogram corresponding to the speaker characteristic embedded vector by using a multi-speaker speech synthesis model Syn;
an audio conversion module configured to: and converting the Mel frequency spectrum into a corresponding time domain voice waveform, and outputting a final audio.
9. Computer readable storage medium, on which a program is stored which, when being executed by a processor, carries out the steps of a method for chinese text to personalized speech according to any of the claims 1-7.
10. Electronic equipment comprising a memory, a processor and a program stored on the memory and executable on the processor, characterized in that the processor implements the steps of a method for converting chinese text to personalized speech according to any of claims 1-7 when executing the program.
CN202210867600.2A 2022-07-22 2022-07-22 Method and system for converting Chinese text into personalized voice Pending CN115240630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210867600.2A CN115240630A (en) 2022-07-22 2022-07-22 Method and system for converting Chinese text into personalized voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210867600.2A CN115240630A (en) 2022-07-22 2022-07-22 Method and system for converting Chinese text into personalized voice

Publications (1)

Publication Number Publication Date
CN115240630A true CN115240630A (en) 2022-10-25

Family

ID=83674500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210867600.2A Pending CN115240630A (en) 2022-07-22 2022-07-22 Method and system for converting Chinese text into personalized voice

Country Status (1)

Country Link
CN (1) CN115240630A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435650A (en) * 2020-11-11 2021-03-02 四川长虹电器股份有限公司 Multi-speaker and multi-language voice synthesis method and system
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method
CN112802448A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Speech synthesis method and system for generating new tone
CN114283777A (en) * 2020-09-18 2022-04-05 北京有限元科技有限公司 Speech synthesis method, apparatus and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283777A (en) * 2020-09-18 2022-04-05 北京有限元科技有限公司 Speech synthesis method, apparatus and storage medium
CN112435650A (en) * 2020-11-11 2021-03-02 四川长虹电器股份有限公司 Multi-speaker and multi-language voice synthesis method and system
CN112802448A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Speech synthesis method and system for generating new tone
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
许启航: "Design of a voiceprint recognition algorithm based on long short-term memory networks" (基于长短时记忆网络的声纹识别算法的设计), China Master's Theses Full-text Database, Information Science and Technology, no. 03, 15 March 2022 (2022-03-15), pages 27 - 31 *

Similar Documents

Publication Publication Date Title
Yu et al. DurIAN: Duration Informed Attention Network for Speech Synthesis.
Zhang et al. Visinger: Variational inference with adversarial learning for end-to-end singing voice synthesis
Wali et al. Generative adversarial networks for speech processing: A review
Tachibana et al. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation
Chen et al. A deep generative architecture for postfiltering in statistical parametric speech synthesis
Song et al. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems
CN111179905A (en) Rapid dubbing generation method and device
Ai et al. A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis
Luo et al. Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform
JP6876642B2 (en) Speech conversion learning device, speech conversion device, method, and program
Wu et al. Quasi-periodic parallel WaveGAN: A non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network
EP4229624A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
US20240087558A1 (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
KR20200088263A (en) Method and system of text to multiple speech
Lian et al. Whisper to normal speech conversion using sequence-to-sequence mapping model with auditory attention
Guo et al. MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS
Matsubara et al. Harmonic-Net: Fundamental frequency and speech rate controllable fast neural vocoder
Ramos Voice conversion with deep learning
KR20190135853A (en) Method and system of text to multiple speech
Yang et al. Electrolaryngeal speech enhancement based on a two stage framework with bottleneck feature refinement and voice conversion
Tanaka et al. VAE-SPACE: Deep generative model of voice fundamental frequency contours
Zhao et al. Research on voice cloning with a few samples
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
Kwon et al. Effective parameter estimation methods for an excitnet model in generative text-to-speech systems
Nazir et al. Deep learning end to end speech synthesis: A review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination