CN114495898A - Training method and system for unified speech synthesis and speech conversion - Google Patents

Training method and system for unified speech synthesis and speech conversion

Info

Publication number
CN114495898A
Authority
CN
China
Prior art keywords
speech
information
speaker
conversion
content information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210395964.5A
Other languages
Chinese (zh)
Other versions
CN114495898B (en)
Inventor
陶建华
汪涛
易江燕
傅睿博
张震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202210395964.5A
Publication of CN114495898A
Application granted
Publication of CN114495898B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a training method and system for unified speech synthesis and speech conversion. The method comprises the following steps: decoupling the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information, where the content information is speaker-independent linguistic information, the speaker information comprises the characteristics of the speaker, and the prosody information indicates how the speaker utters the content and reflects the rhythm of the speech; and feeding the extracted content information, speaker information, and prosody information into a decoding task to reconstruct the speech. The proposed scheme unifies the speech synthesis and speech conversion models, avoiding the difficulty of building them separately, and its use of unlabeled speech improves the performance of both speech synthesis and speech conversion.

Description

Training method and system for unified speech synthesis and speech conversion
Technical Field
The invention belongs to the technical field of voice cloning, and particularly relates to a training method and system for unified speech synthesis and speech conversion.
Background
Cloning the voice of a target speaker is an attractive technology with applications in many scenarios, such as entertainment content creation, personalized mobile assistants, and the security field. The ideal voice-cloning setting references only a single utterance from a previously unseen target speaker and then synthesizes arbitrary speech in that speaker's voice; this is known as one-shot (single-sample) voice cloning. In speech research, speech synthesis and voice conversion are the two mainstream ways to implement voice cloning, and the two technologies have historically been studied and developed separately as independent tasks.
TTS: text-to-speech, i.e., speech synthesis;
VC: voice conversion;
Although TTS and VC are two important methods for voice cloning, they have historically been studied and developed separately as independent tasks, with little interaction between them. The difficulty lies in the fact that the two tasks use different representations of speech content. Specifically, the speech content in TTS is obtained from text, and text and speech are two sequences of unequal length, which usually require an attention mechanism to align them. However, the attention mechanism is often influenced by speaker information, so a speaker-independent representation of the speech content cannot be learned. In VC, by contrast, the source and target speech are aligned in content, so the speech content can be extracted directly from the source speech, unlike in TTS.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a training method and system for unified speech synthesis and speech conversion.
The invention discloses a training method for unified speech synthesis and speech conversion in a first aspect, which comprises the following steps:
step S1, decoupling the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and step S2, inputting the extracted content information, speaker information, and prosody information into a decoding task to obtain the reconstructed speech.
According to the method of the first aspect of the invention, in the subtask of extracting the content information, for speech synthesis the source of the speech content is text: the text is encoded using a text encoder to obtain a context representation, and the context representation is upsampled according to the duration information of the phonemes to obtain the content information of the speech.
According to the method of the first aspect of the present invention, in the subtask of extracting the content information, for speech conversion the content information of the speech is extracted from the source speech directly using a content encoder, since the source speech is aligned with the target speech.
According to the method of the first aspect of the invention, the encoder for extracting the speaker information is shared between the tasks of speech synthesis and speech conversion, and the speaker information is extracted directly from speech without text.
According to the method of the first aspect of the present invention, the encoder for extracting the prosody information is shared between the tasks of speech synthesis and speech conversion, and fundamental frequency information is directly extracted from speech as the prosody information.
According to the method of the first aspect of the present invention, in the training phase, the content information of the speech and the speaker information are used as input to predict the fundamental frequency information.
According to the method of the first aspect of the invention, the overall loss function of speech synthesis and speech conversion comprises three parts, the loss function of speech synthesis, the loss function of speech conversion and the additional content information loss function, i.e.,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the method for constructing the additional content information loss function comprises the following steps:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
The second aspect of the present invention discloses a training system for unified speech synthesis and speech conversion, the system comprising:
the first processing module is configured to decouple the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and the second processing module is configured to input the extracted content information, speaker information, and prosody information into a decoding task to obtain the reconstructed speech.
According to the system of the second aspect of the present invention, the first processing module is configured such that, in the subtask of extracting the content information, for speech synthesis the source of the speech content is text: the text is encoded using a text encoder to obtain a context representation, and the context representation is upsampled according to the duration information of the phonemes to obtain the content information of the speech.
According to the system of the second aspect of the present invention, the first processing module is configured such that, in the subtask of extracting the content information, for speech conversion the content information of the speech is extracted from the source speech directly using a content encoder, since the source speech is aligned with the target speech.
According to the system of the second aspect of the present invention, the first processing module is configured such that the encoder for extracting the prosody information is shared between the tasks of speech synthesis and speech conversion, and fundamental frequency information is directly extracted from speech as the prosody information.
According to the system of the second aspect of the present invention, the first processing module is configured to predict the fundamental frequency information by using the content information of the speech and the speaker information as input in the training stage.
According to the system of the second aspect of the invention, the first processing module is configured such that the total loss function of speech synthesis and speech conversion comprises three parts, namely a loss function of speech synthesis, a loss function of speech conversion, and an additional content information loss function, i.e.,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the constructing of the additional content information loss function comprises:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory stores a computer program, and the processor executes the computer program to realize the steps of the training method for unified speech synthesis and speech conversion in any one of the first aspect of the disclosure.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a method for training unified speech synthesis and speech conversion according to any one of the first aspect of the present disclosure.
The scheme provided by the invention has the following beneficial effects:
1. The speech synthesis and speech conversion models are unified, avoiding the difficulty of building them separately.
2. The use of unlabeled speech improves the performance of both speech synthesis and speech conversion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a training method for unified speech synthesis and speech conversion according to an embodiment of the present invention;
FIG. 2 is a block diagram of a training method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a subtask according to an embodiment of the invention;
FIG. 4 is a block diagram of a unified speech synthesis and speech conversion training system according to an embodiment of the present invention;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first aspect of the invention discloses a training method for unified speech synthesis and speech conversion. FIG. 1 is a flowchart of a training method for unified speech synthesis and speech conversion according to an embodiment of the present invention; as shown in FIG. 1, the method includes:
step S1, decoupling the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and step S2, inputting the extracted content information, speaker information, and prosody information into a decoding task to obtain the reconstructed speech.
In step S1, the encoding task of speech synthesis and speech conversion is decoupled into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information. The content information is speaker-independent linguistic information; the speaker information comprises speaker characteristics such as timbre and volume; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech.
In some embodiments, modeling accurate speech content information is important for generating intelligible speech signals. Since TTS and VC take different types of input signals, their methods of extracting content information differ. In the subtask of extracting the content information, for speech synthesis the source of the speech content is text: the text is encoded using a text encoder to obtain a context representation, and the context representation is upsampled according to the duration information of the phonemes to obtain the content information of the speech, as sketched below. For speech conversion, a content encoder is used to extract the content information directly from the source speech, since the source speech is aligned with the target speech.
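As a non-limiting illustration, the duration-based upsampling can be sketched in Python/PyTorch as follows; the function and variable names are illustrative and not taken from the patent:

import torch

def upsample_by_duration(context, durations):
    # context:   (num_phonemes, hidden_dim) output of the text encoder
    # durations: (num_phonemes,) integer number of frames per phoneme
    # returns:   (total_frames, hidden_dim) frame-level content information
    # Each phoneme vector is repeated `duration` times, aligning the
    # text-derived content with the frame rate of the speech features.
    return torch.repeat_interleave(context, durations, dim=0)

# Example: 3 phonemes lasting 2, 4 and 3 frames yield 9 content frames.
context = torch.randn(3, 256)
durations = torch.tensor([2, 4, 3])
assert upsample_by_duration(context, durations).shape == (9, 256)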
In some embodiments, the encoder for extracting the speaker information is shared between the speech synthesis and speech conversion tasks, and the speaker information is extracted directly from speech without text; the large amount of speech without text labels in the VC task can therefore be used for training, which helps improve the transfer-learning capability of the model.
In some embodiments, fundamental frequency information is extracted as the prosody information, since the fundamental frequency reflects the rhythm of the speech.
In some embodiments, the encoder for extracting the prosody information is shared between the speech synthesis and speech conversion tasks, and fundamental frequency information is extracted directly from speech as the prosody information; in the training stage, the content information of the speech and the speaker information are used as input to predict the fundamental frequency information, as sketched below.
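A minimal sketch of such a pitch predictor follows; the patent fixes only its inputs (content plus speaker information) and its target (F0), so the two-layer convolutional architecture here is an assumption for illustration:

import torch
import torch.nn as nn

class PitchPredictor(nn.Module):
    # Predicts a frame-level F0 contour from content plus speaker information.
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(dim, 1)  # one F0 value per frame

    def forward(self, content, speaker):
        # The speaker embedding is broadcast over time and added, mirroring
        # the additive combination of representations used in the method.
        x = content + speaker.unsqueeze(1)               # (B, T, dim)
        x = self.net(x.transpose(1, 2)).transpose(1, 2)  # 1-D convs over time
        return self.proj(x).squeeze(-1)                  # (B, T)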
In some embodiments, the overall loss function for speech synthesis and speech conversion includes three parts, a loss function for speech synthesis, a loss function for speech conversion, and an additional content information loss function, i.e.,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the method for constructing the additional content information loss function comprises the following steps:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
Detailed description of the preferred embodiment
A training process:
1. First, train on the input TTS data, i.e., text x and its corresponding speech y.
Extract the fundamental frequency information F0 and the speaker information S from y, and derive the prosody information P from F0;
extract the spoken content information C from x.
The extraction process of these three kinds of information can be formulated as:
C = VQ(text_encoder(x))
S = speaker_encoder(y)
P = prosody_encoder(y)
VQ(·) denotes vector quantization;
text_encoder(·) denotes the text encoder;
speaker_encoder(·) denotes the speaker information encoder;
prosody_encoder(·) denotes the prosody information encoder.
Finally, the three representations are added together and fed into the decoder module to reconstruct y:
y’ = decoder(C + S + P)
F0’ = pitch_predictor(C+S)
The loss of this process consists of two parts: the reconstruction loss of y and the reconstruction loss of the fundamental frequency:
L_TTS = L_recon(y, y') + L_recon(F0, F0')
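A minimal sketch of this TTS training step, assuming mel-spectrogram targets and L1/MSE reconstruction distances (the patent states the loss terms but not the distance measures); m is a dictionary holding the modules named in the formulas above:

import torch.nn.functional as F

def tts_step(x, y, f0, m):
    C = m["vq"](m["text_encoder"](x))     # C = VQ(text_encoder(x))
    S = m["speaker_encoder"](y)           # S = speaker_encoder(y)
    P = m["prosody_encoder"](y)           # P = prosody_encoder(y)
    y_hat = m["decoder"](C + S + P)       # y' = decoder(C + S + P)
    f0_hat = m["pitch_predictor"](C + S)  # F0' = pitch_predictor(C + S)
    # L_TTS = L_recon(y, y') + L_recon(F0, F0')
    loss = F.l1_loss(y_hat, y) + F.mse_loss(f0_hat, f0)
    return loss, C  # C is reused for the content loss on labeled data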
2. Then train on the VC data, using only the speech data y (y includes both unlabeled and labeled data).
Unlike in TTS, the speech content information is here extracted directly from y:
C’ = VQ(content_encoder(y))
S’ = speaker_encoder(y)
P’ = prosody_encoder(y)
Finally, the three representations are added together and fed into the decoder module to reconstruct y:
y’ = decoder(C’ + S’ + P’)
F0’ = pitch_predictor(C’+S’)
The loss of this process likewise consists of two parts: the reconstruction loss of y and the reconstruction loss of the fundamental frequency:
L_VC = L_recon(y, y') + L_recon(F0, F0')
3. Content information loss for TTS and VC.
For labeled data y, different spoken content representations C and C' are obtained in the TTS flow and the VC flow. To unify the TTS and VC frameworks, two measures are taken: on the one hand, the same VQ codebook is used when extracting C and C'; on the other hand, an extra content loss supervises the spoken content on the labeled speech (see the sketch below). The loss is as follows:
L_content = ||C - C'||
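The shared vector quantization can be sketched as follows; the nearest-neighbour lookup with a straight-through gradient is one common realization and is assumed here for illustration, since the patent specifies only that a single codebook is shared by both flows:

import torch
import torch.nn as nn

class SharedVQ(nn.Module):
    # One codebook quantizes the content from both the TTS and the VC flow.
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                        # z: (B, T, dim)
        w = self.codebook.weight
        flat = z.reshape(-1, z.size(-1))
        # squared Euclidean distance from every frame to every codebook entry
        d = flat.pow(2).sum(1, keepdim=True) - 2 * flat @ w.t() + w.pow(2).sum(1)
        q = self.codebook(d.argmin(1)).view_as(z)  # nearest code per frame
        return z + (q - z).detach()                # straight-through gradient

vq = SharedVQ()
z_tts = torch.randn(2, 100, 256)  # stand-in for the text-encoder output
z_vc = torch.randn(2, 100, 256)   # stand-in for the content-encoder output
C, C_prime = vq(z_tts), vq(z_vc)  # the same codebook serves both flows
content_loss = (C - C_prime).abs().mean()  # L_content = ||C - C'||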
Therefore, the total loss consists of three parts:
L_total = L_TTS + L_VC + L_content
testing phase
Depending on the speech generation task required, i.e., whether a speech synthesis task or a voice conversion task is needed, different inference paths are used.
Speech synthesis task: inference is performed by synthesizing the speech of the target speaker in the manner shown on the left side of fig. 2.
Voice conversion task: inference is performed by synthesizing the speech of the target speaker in the manner shown on the right side of fig. 2.
Implementation example:
the proposed unified TTS and VC training framework is shown in fig. 2. Specifically, the structure of each sub-module can be as shown in fig. 3. The number of feedforward transformer (FFT) blocks in the text encoder is 2 and in the decoder block is 6. In each FFT block, the dimension of the hidden state is 256. The kernel size of all one-dimensional convolutions is set to 3. The rate of conjugate was set at 0.5. The last linear layer in the decoder has a dimension of 80, which is consistent with the Mel spectral dimension. The size of the last linear layer in the encoder (text encoder, prosody information encoder, content information encoder) is 256. The Adam optimizer is used to update the parameters. The initial learning rate was 0.001, and the learning rate decreased exponentially. In the inference phase, hifigan is used as a vocoder.
In addition, a duration model needs to be trained separately, which is very common in speech synthesis tasks; in this example it can be implemented with 3 fully connected layers, as sketched below.
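A hedged sketch of such a duration model; the patent specifies only the three fully connected layers, so the layer widths and the per-phoneme scalar output are assumptions:

import torch.nn as nn

class DurationModel(nn.Module):
    # Predicts one duration value per phoneme from the text-encoder output.
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, context):               # (B, num_phonemes, dim)
        return self.net(context).squeeze(-1)  # (B, num_phonemes)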
After the model is constructed as described above, it is trained according to the training process described in the first part of this implementation. Then, following the testing phase described in the second part, both the speech synthesis and the speech conversion tasks can be realized within the same framework.
In conclusion, the solution proposed by the present invention:
1. unifies the speech synthesis and speech conversion models, avoiding the difficulty of building them separately;
2. uses unlabeled speech to improve the performance of both speech synthesis and speech conversion.
The second aspect of the invention discloses a training system for unified speech synthesis and speech conversion. FIG. 4 is a block diagram of a training system for unified speech synthesis and speech conversion according to an embodiment of the present invention; as shown in FIG. 4, the system 100 includes:
a first processing module 101 configured to decouple the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and a second processing module 102 configured to input the extracted content information, speaker information, and prosody information into a decoding task to obtain the reconstructed speech.
According to the system of the second aspect of the present invention, the first processing module 101 is configured such that, in the subtask of extracting the content information, for speech synthesis the source of the speech content is text: the text is encoded using a text encoder to obtain a context representation, and the context representation is upsampled according to the duration information of the phonemes to obtain the content information of the speech.
According to the system of the second aspect of the present invention, the first processing module 101 is configured such that, in the subtask of extracting the content information, for speech conversion the content information of the speech is extracted from the source speech directly using a content encoder, since the source speech is aligned with the target speech.
According to the system of the second aspect of the present invention, the first processing module 101 is configured such that the encoder for extracting the prosody information is shared between the tasks of speech synthesis and speech conversion, and fundamental frequency information is directly extracted from speech as the prosody information.
According to the system of the second aspect of the present invention, the first processing module 101 is configured to predict the fundamental frequency information by using the content information of the speech and the speaker information as input in the training stage.
According to the system of the second aspect of the invention, the first processing module 101 is configured such that the total loss function of speech synthesis and speech conversion comprises three parts, namely a loss function of speech synthesis, a loss function of speech conversion, and an additional content information loss function, i.e.,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the constructing of the additional content information loss function comprises:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory stores a computer program, and the processor executes the computer program to realize the steps of the training method for unified speech synthesis and speech conversion in any one of the first aspect of the disclosure.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 5, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device, which are connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, Near Field Communication (NFC) or other technologies. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
It will be understood by those skilled in the art that the structure shown in fig. 5 is only a partial block diagram related to the technical solution of the present disclosure, and does not constitute a limitation of the electronic device to which the solution of the present application is applied, and a specific electronic device may include more or less components than those shown in the drawings, or combine some components, or have a different arrangement of components.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a method for training unified speech synthesis and speech conversion according to any of the first aspect of the present disclosure.
It should be noted that the technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this description. The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for unified speech synthesis and speech conversion training, the method comprising:
step S1, decoupling the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; wherein the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; and the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and step S2, inputting the extracted content information, speaker information, and prosody information into a decoding task to obtain reconstructed speech.
2. The method of claim 1, wherein in the subtask of extracting the content information, for speech synthesis the source of the speech content is text, and a text encoder is used to encode the text to obtain a context representation; the context representation is upsampled according to duration information of the phonemes to obtain the content information of the speech.
3. The method of claim 2, wherein in the sub-task of extracting content information, the content information is extracted from the source speech directly using a content encoder for speech conversion because the source speech is aligned with the target speech.
4. The method of claim 1, wherein the encoder for extracting the speaker information is shared between the speech synthesis and speech conversion tasks, and the speaker information is extracted directly from speech without text.
5. The method of claim 1, wherein the encoder for extracting the prosody information is shared between the speech synthesis and speech conversion tasks, and fundamental frequency information is extracted directly from speech as the prosody information.
6. The method as claimed in claim 5, wherein the fundamental frequency information is predicted by using the content information of the speech and the speaker information as input in the training stage.
7. The method of claim 1, wherein the total loss function of the speech synthesis and the speech conversion comprises three parts, namely a loss function of the speech synthesis, a loss function of the speech conversion and an additional loss function of the content information,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the method for constructing the additional content information loss function comprises the following steps:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion, and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
8. A training system for unified speech synthesis and speech conversion, the system comprising:
the first processing module is configured to decouple the encoding tasks of voice synthesis and voice conversion into three subtasks, namely extraction of content information, extraction of speaker information and extraction of prosody information; the content information is language information irrelevant to a speaker; the speaker information includes: a characteristic of the speaker; the prosodic information indicates how the speaker speaks the content information and reflects the rhythm of the voice;
and the second processing module is configured to input the extracted content information, speaker information and prosody information into a decoding task to obtain restored voice information.
9. An electronic device, comprising a memory storing a computer program and a processor, wherein the processor, when executing the computer program, implements the steps of a training method for unified speech synthesis and speech conversion according to any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of a method for training unified speech synthesis and speech conversion according to any one of claims 1 to 7.
CN202210395964.5A 2022-04-15 2022-04-15 Unified speech synthesis and speech conversion training method and system Active CN114495898B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210395964.5A CN114495898B (en) 2022-04-15 2022-04-15 Unified speech synthesis and speech conversion training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210395964.5A CN114495898B (en) 2022-04-15 2022-04-15 Unified speech synthesis and speech conversion training method and system

Publications (2)

Publication Number Publication Date
CN114495898A 2022-05-13
CN114495898B CN114495898B (en) 2022-07-01

Family

ID=81489542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210395964.5A Active CN114495898B (en) 2022-04-15 2022-04-15 Unified speech synthesis and speech conversion training method and system

Country Status (1)

Country Link
CN (1) CN114495898B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064103A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Chinese voice synthetic method and system based on syllable rhythm restricting relationship
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN101751922A (en) * 2009-07-22 2010-06-23 中国科学院自动化研究所 Text-independent speech conversion system based on HMM model state mapping
CN103247293A (en) * 2013-05-14 2013-08-14 中国科学院自动化研究所 Coding method and decoding method for voice data
CN104112444A (en) * 2014-07-28 2014-10-22 中国科学院自动化研究所 Text message based waveform concatenation speech synthesis method
CN112365878A (en) * 2020-10-30 2021-02-12 广州华多网络科技有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112750446A (en) * 2020-12-30 2021-05-04 标贝(北京)科技有限公司 Voice conversion method, device and system and storage medium
CN112786018A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Speech conversion and related model training method, electronic equipment and storage device
CN112820268A (en) * 2020-12-29 2021-05-18 深圳市优必选科技股份有限公司 Personalized voice conversion training method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114495898B (en) 2022-07-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant