CN112002303A - End-to-end speech synthesis training method and system based on knowledge distillation - Google Patents

End-to-end speech synthesis training method and system based on knowledge distillation

Info

Publication number
CN112002303A
CN112002303A (application CN202010718085.2A)
Authority
CN
China
Prior art keywords
training
model
acoustic characteristic
acoustic
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010718085.2A
Other languages
Chinese (zh)
Other versions
CN112002303B (en)
Inventor
贺来朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010718085.2A
Publication of CN112002303A
Application granted
Publication of CN112002303B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a knowledge distillation-based end-to-end speech synthesis training method and system, wherein the method comprises the following steps: Step 1: acquiring original training data; Step 2: training a teacher model by using the original training data; Step 3: training a student model by taking the acoustic feature parameters predicted by the teacher model as training data; Step 4: performing end-to-end speech synthesis by using the trained student model. The method adopts knowledge distillation: a teacher model is trained, the acoustic feature parameters predicted by the teacher model are used as input to train a student model, and the trained student model finally performs end-to-end speech synthesis, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.

Description

End-to-end speech synthesis training method and system based on knowledge distillation
Technical Field
The invention relates to the technical field of speech synthesis, in particular to an end-to-end speech synthesis training method and system based on knowledge distillation.
Background
At present, an end-to-end speech synthesis system generally comprises an acoustic feature parameter prediction module and a synthesizer module. The acoustic feature parameter prediction module generally adopts a sequence-to-sequence modeling method and comprises sub-modules such as Embedding, Encoder-Decoder and Post-Net. The synthesizer module typically employs a vocoder based on acoustic signal processing, or a neural network vocoder. The original training data used for training an end-to-end synthesis system comprise audio data and the corresponding pronunciation texts; the acoustic feature parameter prediction module is trained on the pronunciation text data and the acoustic feature parameters extracted from the audio.
During training, the decoding sub-module (Decoder) of the acoustic feature parameter prediction module takes the GT acoustic feature parameters of the previous frame as the input for the current frame; during testing, it takes the Decoder's predicted output for the previous frame as the input for the current frame. Because model predictions always contain errors, using the GT acoustic feature parameters during training but the model-predicted feature parameters during testing creates a mismatch, which degrades the prediction accuracy of out-of-set acoustic feature parameters at test time and, in turn, the perceptual quality of out-of-set synthesized speech.
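To make the mismatch concrete, the following is a minimal PyTorch-style sketch of the two decoding regimes (this illustrates the general problem, not the patent's implementation; the decoder interface with go_frame and step methods is hypothetical):

```python
import torch

def run_decoder(decoder, memory, gt_frames=None, n_frames=100):
    """Autoregressive decoding; passing gt_frames enables teacher forcing."""
    prev = decoder.go_frame(memory)          # hypothetical all-zero start frame
    outputs = []
    steps = len(gt_frames) if gt_frames is not None else n_frames
    for t in range(steps):
        frame = decoder.step(prev, memory)   # predict the current acoustic frame
        outputs.append(frame)
        if gt_frames is not None:
            prev = gt_frames[t]              # training: next input is the GT frame
        else:
            prev = frame                     # testing: next input is the model's own prediction
    return torch.stack(outputs)
```

At test time the decoder consumes its own, slightly erroneous predictions, a condition it never saw during training; the errors compound frame by frame, which is the mismatch the method described below targets.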
Disclosure of Invention
The invention provides a knowledge distillation-based end-to-end speech synthesis training method and system, which avoid the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention provides an end-to-end speech synthesis training method based on knowledge distillation, which comprises the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model by using the original training data;
Step 3: training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis by using the trained student model.
Further, in step 1, the original training data comprise the training audio and the pronunciation text corresponding to the training audio.
Further, step 2 (training a teacher model by using the original training data) executes the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the teacher model.
Further, in step S22, when training the decoding sub-module (Decoder) of the teacher model, the GT acoustic feature parameters of the current frame are used as the target output, and the GT acoustic feature parameters of the previous frame are used as the input.
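A hedged sketch of one such teacher-forced training step; the encode/decode interface, the MSE loss and the tensor layout (frames along dimension 0) are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def teacher_train_step(teacher, optimizer, text_ids, gt_frames):
    """One teacher-forced update: input is GT frame t-1, target is GT frame t."""
    memory = teacher.encode(text_ids)                    # text -> encoder states
    go_frame = torch.zeros_like(gt_frames[:1])           # all-zero start frame
    decoder_in = torch.cat([go_frame, gt_frames[:-1]])   # GT sequence shifted right by one
    pred = teacher.decode(decoder_in, memory)            # predicted frames 0..T-1
    loss = F.mse_loss(pred, gt_frames)                   # target: current GT frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```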
Further, step 3 (training a student model by taking the acoustic feature parameters predicted by the teacher model as training data) executes the following steps:
Step S31: inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode, obtaining the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the student model.
Further, in step S32, when training the decoding sub-module of the student model, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output.
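The student step mirrors the teacher's, except that the decoder input now comes from the teacher's first GTA acoustic feature parameters while the target remains the GT frame; a sketch under the same assumed interface as above:

```python
import torch
import torch.nn.functional as F

def student_train_step(student, optimizer, text_ids, gta_frames, gt_frames):
    """Input is the teacher's GTA frame t-1 (what the decoder will actually see
    at test time); the target is still the GT frame t."""
    memory = student.encode(text_ids)
    go_frame = torch.zeros_like(gta_frames[:1])
    decoder_in = torch.cat([go_frame, gta_frames[:-1]])  # predicted frames as input
    pred = student.decode(decoder_in, memory)
    loss = F.mse_loss(pred, gt_frames)                   # still supervised by GT frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student learns to map slightly imperfect previous frames to clean current frames, its training condition matches what its decoder will actually see at test time.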
Further, step 4 (performing end-to-end speech synthesis by using the trained student model) executes the following steps:
Step S41: the student model is used as the acoustic feature parameter prediction model; the in-set training pronunciation text is input into the student model, and the in-set acoustic feature parameters are predicted in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: a neural network vocoder is trained by using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: end-to-end speech synthesis is performed by using the neural network vocoder as the speech synthesizer.
The end-to-end speech synthesis training method based on knowledge distillation provided by the embodiment of the invention has the following beneficial effects: by adopting knowledge distillation, a teacher model is trained, the acoustic feature parameters predicted by the teacher model are used as input to train a student model, and the trained student model finally performs end-to-end speech synthesis, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention also provides an end-to-end speech synthesis training system based on knowledge distillation, comprising:
the acquisition module is used for acquiring original training data;
the teacher model training module is used for training a teacher model by using the original training data;
the student model training module is used for training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
and the speech synthesis module is used for performing end-to-end speech synthesis by using the trained student model.
Further, the teacher model training module comprises:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and the teacher model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, and using the trained acoustic feature parameter prediction model as the teacher model.
Further, the student model training module comprises:
the first GTA acoustic feature parameter prediction unit is used for inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and the student model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data, and taking the trained acoustic feature parameter prediction model as the student model.
The end-to-end speech synthesis training system based on knowledge distillation provided by the embodiment of the invention has the following beneficial effects: using knowledge distillation, the teacher model training module trains the teacher model, the student model training module trains the student model by taking the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is performed by the trained student model, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart of an end-to-end speech synthesis training method based on knowledge distillation according to an embodiment of the present invention;
FIG. 2 is a block diagram of an end-to-end speech synthesis training system based on knowledge distillation according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
An embodiment of the present invention provides an end-to-end speech synthesis training method based on knowledge distillation. As shown in FIG. 1, the method performs the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model by using the original training data;
Step 3: training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis by using the trained student model.
The working principle of the technical scheme is as follows: the inventor found that the out-of-set synthesis quality of a conventional end-to-end synthesis system degrades significantly, and an important reason is the mismatch between model training and testing. The decoding sub-module (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameters of the previous frame as input during training, but uses the Decoder's prediction for the previous frame as the current input during testing. This mismatch degrades the prediction accuracy of out-of-set acoustic feature parameters at test time and, in turn, the perceptual quality of out-of-set synthesized speech.
The invention applies the knowledge distillation principle to the training of an end-to-end speech synthesis system: after the original training data are obtained, a teacher model is trained on the original training data, a student model is then trained by taking the feature parameters predicted by the teacher model as training data, and finally the trained student model predicts the acoustic feature parameters for end-to-end speech synthesis.
In step 1, the original training data comprise the training audio and the pronunciation text corresponding to the training audio.
The beneficial effects of the above technical scheme are as follows: by adopting knowledge distillation, a teacher model is trained, the acoustic feature parameters predicted by the teacher model are used as input to train a student model, and the trained student model finally performs end-to-end speech synthesis, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
In one embodiment, step 2 (training a teacher model by using the original training data) executes the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the teacher model.
The working principle of the technical scheme is as follows: the acoustic feature parameters extracted from the training audio are referred to as GT (Ground Truth) acoustic feature parameters.
Further, in step S22, when training the decoding sub-module (Decoder) of the teacher model, the GT acoustic feature parameters of the current frame are used as the target output, and the GT acoustic feature parameters of the previous frame are used as the input.
The beneficial effects of the above technical scheme are as follows: specific steps are provided for training the teacher model by using the original training data.
In one embodiment, step 3 (training a student model by taking the acoustic feature parameters predicted by the teacher model as training data) executes the following steps:
Step S31: inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode, obtaining the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the student model.
The working principle of the technical scheme is as follows: in the GTA (Ground Truth Aligned) method, the decoding sub-module (Decoder) runs inference with the GT acoustic feature parameters of the previous frame as input to predict the acoustic features of the current frame. The in-set acoustic feature parameters generated in GTA mode are referred to as the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Aligned mode, the GT acoustic features and the predicted acoustic feature parameters have the same duration (the same number of frames), which solves the problem of aligning the data in duration.
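A sketch of GTA-mode generation with the trained teacher; the decode_step method is a hypothetical per-frame interface assumed for illustration:

```python
import torch

@torch.no_grad()
def generate_gta(teacher, text_ids, gt_frames):
    """GTA inference: keep the model's predictions, but always feed the GT frame
    of the previous step, so the output length equals the GT length."""
    memory = teacher.encode(text_ids)
    prev = torch.zeros_like(gt_frames[0])           # all-zero "go" frame
    gta_frames = []
    for t in range(len(gt_frames)):
        frame = teacher.decode_step(prev, memory)   # the prediction is what we keep
        gta_frames.append(frame)
        prev = gt_frames[t]                         # GT drives the next step
    return torch.stack(gta_frames)                  # frame-aligned with gt_frames
```

This frame-level alignment is what allows step S32 to pair each first-GTA frame (decoder input) with the GT frame of the same time index (target output).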
Further, in step S32, when training the decoding sub-module of the student model, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output.
The beneficial effects of the above technical scheme are as follows: specific steps are provided for training the student model by taking the feature parameters predicted by the teacher model as training data, so that the GT acoustic features and the predicted acoustic feature parameters are aligned in duration, which solves the problem of data duration alignment.
In one embodiment, step 4 (performing end-to-end speech synthesis by using the trained student model) executes the following steps:
Step S41: the student model is used as the acoustic feature parameter prediction model; the in-set training pronunciation text is input into the student model, and the in-set acoustic feature parameters are predicted in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: a neural network vocoder is trained by using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: end-to-end speech synthesis is performed by using the neural network vocoder as the speech synthesizer.
The working principle of the technical scheme is as follows: first, the student model obtained in step 3 takes the in-set training text as input and generates the in-set acoustic feature parameters in GTA mode; then, a neural network vocoder is trained by taking the training audio in the original training data and the second GTA acoustic feature parameters predicted by the student model as input; finally, the student model serves as the acoustic feature parameter prediction model and the neural network vocoder serves as the synthesizer, which together form the end-to-end speech synthesis system that is finally used.
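A hedged end-to-end sketch of this step, assuming a generic neural network vocoder that maps an acoustic feature sequence to a waveform; the L1 loss, the is_stop check and the n_mels attribute are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def vocoder_train_step(vocoder, optimizer, gta_frames, waveform):
    """Condition the vocoder on the student's GTA features; supervise with audio."""
    pred_wave = vocoder(gta_frames)
    loss = F.l1_loss(pred_wave, waveform)   # loss choice is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def synthesize(student, vocoder, text_ids, max_frames=2000):
    """Final system: the student runs free (feeding back its own predictions),
    and the vocoder renders the predicted features to a waveform."""
    memory = student.encode(text_ids)
    prev = torch.zeros(1, student.n_mels)   # hypothetical feature dimension
    frames = []
    for _ in range(max_frames):
        prev = student.decode_step(prev, memory)
        frames.append(prev)
        if student.is_stop(prev, memory):   # hypothetical stop-token check
            break
    return vocoder(torch.cat(frames))
```

Training the vocoder on the student's GTA features rather than on GT features keeps the vocoder's training inputs consistent with the features the student will actually produce at synthesis time.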
The beneficial effects of the above technical scheme are as follows: specific steps are provided for performing end-to-end speech synthesis by using the trained student model.
As shown in FIG. 2, an embodiment of the present invention provides an end-to-end speech synthesis training system based on knowledge distillation, comprising:
an obtaining module 201, configured to obtain original training data;
a teacher model training module 202, configured to train a teacher model by using the original training data;
a student model training module 203, configured to train a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
and the speech synthesis module 204 is configured to perform end-to-end speech synthesis by using the trained student model.
The working principle of the technical scheme is as follows: the inventor found that the out-of-set synthesis quality of a conventional end-to-end synthesis system degrades significantly, and an important reason is the mismatch between model training and testing. The decoding sub-module (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameters of the previous frame as input during training, but uses the Decoder's prediction for the previous frame as the current input during testing. This mismatch degrades the prediction accuracy of out-of-set acoustic feature parameters at test time and, in turn, the perceptual quality of out-of-set synthesized speech.
The invention applies the knowledge distillation principle to the training of the end-to-end speech synthesis system: the obtaining module 201 acquires the original training data; the teacher model training module 202 trains the teacher model by using the original training data; the student model training module 203 trains the student model by taking the acoustic feature parameters predicted by the teacher model as training data; and the speech synthesis module 204 predicts the acoustic feature parameters with the trained student model to perform end-to-end speech synthesis.
The original training data acquired by the obtaining module 201 comprise the training audio and the pronunciation text corresponding to the training audio.
The beneficial effects of the above technical scheme are as follows: using knowledge distillation, the teacher model training module trains the teacher model, the student model training module trains the student model by taking the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is performed by the trained student model, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
In one embodiment, the teacher model training module 202 includes:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and the teacher model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, and using the trained acoustic feature parameter prediction model as the teacher model.
The working principle of the technical scheme is as follows: the acoustic feature parameters extracted from the training audio by the GT acoustic feature parameter extraction unit are referred to as GT (Ground Truth) acoustic feature parameters.
Further, when training the decoding sub-module (Decoder) of the teacher model, the teacher model training unit uses the GT acoustic feature parameters of the current frame as the target output and the GT acoustic feature parameters of the previous frame as the input.
The beneficial effects of the above technical scheme are: by means of the GT acoustic feature parameter extraction unit and the teacher model training unit, training of the teacher model can be achieved.
In one embodiment, the student model training module 203 comprises:
the first GTA acoustic characteristic parameter prediction unit is used for inputting the in-set training text into the teacher model, and predicting and generating in-set acoustic characteristic parameters in a GTA mode to obtain first GTA acoustic characteristic parameters;
and the student model training unit is used for training an acoustic characteristic parameter prediction model by adopting the pronunciation text, the GT acoustic characteristic parameter and the first GTA acoustic characteristic parameter as training data, and taking the trained acoustic characteristic parameter prediction model as the student model.
The working principle of the technical scheme is as follows: in the GTA (Ground Truth Aligned) method, the decoding sub-module (Decoder) runs inference with the GT acoustic feature parameters of the previous frame as input to predict the acoustic features of the current frame. The in-set acoustic feature parameters generated in GTA mode are referred to as the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Aligned mode, the GT acoustic features and the predicted acoustic feature parameters have the same duration, which solves the problem of aligning the data in duration.
Further, when training the decoding sub-module of the student model, the student model training unit takes the first GTA acoustic feature parameters of the previous frame as the input and the GT acoustic feature parameters of the current frame as the target output.
The beneficial effects of the above technical scheme are as follows: by means of the first GTA acoustic feature parameter prediction unit and the student model training unit, the student model can be trained, and the GT acoustic features and the predicted acoustic feature parameters can be aligned in duration, which solves the problem of data duration alignment.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A knowledge-distillation-based end-to-end speech synthesis training method, characterized in that the method performs the steps of:
Step 1: acquiring original training data;
Step 2: training a teacher model by using the original training data;
Step 3: training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis by using the trained student model.
2. The method of claim 1, wherein in step 1, the original training data include training audio and the pronunciation text corresponding to the training audio.
3. The method of claim 2, wherein step 2 (training a teacher model by using the original training data) executes the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, wherein the trained acoustic feature parameter prediction model serves as the teacher model.
4. The method according to claim 3, wherein in step S22, when training the decoding sub-module of the teacher model, the GT acoustic feature parameters of the current frame are used as the target output and the GT acoustic feature parameters of the previous frame are used as the input.
5. The method of claim 3, wherein step 3 (training a student model by taking the acoustic feature parameters predicted by the teacher model as training data) executes the following steps:
Step S31: inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data, wherein the trained acoustic feature parameter prediction model serves as the student model.
6. The method of claim 5, wherein in step S32, when training the decoding sub-module of the student model, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output.
7. The method of claim 1, wherein step 4 (performing end-to-end speech synthesis by using the trained student model) executes the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder by using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: performing end-to-end speech synthesis by using the neural network vocoder as the speech synthesizer.
8. A knowledge-distillation-based end-to-end speech synthesis training system, comprising:
the acquisition module is used for acquiring original training data;
the teacher model training module is used for training a teacher model by using the original training data;
the student model training module is used for training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
and the speech synthesis module is used for performing end-to-end speech synthesis by using the trained student model.
9. The system of claim 8, wherein the teacher model training module comprises:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and the teacher model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, and using the trained acoustic feature parameter prediction model as the teacher model.
10. The system of claim 8, wherein the student model training module comprises:
the first GTA acoustic feature parameter prediction unit is used for inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and the student model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data, and taking the trained acoustic feature parameter prediction model as the student model.
Application CN202010718085.2A, priority date 2020-07-23, filing date 2020-07-23: End-to-end speech synthesis training method and system based on knowledge distillation. Status: Active. Granted publication: CN112002303B.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010718085.2A CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010718085.2A CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Publications (2)

Publication Number Publication Date
CN112002303A (en) 2020-11-27
CN112002303B CN112002303B (en) 2023-12-15

Family

ID=73467751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010718085.2A Active CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN112002303B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180732A1 (en) * 2017-10-19 2019-06-13 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180732A1 (en) * 2017-10-19 2019-06-13 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
R. Liu et al., "Teacher-Student Training For Robust Tacotron-Based TTS," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6274-6278.
Z.-C. Liu et al., "Statistical Parametric Speech Synthesis Using Generalized Distillation Framework," IEEE Signal Processing Letters, vol. 25, no. 5, pp. 695-699.
Liu Zhengchen, "Research on Speech Generation Methods Combining Articulatory Features and Deep Learning" (结合发音特征与深度学习的语音生成方法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 10, pp. 136-28.

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022141842A1 (en) * 2020-12-29 2022-07-07 平安科技(深圳)有限公司 Deep learning-based speech training method and apparatus, device, and storage medium
CN113611311A (en) * 2021-08-20 2021-11-05 天津讯飞极智科技有限公司 Voice transcription method, device, recording equipment and storage medium
CN115376484A (en) * 2022-08-18 2022-11-22 天津大学 Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction

Also Published As

Publication number Publication date
CN112002303B (en) 2023-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant