CN112002303B - End-to-end speech synthesis training method and system based on knowledge distillation


Info

Publication number
CN112002303B
CN112002303B
Authority
CN
China
Prior art keywords
training
model
acoustic feature
acoustic
gta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010718085.2A
Other languages
Chinese (zh)
Other versions
CN112002303A (en)
Inventor
贺来朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010718085.2A
Publication of CN112002303A
Application granted
Publication of CN112002303B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides an end-to-end speech synthesis training method and system based on knowledge distillation, wherein the method comprises the following steps: Step 1: acquiring original training data; Step 2: training a teacher model using the original training data; Step 3: training a student model using the acoustic feature parameters predicted by the teacher model as training data; Step 4: performing end-to-end speech synthesis using the trained student model. By adopting the knowledge distillation method, a teacher model is first trained, the acoustic feature parameters predicted by the teacher model are then used as input to train a student model, and the trained student model is finally used to perform end-to-end speech synthesis, which effectively mitigates the degraded listening quality of out-of-set synthesized speech caused by the mismatch between training and testing.

Description

End-to-end speech synthesis training method and system based on knowledge distillation
Technical Field
The invention relates to the technical field of speech synthesis, in particular to an end-to-end speech synthesis training method and system based on knowledge distillation.
Background
At present, an end-to-end speech synthesis system generally comprises an acoustic feature parameter prediction module and a synthesizer module. The acoustic feature parameter prediction module typically adopts a sequence-to-sequence modeling approach and comprises submodules such as Embedding, Encoder-Decoder, and Post-Net. The synthesizer module typically employs a vocoder based on acoustic signal processing, or a neural network vocoder. The original training data for training an end-to-end synthesis system comprises audio data and the corresponding pronunciation text, and the acoustic feature parameter prediction module is trained on the pronunciation text data and the acoustic feature parameters extracted from the audio.
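For illustration only, the following minimal sketch shows how such an acoustic feature parameter prediction module might be organized; the patent names the submodules but not their internals, so the Tacotron-like layer choices and sizes below are assumptions.

```python
# Hypothetical skeleton of the acoustic feature parameter prediction module;
# all layer types and dimensions are illustrative assumptions.
import torch.nn as nn

class AcousticFeaturePredictor(nn.Module):
    def __init__(self, vocab_size=100, dim=256, n_mels=80):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)           # Embedding
        self.encoder = nn.LSTM(dim, dim, batch_first=True)       # Encoder
        self.decoder_cell = nn.LSTMCell(dim + n_mels, dim)       # frame-level Decoder
        self.frame_proj = nn.Linear(dim, n_mels)                 # frame projection
        self.post_net = nn.Conv1d(n_mels, n_mels, 5, padding=2)  # Post-Net
```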
During training, the Decoder submodule of the acoustic feature parameter prediction module takes the ground-truth (GT) acoustic feature parameters of the previous frame as the input for the current frame; during testing, it takes the Decoder's predicted output for the previous frame as the input for the current frame. Because model predictions always contain errors, using the GT acoustic feature parameters as input during training but the model-predicted feature parameters as input during testing creates a mismatch, which degrades the prediction accuracy of the acoustic feature parameters on out-of-set (unseen) text during testing and, in turn, the listening quality of the out-of-set synthesized speech.
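This mismatch can be made concrete with a small sketch. `decoder_step` below stands for a hypothetical one-frame decoder function and `gt_frames` for the GT feature sequence; neither name comes from the patent.

```python
# Illustrative sketch of the train/test input mismatch described above.
import torch

def run_decoder(decoder_step, gt_frames, teacher_forcing: bool):
    outputs = []
    prev = torch.zeros_like(gt_frames[0])  # initial "previous frame"
    for t in range(len(gt_frames)):
        pred = decoder_step(prev)          # predict frame t from frame t-1
        outputs.append(pred)
        # Training feeds the GT frame; testing feeds the model's own prediction,
        # so prediction errors accumulate at test time.
        prev = gt_frames[t] if teacher_forcing else pred
    return torch.stack(outputs)
```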
Disclosure of Invention
The invention provides an end-to-end speech synthesis training method and system based on knowledge distillation, which are used to avoid the degraded listening quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention provides an end-to-end speech synthesis training method based on knowledge distillation, which comprises the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model using the original training data;
Step 3: training a student model using the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis using the trained student model.
Further, in step 1, the original training data includes training audio and the pronunciation text corresponding to the training audio.
Further, step 2, training the teacher model using the original training data, comprises the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
Further, in step S22, when the decoding submodule of the teacher model is trained, the GT acoustic features of the current frame are used as the target output, and the GT acoustic features of the previous frame are used as the input.
Further, step 3, training the student model using the acoustic feature parameters predicted by the teacher model as training data, comprises the following steps:
Step S31: inputting the in-set training text into the teacher model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
Further, in step S32, when the decoding submodule of the student model is trained, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output.
Further, step 4, performing end-to-end speech synthesis using the trained student model, comprises the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as inputs;
Step S43: using the neural network vocoder as the speech synthesizer to perform end-to-end speech synthesis.
The end-to-end speech synthesis training method based on knowledge distillation provided by the embodiment of the invention has the following beneficial effect: by adopting the knowledge distillation method, a teacher model is first trained, the acoustic feature parameters predicted by the teacher model are then used as input to train a student model, and the trained student model is finally used to perform end-to-end speech synthesis, which effectively mitigates the degraded listening quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention also provides an end-to-end speech synthesis training system based on knowledge distillation, which comprises:
an acquisition module, configured to acquire the original training data;
a teacher model training module, configured to train a teacher model using the original training data;
a student model training module, configured to train the student model using the acoustic feature parameters predicted by the teacher model as training data;
and a speech synthesis module, configured to perform end-to-end speech synthesis using the trained student model.
Further, the teacher model training module includes:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and a teacher model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
Further, the student model training module includes:
a first GTA acoustic feature parameter prediction unit, configured to input the in-set training text into the teacher model and to predict and generate the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and a student model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
The end-to-end speech synthesis training system based on knowledge distillation provided by the embodiment of the invention has the following beneficial effect: using the knowledge distillation technique, the teacher model training module trains a teacher model, the student model training module trains a student model using the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is performed with the trained student model, which effectively mitigates the degraded listening quality of out-of-set synthesized speech caused by the mismatch between training and testing.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of an end-to-end speech synthesis training method based on knowledge distillation in an embodiment of the invention;
FIG. 2 is a block diagram of an end-to-end speech synthesis training system based on knowledge distillation in accordance with an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
The embodiment of the invention provides an end-to-end speech synthesis training method based on knowledge distillation which, as shown in FIG. 1, comprises the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model using the original training data;
Step 3: training a student model using the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis using the trained student model.
The working principle of the technical scheme is as follows: the inventor finds that the traditional end-to-end synthesis system has obvious reduction of the synthesis effect outside the set, and one important reason is that the model training is not matched with the test. The decoding submodule (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameter of the previous frame as an input in training, and uses the prediction output of the Decoder of the previous frame as a current input in testing, and the mismatch can cause the prediction precision of the acoustic feature parameter outside the set to be poor in testing, so that the hearing of the synthesized speech outside the set is poor.
The knowledge distillation principle is applied to training of an end-to-end voice synthesis system, after original training data are acquired, a teacher model is trained by utilizing the original training data, and then characteristic parameters predicted by the teacher model are used as training data to train a student model; finally, the trained student model is used for predicting acoustic characteristic parameters so as to perform end-to-end speech synthesis.
In the step 1, the original training data includes training audio and pronunciation text corresponding to the training audio.
The beneficial effects of the technical scheme are as follows: the knowledge distillation method is adopted, a teacher model is trained firstly, then acoustic characteristic parameters predicted by the teacher model are used as input, a student model is trained, and finally the trained student model is used for carrying out end-to-end speech synthesis, so that the problem of poor hearing feeling of the synthesized speech outside the set caused by mismatching between training and testing can be effectively solved.
In one embodiment, step 2, training the teacher model using the original training data, comprises the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
The working principle of the technical scheme is as follows: the acoustic feature parameters extracted from the training audio are called GT (Ground Truth) acoustic feature parameters.
Further, in step S22, when the decoding submodule (Decoder) of the teacher model is trained, the GT acoustic features of the current frame are used as the target output, and the GT acoustic features of the previous frame are used as the input, as sketched below.
The beneficial effect of this technical solution is that it provides the specific steps for training a teacher model using the original training data.
In one embodiment, step 3, training the student model using the acoustic feature parameters predicted by the teacher model as training data, comprises the following steps:
Step S31: inputting the in-set training text into the teacher model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
The working principle of this technical solution is as follows: GTA (Ground Truth Align) means that, during inference in the decoding submodule (Decoder), the GT acoustic feature parameters of the previous frame are used as input to predict the acoustic features of the current frame. The in-set acoustic feature parameters predicted and generated in GTA mode are the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Align mode, the GT acoustic features and the predicted acoustic feature parameters are guaranteed to be aligned in duration, which solves the problem of aligning the data durations.
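Under the same assumed interface, GTA generation might look as follows; the key property is that the predicted sequence has exactly as many frames as the GT sequence.

```python
# Sketch of GTA (Ground Truth Align) generation with the trained teacher:
# every step is conditioned on the GT previous frame. Names are assumptions.
import torch

@torch.no_grad()
def generate_gta(teacher_step, gt_frames):
    preds = []
    prev = torch.zeros_like(gt_frames[0])
    for t in range(len(gt_frames)):
        preds.append(teacher_step(prev))   # prediction for frame t
        prev = gt_frames[t]                # GT input keeps the durations aligned
    return torch.stack(preds)              # first GTA acoustic feature parameters
```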
Further, in step S32, when the decoding submodule of the student model is trained, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output, as sketched below.
The beneficial effect of this technical solution is that it provides the specific steps for training the student model using the feature parameters predicted by the teacher model as training data; the GT acoustic features and the predicted acoustic feature parameters are guaranteed to be aligned in duration, which solves the problem of aligning the data durations.
In one embodiment, step 4, performing end-to-end speech synthesis using the trained student model, comprises the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as inputs;
Step S43: using the neural network vocoder as the speech synthesizer to perform end-to-end speech synthesis.
The working principle of this technical solution is as follows: first, the student model obtained in step 3 is given the in-set training text as input, and the in-set acoustic feature parameters are predicted and generated in GTA mode; then, a neural network vocoder is trained using the training audio in the original training data and the second GTA acoustic feature parameters predicted by the student model as inputs; finally, the student model is used as the acoustic feature parameter prediction model and the neural network vocoder as the synthesizer, yielding the end-to-end speech synthesis system for final use.
The beneficial effect of this technical solution is that it provides the specific steps for performing end-to-end speech synthesis with a trained student model.
As shown in FIG. 2, an embodiment of the present invention provides an end-to-end speech synthesis training system based on knowledge distillation, including:
an acquisition module 201, configured to acquire original training data;
a teacher model training module 202, configured to train a teacher model using the original training data;
a student model training module 203, configured to train the student model using the acoustic feature parameters predicted by the teacher model as training data;
and a speech synthesis module 204, configured to perform end-to-end speech synthesis using the trained student model.
The working principle of the technical scheme is as follows: the inventor finds that the traditional end-to-end synthesis system has obvious reduction of the synthesis effect outside the set, and one important reason is that the model training is not matched with the test. The decoding submodule (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameter of the previous frame as an input in training, and uses the prediction output of the Decoder of the previous frame as a current input in testing, and the mismatch can cause the prediction precision of the acoustic feature parameter outside the set to be poor in testing, so that the hearing of the synthesized speech outside the set is poor.
In the invention, knowledge distillation principle is applied to training of an end-to-end voice synthesis system, and an acquisition module 201 acquires original training data; the teacher model training module 202 trains the teacher model using the raw training data; the student model training module 203 uses acoustic feature parameters predicted by the teacher model as training data to train the student model; the speech synthesis module 204 is configured to predict acoustic feature parameters using the trained student model for end-to-end speech synthesis.
Wherein, the original training data acquired by the acquisition module 201 includes training audio and pronunciation text corresponding to the training audio.
The beneficial effects of the technical scheme are as follows: the knowledge distillation technology is adopted, a teacher model is trained by using a teacher model training module, the student model is trained by using acoustic characteristic parameters predicted by the teacher model as input, the student model is trained, and the end-to-end speech synthesis is performed by using the trained student model, so that the problem of poor hearing of the synthesized speech outside the set caused by mismatching between training and testing can be effectively solved.
In one embodiment, the teacher model training module 202 includes:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and a teacher model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
The working principle of this technical solution is as follows: the acoustic feature parameters extracted from the training audio by the GT acoustic feature parameter extraction unit are called GT (Ground Truth) acoustic feature parameters.
Further, when training the decoding submodule (Decoder) of the teacher model, the teacher model training unit uses the GT acoustic features of the current frame as the target output and the GT acoustic features of the previous frame as the input.
The beneficial effect of this technical solution is that training of the teacher model is realized by means of the GT acoustic feature parameter extraction unit and the teacher model training unit.
In one embodiment, the student model training module 203 includes:
a first GTA acoustic feature parameter prediction unit, configured to input the in-set training text into the teacher model and to predict and generate the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and a student model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
The working principle of this technical solution is as follows: GTA (Ground Truth Align) means that, during inference in the decoding submodule (Decoder), the GT acoustic feature parameters of the previous frame are used as input to predict the acoustic features of the current frame. The in-set acoustic feature parameters predicted and generated in GTA mode are the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Align mode, the GT acoustic features and the predicted acoustic feature parameters are guaranteed to be aligned in duration, which solves the problem of aligning the data durations.
Further, when training the decoding submodule of the student model, the student model training unit uses the first GTA acoustic feature parameters of the previous frame as the input and the GT acoustic feature parameters of the current frame as the target output.
The beneficial effect of this technical solution is that training of the student model is realized by means of the first GTA acoustic feature parameter prediction unit and the student model training unit; the GT acoustic features and the predicted acoustic feature parameters are guaranteed to be aligned in duration, which solves the problem of aligning the data durations.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (4)

1. An end-to-end speech synthesis training method based on knowledge distillation, characterized in that the method performs the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model using the original training data;
Step 3: training a student model using the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis using the trained student model;
in step 1, the original training data comprises training audio and the pronunciation text corresponding to the training audio;
step 2, training the teacher model using the original training data, comprises the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model;
step 3, training the student model using the acoustic feature parameters predicted by the teacher model as training data, comprises the following steps:
Step S31: inputting the in-set training text into the teacher model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model;
step 4, performing end-to-end speech synthesis using the trained student model, comprises the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as inputs;
Step S43: using the neural network vocoder as the speech synthesizer to perform end-to-end speech synthesis.
2. The method of claim 1, wherein in step S22, when the decoding submodule of the teacher model is trained, the GT acoustic features of the current frame are used as the target output and the GT acoustic features of the previous frame are used as the input.
3. The method of claim 1, wherein in step S32, when the decoding submodule of the student model is trained, the first GTA acoustic feature parameters of the previous frame are used as the input and the GT acoustic feature parameters of the current frame are used as the target output.
4. An end-to-end speech synthesis training system based on knowledge distillation, comprising:
an acquisition module, configured to acquire original training data, the original training data comprising training audio and the pronunciation text corresponding to the training audio;
a teacher model training module, configured to train a teacher model using the original training data;
a student model training module, configured to train the student model using the acoustic feature parameters predicted by the teacher model as training data;
and a speech synthesis module, configured to perform end-to-end speech synthesis using the trained student model;
wherein the teacher model training module comprises:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and a teacher model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model;
the student model training module comprises:
a first GTA acoustic feature parameter prediction unit, configured to input the in-set training text into the teacher model and to predict and generate the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and a student model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model;
and wherein performing end-to-end speech synthesis using the trained student model comprises the following steps:
using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
training a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as inputs;
and using the neural network vocoder as the speech synthesizer to perform end-to-end speech synthesis.
CN202010718085.2A 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation Active CN112002303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010718085.2A CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010718085.2A CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Publications (2)

Publication Number Publication Date
CN112002303A CN112002303A (en) 2020-11-27
CN112002303B true CN112002303B (en) 2023-12-15

Family

ID=73467751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010718085.2A Active CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN112002303B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735389A (en) * 2020-12-29 2021-04-30 平安科技(深圳)有限公司 Voice training method, device and equipment based on deep learning and storage medium
CN113611311A (en) * 2021-08-20 2021-11-05 天津讯飞极智科技有限公司 Voice transcription method, device, recording equipment and storage medium
CN115376484A (en) * 2022-08-18 2022-11-22 天津大学 Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10872596B2 (en) * 2017-10-19 2020-12-22 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Statistical Parametric Speech Synthesis Using Generalized Distillation Framework; Z.-C. Liu et al.; IEEE Signal Processing Letters, vol. 25, no. 5, pp. 695-699 *
Teacher-Student Training For Robust Tacotron-Based TTS; R. Liu et al.; ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6274-6278 *
Research on Speech Generation Methods Combining Articulatory Features and Deep Learning; Liu Zhengchen; China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 10; I136-28 *

Also Published As

Publication number Publication date
CN112002303A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN112002303B (en) End-to-end speech synthesis training method and system based on knowledge distillation
Valin et al. LPCNet: Improving neural speech synthesis through linear prediction
WO2021128256A1 (en) Voice conversion method, apparatus and device, and storage medium
CN107871496B (en) Speech recognition method and device
CN112017644A (en) Sound transformation system, method and application
CN108053823A (en) A kind of speech recognition system and method
CN105654939A (en) Voice synthesis method based on voice vector textual characteristics
Siuzdak et al. WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
CN102436807A (en) Method and system for automatically generating voice with stressed syllables
CN111128211B (en) Voice separation method and device
CN112509563A (en) Model training method and device and electronic equipment
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN112364125B (en) Text information extraction system and method combining reading course learning mechanism
CN111968617A (en) Voice conversion method and system for non-parallel data
CN111986646A (en) Dialect synthesis method and system based on small corpus
CN101887719A (en) Speech synthesis method, system and mobile terminal equipment with speech synthesis function
CN111724809A (en) Vocoder implementation method and device based on variational self-encoder
Yang et al. Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise
CN107274883A (en) Voice signal reconstructing method and device
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
CN113450760A (en) Method and device for converting text into voice and electronic equipment
CN107464569A (en) Vocoder
CN116844522A (en) Phonetic boundary label marking method and speech synthesis method
CN116741144A (en) Voice tone conversion method and system
CN115762471A (en) Voice synthesis method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant