CN112002303A - End-to-end speech synthesis training method and system based on knowledge distillation - Google Patents

End-to-end speech synthesis training method and system based on knowledge distillation

Info

Publication number
CN112002303A
CN112002303A (application CN202010718085.2A)
Authority
CN
China
Prior art keywords
training
model
acoustic characteristic
acoustic
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010718085.2A
Other languages
Chinese (zh)
Other versions
CN112002303B (en)
Inventor
贺来朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010718085.2A
Publication of CN112002303A
Application granted
Publication of CN112002303B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a knowledge distillation-based end-to-end speech synthesis training method and system, wherein the method comprises the following steps: Step 1: acquiring original training data; Step 2: training a teacher model by using the original training data; Step 3: training a student model by taking the acoustic feature parameters predicted by the teacher model as training data; Step 4: performing end-to-end speech synthesis by using the trained student model. The method adopts knowledge distillation: a teacher model is trained, the acoustic feature parameters predicted by the teacher model are used as input to train a student model, and the trained student model finally performs end-to-end speech synthesis, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.

Description

End-to-end speech synthesis training method and system based on knowledge distillation
Technical Field
The invention relates to the technical field of speech synthesis, in particular to an end-to-end speech synthesis training method and system based on knowledge distillation.
Background
At present, an end-to-end speech synthesis system generally comprises an acoustic feature parameter prediction module and a synthesizer module. The acoustic feature parameter prediction module generally adopts a sequence-to-sequence modeling method and comprises sub-modules such as Embedding, Encoder-Decoder and Post-Net. The synthesizer module typically employs a vocoder based on acoustic signal processing, or a neural network vocoder. The original training data used for training an end-to-end synthesis system comprise audio data and the corresponding pronunciation texts; the acoustic feature parameter prediction module is trained on the pronunciation text data and the acoustic feature parameters extracted from the audio.
During training, the decoding sub-module (Decoder) of the acoustic feature parameter prediction module takes the GT acoustic feature parameters of the previous frame as the input for the current frame; during testing, it takes the Decoder's predicted output for the previous frame as the input for the current frame. Because model predictions always contain errors, using the GT acoustic feature parameters during training but the model-predicted feature parameters during testing creates a mismatch, which degrades the prediction accuracy of out-of-set acoustic feature parameters at test time and, in turn, the perceptual quality of out-of-set synthesized speech.
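To make the mismatch concrete, the following is a minimal PyTorch-style sketch of the two decoding regimes (this illustrates the general problem, not the patent's implementation; the decoder interface with go_frame and step methods is hypothetical):

```python
import torch

def run_decoder(decoder, memory, gt_frames=None, n_frames=100):
    """Autoregressive decoding; passing gt_frames enables teacher forcing."""
    prev = decoder.go_frame(memory)          # hypothetical all-zero start frame
    outputs = []
    steps = len(gt_frames) if gt_frames is not None else n_frames
    for t in range(steps):
        frame = decoder.step(prev, memory)   # predict the current acoustic frame
        outputs.append(frame)
        if gt_frames is not None:
            prev = gt_frames[t]              # training: next input is the GT frame
        else:
            prev = frame                     # testing: next input is the model's own prediction
    return torch.stack(outputs)
```

At test time the decoder consumes its own, slightly erroneous predictions, a condition it never saw during training; the errors compound frame by frame, which is the mismatch the method described below targets.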
Disclosure of Invention
The invention provides a knowledge distillation-based end-to-end speech synthesis training method and system, which avoid the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention provides an end-to-end speech synthesis training method based on knowledge distillation, which comprises the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model by using the original training data;
Step 3: training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis by using the trained student model.
Further, in step 1, the original training data comprise the training audio and the pronunciation text corresponding to the training audio.
Further, step 2 (training a teacher model by using the original training data) executes the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the teacher model.
Further, in step S22, when training the decoding sub-module (Decoder) of the teacher model, the GT acoustic feature parameters of the current frame are used as the target output, and the GT acoustic feature parameters of the previous frame are used as the input.
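A hedged sketch of one such teacher-forced training step; the encode/decode interface, the MSE loss and the tensor layout (frames along dimension 0) are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def teacher_train_step(teacher, optimizer, text_ids, gt_frames):
    """One teacher-forced update: input is GT frame t-1, target is GT frame t."""
    memory = teacher.encode(text_ids)                    # text -> encoder states
    go_frame = torch.zeros_like(gt_frames[:1])           # all-zero start frame
    decoder_in = torch.cat([go_frame, gt_frames[:-1]])   # GT sequence shifted right by one
    pred = teacher.decode(decoder_in, memory)            # predicted frames 0..T-1
    loss = F.mse_loss(pred, gt_frames)                   # target: current GT frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```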
Further, step 3 (training a student model by taking the acoustic feature parameters predicted by the teacher model as training data) executes the following steps:
Step S31: inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode, obtaining the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the student model.
Further, in step S32, when training the decoding sub-module of the student model, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output.
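The student step mirrors the teacher's, except that the decoder input now comes from the teacher's first GTA acoustic feature parameters while the target remains the GT frame; a sketch under the same assumed interface as above:

```python
import torch
import torch.nn.functional as F

def student_train_step(student, optimizer, text_ids, gta_frames, gt_frames):
    """Input is the teacher's GTA frame t-1 (what the decoder will actually see
    at test time); the target is still the GT frame t."""
    memory = student.encode(text_ids)
    go_frame = torch.zeros_like(gta_frames[:1])
    decoder_in = torch.cat([go_frame, gta_frames[:-1]])  # predicted frames as input
    pred = student.decode(decoder_in, memory)
    loss = F.mse_loss(pred, gt_frames)                   # still supervised by GT frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student learns to map slightly imperfect previous frames to clean current frames, its training condition matches what its decoder will actually see at test time.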
Further, step 4 (performing end-to-end speech synthesis by using the trained student model) executes the following steps:
Step S41: the student model is used as the acoustic feature parameter prediction model; the in-set training pronunciation text is input into the student model, and the in-set acoustic feature parameters are predicted in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: a neural network vocoder is trained by using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: end-to-end speech synthesis is performed by using the neural network vocoder as the speech synthesizer.
The end-to-end speech synthesis training method based on knowledge distillation provided by the embodiment of the invention has the following beneficial effects: by adopting knowledge distillation, a teacher model is trained, the acoustic feature parameters predicted by the teacher model are used as input to train a student model, and the trained student model finally performs end-to-end speech synthesis, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention also provides an end-to-end speech synthesis training system based on knowledge distillation, comprising:
the acquisition module is used for acquiring original training data;
the teacher model training module is used for training a teacher model by using the original training data;
the student model training module is used for training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
and the speech synthesis module is used for performing end-to-end speech synthesis by using the trained student model.
Further, the teacher model training module comprises:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and the teacher model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, and using the trained acoustic feature parameter prediction model as the teacher model.
Further, the student model training module comprises:
the first GTA acoustic feature parameter prediction unit is used for inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and the student model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data, and taking the trained acoustic feature parameter prediction model as the student model.
The end-to-end speech synthesis training system based on knowledge distillation provided by the embodiment of the invention has the following beneficial effects: using knowledge distillation, the teacher model training module trains the teacher model, the student model training module trains the student model by taking the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is performed by the trained student model, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart of an end-to-end speech synthesis training method based on knowledge distillation according to an embodiment of the present invention;
FIG. 2 is a block diagram of an end-to-end speech synthesis training system based on knowledge distillation according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
An embodiment of the present invention provides an end-to-end speech synthesis training method based on knowledge distillation. As shown in FIG. 1, the method performs the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model by using the original training data;
Step 3: training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis by using the trained student model.
The working principle of the technical scheme is as follows: the inventor found that the out-of-set synthesis quality of a conventional end-to-end synthesis system degrades significantly, and an important reason is the mismatch between model training and testing. The decoding sub-module (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameters of the previous frame as input during training, but uses the Decoder's prediction for the previous frame as the current input during testing. This mismatch degrades the prediction accuracy of out-of-set acoustic feature parameters at test time and, in turn, the perceptual quality of out-of-set synthesized speech.
The invention applies the knowledge distillation principle to the training of an end-to-end speech synthesis system: after the original training data are obtained, a teacher model is trained on the original training data, a student model is then trained by taking the feature parameters predicted by the teacher model as training data, and finally the trained student model predicts the acoustic feature parameters for end-to-end speech synthesis.
In step 1, the original training data comprise the training audio and the pronunciation text corresponding to the training audio.
The beneficial effects of the above technical scheme are as follows: by adopting knowledge distillation, a teacher model is trained, the acoustic feature parameters predicted by the teacher model are used as input to train a student model, and the trained student model finally performs end-to-end speech synthesis, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
In one embodiment, step 2 (training a teacher model by using the original training data) executes the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the teacher model.
The working principle of the technical scheme is as follows: the acoustic feature parameters extracted from the training audio are referred to as GT (Ground Truth) acoustic feature parameters.
Further, in step S22, when training the decoding sub-module (Decoder) of the teacher model, the GT acoustic feature parameters of the current frame are used as the target output, and the GT acoustic feature parameters of the previous frame are used as the input.
The beneficial effects of the above technical scheme are as follows: specific steps are provided for training the teacher model by using the original training data.
In one embodiment, step 3 (training a student model by taking the acoustic feature parameters predicted by the teacher model as training data) executes the following steps:
Step S31: inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode, obtaining the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the student model.
The working principle of the technical scheme is as follows: in the GTA (Ground Truth Aligned) method, the decoding sub-module (Decoder) runs inference with the GT acoustic feature parameters of the previous frame as input to predict the acoustic features of the current frame. The in-set acoustic feature parameters generated in GTA mode are referred to as the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Aligned mode, the GT acoustic features and the predicted acoustic feature parameters have the same duration (the same number of frames), which solves the problem of aligning the data in duration.
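A sketch of GTA-mode generation with the trained teacher; the decode_step method is a hypothetical per-frame interface assumed for illustration:

```python
import torch

@torch.no_grad()
def generate_gta(teacher, text_ids, gt_frames):
    """GTA inference: keep the model's predictions, but always feed the GT frame
    of the previous step, so the output length equals the GT length."""
    memory = teacher.encode(text_ids)
    prev = torch.zeros_like(gt_frames[0])           # all-zero "go" frame
    gta_frames = []
    for t in range(len(gt_frames)):
        frame = teacher.decode_step(prev, memory)   # the prediction is what we keep
        gta_frames.append(frame)
        prev = gt_frames[t]                         # GT drives the next step
    return torch.stack(gta_frames)                  # frame-aligned with gt_frames
```

This frame-level alignment is what allows step S32 to pair each first-GTA frame (decoder input) with the GT frame of the same time index (target output).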
Further, in step S32, when training the decoding sub-module of the student model, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output.
The beneficial effects of the above technical scheme are as follows: specific steps are provided for training the student model by taking the feature parameters predicted by the teacher model as training data, so that the GT acoustic features and the predicted acoustic feature parameters are aligned in duration, which solves the problem of data duration alignment.
In one embodiment, step 4 (performing end-to-end speech synthesis by using the trained student model) executes the following steps:
Step S41: the student model is used as the acoustic feature parameter prediction model; the in-set training pronunciation text is input into the student model, and the in-set acoustic feature parameters are predicted in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: a neural network vocoder is trained by using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: end-to-end speech synthesis is performed by using the neural network vocoder as the speech synthesizer.
The working principle of the technical scheme is as follows: first, the student model obtained in step 3 takes the in-set training text as input and generates the in-set acoustic feature parameters in GTA mode; then, a neural network vocoder is trained by taking the training audio in the original training data and the second GTA acoustic feature parameters predicted by the student model as input; finally, the student model serves as the acoustic feature parameter prediction model and the neural network vocoder serves as the synthesizer, which together form the end-to-end speech synthesis system that is finally used.
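A hedged end-to-end sketch of this step, assuming a generic neural network vocoder that maps an acoustic feature sequence to a waveform; the L1 loss, the is_stop check and the n_mels attribute are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def vocoder_train_step(vocoder, optimizer, gta_frames, waveform):
    """Condition the vocoder on the student's GTA features; supervise with audio."""
    pred_wave = vocoder(gta_frames)
    loss = F.l1_loss(pred_wave, waveform)   # loss choice is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def synthesize(student, vocoder, text_ids, max_frames=2000):
    """Final system: the student runs free (feeding back its own predictions),
    and the vocoder renders the predicted features to a waveform."""
    memory = student.encode(text_ids)
    prev = torch.zeros(1, student.n_mels)   # hypothetical feature dimension
    frames = []
    for _ in range(max_frames):
        prev = student.decode_step(prev, memory)
        frames.append(prev)
        if student.is_stop(prev, memory):   # hypothetical stop-token check
            break
    return vocoder(torch.cat(frames))
```

Training the vocoder on the student's GTA features rather than on GT features keeps the vocoder's training inputs consistent with the features the student will actually produce at synthesis time.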
The beneficial effects of the above technical scheme are as follows: specific steps are provided for performing end-to-end speech synthesis by using the trained student model.
As shown in FIG. 2, an embodiment of the present invention provides an end-to-end speech synthesis training system based on knowledge distillation, comprising:
an obtaining module 201, configured to obtain original training data;
a teacher model training module 202, configured to train a teacher model by using the original training data;
a student model training module 203, configured to train a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
and the speech synthesis module 204 is configured to perform end-to-end speech synthesis by using the trained student model.
The working principle of the technical scheme is as follows: the inventor found that the out-of-set synthesis quality of a conventional end-to-end synthesis system degrades significantly, and an important reason is the mismatch between model training and testing. The decoding sub-module (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameters of the previous frame as input during training, but uses the Decoder's prediction for the previous frame as the current input during testing. This mismatch degrades the prediction accuracy of out-of-set acoustic feature parameters at test time and, in turn, the perceptual quality of out-of-set synthesized speech.
The invention applies the knowledge distillation principle to the training of the end-to-end speech synthesis system: the obtaining module 201 acquires the original training data; the teacher model training module 202 trains the teacher model by using the original training data; the student model training module 203 trains the student model by taking the acoustic feature parameters predicted by the teacher model as training data; and the speech synthesis module 204 predicts the acoustic feature parameters with the trained student model to perform end-to-end speech synthesis.
The original training data acquired by the obtaining module 201 comprise the training audio and the pronunciation text corresponding to the training audio.
The beneficial effects of the above technical scheme are as follows: using knowledge distillation, the teacher model training module trains the teacher model, the student model training module trains the student model by taking the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is performed by the trained student model, which effectively avoids the poor perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
In one embodiment, the teacher model training module 202 includes:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and the teacher model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, and using the trained acoustic feature parameter prediction model as the teacher model.
The working principle of the technical scheme is as follows: the acoustic feature parameters extracted from the training audio by the GT acoustic feature parameter extraction unit are referred to as GT (Ground Truth) acoustic feature parameters.
Further, when training the decoding sub-module (Decoder) of the teacher model, the teacher model training unit uses the GT acoustic feature parameters of the current frame as the target output and the GT acoustic feature parameters of the previous frame as the input.
The beneficial effects of the above technical scheme are: by means of the GT acoustic feature parameter extraction unit and the teacher model training unit, training of the teacher model can be achieved.
In one embodiment, the student model training module 203 comprises:
the first GTA acoustic characteristic parameter prediction unit is used for inputting the in-set training text into the teacher model, and predicting and generating in-set acoustic characteristic parameters in a GTA mode to obtain first GTA acoustic characteristic parameters;
and the student model training unit is used for training an acoustic characteristic parameter prediction model by adopting the pronunciation text, the GT acoustic characteristic parameter and the first GTA acoustic characteristic parameter as training data, and taking the trained acoustic characteristic parameter prediction model as the student model.
The working principle of the technical scheme is as follows: in the GTA (Ground Truth Aligned) method, the decoding sub-module (Decoder) runs inference with the GT acoustic feature parameters of the previous frame as input to predict the acoustic features of the current frame. The in-set acoustic feature parameters generated in GTA mode are referred to as the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Aligned mode, the GT acoustic features and the predicted acoustic feature parameters have the same duration, which solves the problem of aligning the data in duration.
Further, when training the decoding sub-module of the student model, the student model training unit takes the first GTA acoustic feature parameters of the previous frame as the input and the GT acoustic feature parameters of the current frame as the target output.
The beneficial effects of the above technical scheme are as follows: by means of the first GTA acoustic feature parameter prediction unit and the student model training unit, the student model can be trained, and the GT acoustic features and the predicted acoustic feature parameters can be aligned in duration, which solves the problem of data duration alignment.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A knowledge-distillation-based end-to-end speech synthesis training method, characterized in that the method performs the steps of:
Step 1: acquiring original training data;
Step 2: training a teacher model by using the original training data;
Step 3: training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis by using the trained student model.
2. The method of claim 1, wherein in step 1, the original training data include training audio and the pronunciation text corresponding to the training audio.
3. The method of claim 2, wherein step 2 (training a teacher model by using the original training data) executes the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, wherein the trained acoustic feature parameter prediction model serves as the teacher model.
4. The method according to claim 3, wherein in step S22, when training the decoding sub-module of the teacher model, the GT acoustic feature parameters of the current frame are used as the target output and the GT acoustic feature parameters of the previous frame are used as the input.
5. The method of claim 3, wherein step 3 (training a student model by taking the acoustic feature parameters predicted by the teacher model as training data) executes the following steps:
Step S31: inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data, wherein the trained acoustic feature parameter prediction model serves as the student model.
6. The method of claim 5, wherein in step S32, when training the decoding sub-module of the student model, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output.
7. The method of claim 1, wherein step 4 (performing end-to-end speech synthesis by using the trained student model) executes the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder by using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: performing end-to-end speech synthesis by using the neural network vocoder as the speech synthesizer.
8. A knowledge-distillation-based end-to-end speech synthesis training system, comprising:
the acquisition module is used for acquiring original training data;
the teacher model training module is used for training a teacher model by using the original training data;
the student model training module is used for training a student model by taking the acoustic feature parameters predicted by the teacher model as training data;
and the speech synthesis module is used for performing end-to-end speech synthesis by using the trained student model.
9. The system of claim 8, wherein the teacher model training module comprises:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and the teacher model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, and using the trained acoustic feature parameter prediction model as the teacher model.
10. The system of claim 8, wherein the student model training module comprises:
the first GTA acoustic feature parameter prediction unit is used for inputting the in-set training text into the teacher model and predicting the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and the student model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data, and taking the trained acoustic feature parameter prediction model as the student model.
Application CN202010718085.2A, priority date 2020-07-23, filing date 2020-07-23: End-to-end speech synthesis training method and system based on knowledge distillation. Status: Active. Granted publication: CN112002303B.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010718085.2A CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010718085.2A CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Publications (2)

Publication Number Publication Date
CN112002303A (en) 2020-11-27
CN112002303B CN112002303B (en) 2023-12-15

Family

ID=73467751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010718085.2A Active CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN112002303B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180732A1 (en) * 2017-10-19 2019-06-13 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180732A1 (en) * 2017-10-19 2019-06-13 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
R. Liu et al., "Teacher-Student Training For Robust Tacotron-Based TTS," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6274-6278.
Z.-C. Liu et al., "Statistical Parametric Speech Synthesis Using Generalized Distillation Framework," IEEE Signal Processing Letters, vol. 25, no. 5, pp. 695-699.
Liu Zhengchen, "Research on Speech Generation Methods Combining Articulatory Features and Deep Learning" (结合发音特征与深度学习的语音生成方法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 10, pp. 136-28.

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022141842A1 (en) * 2020-12-29 2022-07-07 平安科技(深圳)有限公司 Deep learning-based speech training method and apparatus, device, and storage medium
CN113611311A (en) * 2021-08-20 2021-11-05 天津讯飞极智科技有限公司 Voice transcription method, device, recording equipment and storage medium
CN115376484A (en) * 2022-08-18 2022-11-22 天津大学 Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction

Also Published As

Publication number Publication date
CN112002303B (en) 2023-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant