CN112002303A - End-to-end speech synthesis training method and system based on knowledge distillation - Google Patents
End-to-end speech synthesis training method and system based on knowledge distillation
- Publication number: CN112002303A (application CN202010718085.2A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique, using neural networks
Abstract
The invention provides a knowledge-distillation-based end-to-end speech synthesis training method and system. The method comprises the following steps: Step 1: acquire original training data; Step 2: train a teacher model with the original training data; Step 3: train a student model using the acoustic feature parameters predicted by the teacher model as training data; Step 4: perform end-to-end speech synthesis with the trained student model. By applying knowledge distillation, a teacher model is trained, a student model is trained with the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is finally carried out with the trained student model. This effectively alleviates the degraded perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
Description
Technical Field
The invention relates to the technical field of speech synthesis, in particular to an end-to-end speech synthesis training method and system based on knowledge distillation.
Background
At present, an end-to-end speech synthesis system generally comprises an acoustic feature parameter prediction module and a synthesizer module. The acoustic feature parameter prediction module usually adopts sequence-to-sequence modeling and contains sub-modules such as Embedding, Encoder-Decoder, and Post-Net. The synthesizer module typically employs either a vocoder based on acoustic signal processing or a neural network vocoder. The original training data used to train the end-to-end synthesis system comprise audio data and the corresponding pronunciation texts; the acoustic feature parameter prediction module is trained on the pronunciation text data together with the acoustic feature parameters extracted from the audio.
During training, the Decoder submodule of the acoustic feature parameter prediction module takes the ground-truth (GT) acoustic feature parameters of the previous frame as the input for the current frame; during testing, it instead takes the Decoder's own predicted output for the previous frame as the input. Because model predictions always contain errors, feeding GT parameters at training time but predicted parameters at test time creates a train/test mismatch (often called exposure bias). This mismatch degrades the prediction accuracy of acoustic feature parameters on out-of-set (unseen) text at test time, which in turn degrades the perceptual quality of the out-of-set synthesized speech.
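The mismatch described above can be illustrated with a toy one-parameter decoder; the model, its weight, and the feature track below are hypothetical stand-ins for illustration only, not the patented implementation:

```python
import numpy as np

# Hypothetical one-step linear "decoder": predicts frame t as W * frame t-1.
W = 0.9
gt = np.arange(20, dtype=float)  # toy ground-truth (GT) feature track

# Training-style pass (teacher forcing): the GT previous frame is the input.
train_pred = W * gt[:-1]
train_err = float(np.mean((train_pred - gt[1:]) ** 2))

# Test-style pass (free running): the model's own previous output is the
# input, so per-frame prediction errors feed back and compound.
pred, test_pred = gt[0], []
for _ in range(len(gt) - 1):
    pred = W * pred
    test_pred.append(pred)
test_err = float(np.mean((np.array(test_pred) - gt[1:]) ** 2))

print(train_err < test_err)  # free-running error is far larger here
```

In this toy setup the teacher-forced error stays bounded while the free-running pass drifts away from the GT track, which mirrors the degradation the patent attributes to the train/test mismatch.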
Disclosure of Invention
The invention provides a knowledge-distillation-based end-to-end speech synthesis training method and system that avoid the degraded perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention provides an end-to-end speech synthesis training method based on knowledge distillation, which comprises the following steps:
Step 1: acquire original training data;
Step 2: train a teacher model with the original training data;
Step 3: train a student model using the acoustic feature parameters predicted by the teacher model as training data;
Step 4: perform end-to-end speech synthesis with the trained student model.
Further, in the step 1, the original training data includes training audio and pronunciation text corresponding to the training audio.
Further, step 2 (training a teacher model with the original training data) executes the following steps:
Step S21: extract GT acoustic feature parameters from the training audio in the original training data;
Step S22: train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the teacher model.
Further, in step S22, when training the decoding submodule (Decoder) of the teacher model, the GT acoustic feature parameters of the current frame are used as the target output, and the GT acoustic feature parameters of the previous frame are used as the input.
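As a minimal sketch of this teacher-forced setup (with a hypothetical one-parameter model and a synthetic feature track standing in for real GT acoustic feature parameters):

```python
import numpy as np

gt = np.linspace(0.0, 1.0, 50)  # synthetic stand-in for GT acoustic features

# Teacher-forced training pairs: the input is the GT feature of frame t-1
# and the target is the GT feature of frame t.
x, y = gt[:-1], gt[1:]

# Fit a one-parameter "decoder" y ~ w * x by least squares.
w = float(np.dot(x, y) / np.dot(x, x))
mse = float(np.mean((w * x - y) ** 2))
print(mse < 1e-2)  # the toy model fits the teacher-forced pairs closely
```

The point of the sketch is the pairing of inputs and targets, not the model class; the real Decoder is a neural network trained on the same (previous GT frame, current GT frame) relationship.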
Further, step 3 (training a student model using the acoustic feature parameters predicted by the teacher model as training data) executes the following steps:
Step S31: input the in-set training text into the teacher model and generate in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode, obtaining the first GTA acoustic feature parameters;
Step S32: train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the student model.
Further, in step S32, when training the decoding submodule of the student model, the first GTA acoustic feature parameter of the previous frame is used as an input, and the GT acoustic feature parameter of the current frame is used as a target output.
Further, step 4 (performing end-to-end speech synthesis with the trained student model) executes the following steps:
Step S41: use the student model as the acoustic feature parameter prediction model, input the in-set training pronunciation text into it, and generate in-set acoustic feature parameters in GTA mode, obtaining the second GTA acoustic feature parameters;
Step S42: train a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: perform end-to-end speech synthesis using the neural network vocoder as the speech synthesizer.
The end-to-end speech synthesis training method based on knowledge distillation provided by the embodiment of the invention has the following beneficial effects: by adopting knowledge distillation, a teacher model is trained, a student model is trained with the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is finally carried out with the trained student model; this effectively alleviates the degraded perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention also provides an end-to-end speech synthesis training system based on knowledge distillation, comprising:
an acquisition module, used for acquiring original training data;
a teacher model training module, used for training a teacher model with the original training data;
a student model training module, used for training a student model with the acoustic feature parameters predicted by the teacher model as training data;
a speech synthesis module, used for performing end-to-end speech synthesis with the trained student model.
Further, the teacher model training module comprises:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
a teacher model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
Further, the student model training module comprises:
a first GTA acoustic feature parameter prediction unit, configured to input the in-set training text into the teacher model and generate in-set acoustic feature parameters in GTA mode, obtaining the first GTA acoustic feature parameters;
a student model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
The end-to-end speech synthesis training system based on knowledge distillation provided by the embodiment of the invention has the following beneficial effects: using knowledge distillation, the teacher model training module trains the teacher model; the student model training module trains the student model with the acoustic feature parameters predicted by the teacher model as input; and end-to-end speech synthesis is carried out with the trained student model, which effectively alleviates the degraded perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart of a knowledge-distillation-based end-to-end speech synthesis training method according to an embodiment of the present invention;
FIG. 2 is a block diagram of an end-to-end speech synthesis training system based on knowledge distillation according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
An embodiment of the present invention provides an end-to-end speech synthesis training method based on knowledge distillation. As shown in FIG. 1, the method performs the following steps:
Step 1: acquire original training data;
Step 2: train a teacher model with the original training data;
Step 3: train a student model using the acoustic feature parameters predicted by the teacher model as training data;
Step 4: perform end-to-end speech synthesis with the trained student model.
The working principle of this technical scheme is as follows: the inventors found that the quality of out-of-set synthesis from a conventional end-to-end synthesis system is significantly reduced, and an important reason is that model training and testing are mismatched. The decoding submodule (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameters of the previous frame as input during training, but uses the Decoder's predicted output for the previous frame as the current input during testing. This mismatch degrades the prediction accuracy of out-of-set acoustic feature parameters at test time, which in turn degrades the perceptual quality of out-of-set synthesized speech.
The knowledge distillation principle is applied to the training of an end-to-end speech synthesis system: after the original training data are obtained, a teacher model is trained on the original training data, and a student model is then trained using the feature parameters predicted by the teacher model as training data; finally, the trained student model predicts the acoustic feature parameters for end-to-end speech synthesis.
In step 1, the original training data include the training audio and the pronunciation text corresponding to the training audio.
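The four-step flow described above can be sketched end to end with toy linear models; every function and value here is a hypothetical stand-in for illustration, not the patented implementation:

```python
import numpy as np

def extract_gt_features(audio):
    """Step 1 stand-in: treat the waveform itself as the GT feature track."""
    return np.asarray(audio, dtype=float)

def train_linear(x, y):
    """Toy least-squares fit of a one-parameter predictor y ~ w * x."""
    return float(np.dot(x, y) / np.dot(x, x))

audio = np.linspace(0.1, 1.0, 40)            # step 1: original data
gt = extract_gt_features(audio)

w_teacher = train_linear(gt[:-1], gt[1:])    # step 2: teacher (GT in, GT out)
gta = w_teacher * gt[:-1]                    # teacher's GTA features (aligned)

w_student = train_linear(gta, gt[1:])        # step 3: student
                                             # (predicted input, GT target)

# Step 4: free-running synthesis with the trained student model.
frame, synth = gt[0], []
for _ in range(10):
    frame = w_student * frame
    synth.append(frame)

print(len(synth))  # 10 synthesized frames
```

The sketch only shows the data flow between the four steps: the student is trained on the teacher's predicted features, so its free-running inputs at step 4 match its training-time input distribution.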
The beneficial effects of the above technical scheme are: by adopting knowledge distillation, a teacher model is trained, a student model is trained with the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is finally carried out with the trained student model, effectively alleviating the degraded perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
In one embodiment, step 2 (training a teacher model with the original training data) executes the following steps:
Step S21: extract GT acoustic feature parameters from the training audio in the original training data;
Step S22: train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the teacher model.
The working principle of this technical scheme is as follows: the acoustic feature parameters extracted from the training audio are referred to as GT (Ground Truth) acoustic feature parameters.
Further, in step S22, when training the decoding sub-module (Decoder) of the teacher model, the GT acoustic feature parameters of the current frame are used as the target output, and the GT acoustic feature parameters of the previous frame are used as the input.
The beneficial effects of the above technical scheme are: specific steps are provided for training the teacher model using the original training data.
In one embodiment, step 3 (training a student model using the acoustic feature parameters predicted by the teacher model as training data) executes the following steps:
Step S31: input the in-set training text into the teacher model and generate in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode, obtaining the first GTA acoustic feature parameters;
Step S32: train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data; the trained acoustic feature parameter prediction model serves as the student model.
The working principle of this technical scheme is as follows: in GTA (Ground Truth Aligned) mode, the decoding submodule (Decoder) uses the GT acoustic feature parameters of the previous frame as input when predicting the acoustic features of the current frame at inference time. The in-set acoustic feature parameters generated in GTA mode are called the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Aligned mode, the GT acoustic features and the predicted acoustic feature parameters have the same number of frames, which solves the problem of aligning the two sequences in duration.
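A minimal sketch of GTA generation as described above, with a hypothetical one-parameter decoder: because every step is conditioned on the GT previous frame, the predicted sequence is frame-for-frame aligned with the GT targets.

```python
import numpy as np

W = 0.95                          # toy "trained decoder" weight (hypothetical)
gt = np.linspace(0.0, 1.0, 30)    # GT acoustic feature track

# Free-running inference would feed back the model's own outputs and could
# drift; GTA instead conditions each step on the GT previous frame, so the
# output length is pinned to the GT length.
gta = W * gt[:-1]                 # one prediction per GT target frame

print(len(gta) == len(gt) - 1)    # aligned frame-for-frame with GT targets
```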
Further, in step S32, when training the decoding submodule of the student model, the first GTA acoustic feature parameter of the previous frame is used as an input, and the GT acoustic feature parameter of the current frame is used as a target output.
The beneficial effects of the above technical scheme are: it provides the specific steps for training the student model with the feature parameters predicted by the teacher model as training data, so that the GT acoustic features and the predicted acoustic feature parameters are aligned in duration, solving the data duration alignment problem.
In one embodiment, step 4 (performing end-to-end speech synthesis with the trained student model) executes the following steps:
Step S41: use the student model as the acoustic feature parameter prediction model, input the in-set training pronunciation text into it, and generate in-set acoustic feature parameters in GTA mode, obtaining the second GTA acoustic feature parameters;
Step S42: train a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: perform end-to-end speech synthesis using the neural network vocoder as the speech synthesizer.
The working principle of this technical scheme is as follows: first, the student model obtained in step 3 is fed the in-set training text and generates in-set acoustic feature parameters in GTA mode; then, a neural network vocoder is trained with the training audio in the original training data and the second GTA acoustic feature parameters predicted by the student model as input; finally, the student model serves as the acoustic feature parameter prediction model and the neural network vocoder serves as the synthesizer, forming the end-to-end speech synthesis system that is ultimately used.
The beneficial effects of the above technical scheme are: specific steps are provided for performing end-to-end speech synthesis using the trained student model.
As shown in FIG. 2, an embodiment of the present invention provides an end-to-end speech synthesis training system based on knowledge distillation, comprising:
an acquisition module 201, configured to acquire original training data;
a teacher model training module 202, configured to train a teacher model with the original training data;
a student model training module 203, configured to train a student model with the acoustic feature parameters predicted by the teacher model as training data;
a speech synthesis module 204, configured to perform end-to-end speech synthesis with the trained student model.
The working principle of this technical scheme is as follows: the inventors found that the quality of out-of-set synthesis from a conventional end-to-end synthesis system is significantly reduced, and an important reason is that model training and testing are mismatched. The decoding submodule (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameters of the previous frame as input during training, but uses the Decoder's predicted output for the previous frame as the current input during testing; this mismatch degrades the prediction accuracy of out-of-set acoustic feature parameters at test time and in turn the perceptual quality of out-of-set synthesized speech.
The invention applies the knowledge distillation principle to the training of the end-to-end speech synthesis system: the acquisition module 201 acquires the original training data; the teacher model training module 202 trains the teacher model with the original training data; the student model training module 203 trains the student model with the acoustic feature parameters predicted by the teacher model as training data; and the speech synthesis module 204 uses the trained student model to predict the acoustic feature parameters for end-to-end speech synthesis.
The original training data acquired by the acquisition module 201 includes training audio and pronunciation text corresponding to the training audio.
The beneficial effects of the above technical scheme are: using knowledge distillation, the teacher model training module trains the teacher model, the student model training module trains the student model with the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is carried out with the trained student model, effectively alleviating the degraded perceptual quality of out-of-set synthesized speech caused by the mismatch between training and testing.
In one embodiment, the teacher model training module 202 includes:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and the teacher model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, and using the trained acoustic feature parameter prediction model as the teacher model.
The working principle of this technical scheme is as follows: the acoustic feature parameters extracted from the training audio by the GT acoustic feature parameter extraction unit are referred to as GT (Ground Truth) acoustic feature parameters.
Further, the teacher model training unit uses the GT acoustic feature parameters of the current frame as the target output and the GT acoustic feature parameters of the previous frame as the input when training the decoding sub-module (Decoder) of the teacher model.
The beneficial effects of the above technical scheme are: by means of the GT acoustic feature parameter extraction unit and the teacher model training unit, training of the teacher model can be achieved.
In one embodiment, the student model training module 203 comprises:
the first GTA acoustic characteristic parameter prediction unit is used for inputting the in-set training text into the teacher model, and predicting and generating in-set acoustic characteristic parameters in a GTA mode to obtain first GTA acoustic characteristic parameters;
and the student model training unit is used for training an acoustic characteristic parameter prediction model by adopting the pronunciation text, the GT acoustic characteristic parameter and the first GTA acoustic characteristic parameter as training data, and taking the trained acoustic characteristic parameter prediction model as the student model.
The working principle of this technical scheme is as follows: in GTA (Ground Truth Aligned) mode, the decoding submodule (Decoder) uses the GT acoustic feature parameters of the previous frame as input when predicting the acoustic features of the current frame at inference time. The in-set acoustic feature parameters generated in GTA mode are called the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Aligned mode, the GT acoustic features and the predicted acoustic feature parameters have the same number of frames, which solves the problem of aligning the two sequences in duration.
Further, when the student model training unit trains the decoding sub-modules of the student model, the student model training unit takes the first GTA acoustic feature parameter of the previous frame as input, and takes the GT acoustic feature parameter of the current frame as target output.
The beneficial effects of the above technical scheme are: by means of the first GTA acoustic feature parameter prediction unit and the student model training unit, training of the student model can be achieved, and the GT acoustic features and the predicted acoustic feature parameters can be aligned in duration, solving the data duration alignment problem.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. A knowledge-distillation-based end-to-end speech synthesis training method, characterized in that the method performs the steps of:
step 1: acquiring original training data;
step 2: training a teacher model by using the original training data;
Step 3: taking the acoustic feature parameters predicted by the teacher model as training data to train the student model;
Step 4: performing end-to-end speech synthesis by using the trained student model.
2. The method of claim 1, wherein in step 1, the raw training data includes training audio and pronunciation text corresponding to the training audio.
3. The method of claim 2, wherein step 2 (training a teacher model by using the original training data) executes the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, wherein the trained acoustic feature parameter prediction model is used as the teacher model.
4. The method according to claim 3, wherein in step S22, GT acoustic feature of the current frame is used as a target output and GT acoustic feature of the previous frame is used as an input in training the decoding sub-module of the teacher model.
5. The method of claim 3, wherein step 3 (taking the acoustic feature parameters predicted by the teacher model as training data to train the student model) executes the following steps:
Step S31: inputting the in-set training text into the teacher model, and generating in-set acoustic feature parameters in GTA (Ground Truth Aligned) mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data, wherein the trained acoustic feature parameter prediction model is used as the student model.
6. The method of claim 5, wherein in step S32, in training the decoding sub-module of the student model, the first GTA acoustic feature parameter of the previous frame is used as an input, and the GT acoustic feature parameter of the current frame is used as a target output.
7. The method of claim 1, wherein step 4 (performing end-to-end speech synthesis by using the trained student model) executes the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and generating in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder by using the training audio and the second GTA acoustic feature parameters predicted by the student model as input;
Step S43: performing end-to-end speech synthesis by using the neural network vocoder as the speech synthesizer.
8. A knowledge-distillation-based end-to-end speech synthesis training system, comprising:
the acquisition module is used for acquiring original training data;
the teacher model training module is used for training a teacher model by using the original training data;
the student model training module is used for training a student model by taking the acoustic characteristic parameters predicted by the teacher model as training data;
and the speech synthesis module is used for performing end-to-end speech synthesis by using the trained student model.
9. The system of claim 8, wherein the teacher model training module comprises:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and the teacher model training unit is used for training an acoustic feature parameter prediction model by using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, and using the trained acoustic feature parameter prediction model as the teacher model.
10. The system of claim 8, wherein the student model training module comprises:
a first GTA acoustic feature parameter prediction unit, configured to input the in-set training pronunciation text into the teacher model and to predict and generate the in-set acoustic feature parameters in GTA mode to obtain first GTA acoustic feature parameters;
and a student model training unit, configured to train an acoustic feature parameter prediction model by using the pronunciation text, the GT acoustic feature parameters and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
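Since the student model training unit of claim 10 sees both the GT features and the teacher's first GTA features, one natural objective is a weighted combination of the two regression losses. The sketch below is an assumption for illustration only: the claims do not prescribe L1 distances or the `alpha` mixing weight.

```python
import numpy as np

def student_loss(student_pred, gt_feat, teacher_gta_feat, alpha=0.5):
    """Hypothetical combined training objective for the student model:
    an L1 term against the GT acoustic features plus an L1 term against
    the teacher's first GTA features, mixed by `alpha` (an assumed
    hyperparameter, not a weighting given in the claims)."""
    loss_gt = np.abs(student_pred - gt_feat).mean()
    loss_teacher = np.abs(student_pred - teacher_gta_feat).mean()
    return alpha * loss_gt + (1.0 - alpha) * loss_teacher

loss = student_loss(np.array([1.0]), np.array([0.0]), np.array([2.0]), alpha=0.5)
```

Setting `alpha=1.0` recovers ordinary supervised training on GT features, while `alpha=0.0` is pure distillation from the teacher; intermediate values trade off fidelity to the recording against the teacher's smoother, easier-to-fit targets.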
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010718085.2A CN112002303B (en) | 2020-07-23 | 2020-07-23 | End-to-end speech synthesis training method and system based on knowledge distillation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112002303A true CN112002303A (en) | 2020-11-27 |
CN112002303B CN112002303B (en) | 2023-12-15 |
Family
ID=73467751
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010718085.2A Active CN112002303B (en) | 2020-07-23 | 2020-07-23 | End-to-end speech synthesis training method and system based on knowledge distillation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112002303B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113611311A (en) * | 2021-08-20 | 2021-11-05 | 天津讯飞极智科技有限公司 | Voice transcription method, device, recording equipment and storage medium |
WO2022141842A1 (en) * | 2020-12-29 | 2022-07-07 | 平安科技(深圳)有限公司 | Deep learning-based speech training method and apparatus, device, and storage medium |
CN115376484A (en) * | 2022-08-18 | 2022-11-22 | 天津大学 | Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190180732A1 (en) * | 2017-10-19 | 2019-06-13 | Baidu Usa Llc | Systems and methods for parallel wave generation in end-to-end text-to-speech |
Non-Patent Citations (3)
Title |
---|
R. LIU等: "Teacher-Student Training For Robust Tacotron-Based TTS", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), BARCELONA, SPAIN, 2020, pages 6274 - 6278 * |
Z. -C. LIU等: "Statistical Parametric Speech Synthesis Using Generalized Distillation Framework", IEEE SIGNAL PROCESSING LETTERS, vol. 25, no. 05, pages 695 - 699 * |
LIU ZHENGCHEN (刘正晨): "Research on Speech Generation Methods Combining Articulatory Features and Deep Learning", CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY, no. 10, pages 136 - 28 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112002303B (en) | End-to-end speech synthesis training method and system based on knowledge distillation | |
CN111883110B (en) | Acoustic model training method, system, equipment and medium for speech recognition | |
CN109599093B (en) | Intelligent quality inspection keyword detection method, device and equipment and readable storage medium | |
CN110136691B (en) | Speech synthesis model training method and device, electronic equipment and storage medium | |
WO2021128256A1 (en) | Voice conversion method, apparatus and device, and storage medium | |
CN107871496B (en) | Speech recognition method and device | |
CN106782603B (en) | Intelligent voice evaluation method and system | |
CN112017644A (en) | Sound transformation system, method and application | |
CN108053823A (en) | A kind of speech recognition system and method | |
CN105654939A (en) | Voice synthesis method based on voice vector textual characteristics | |
CN111128211B (en) | Voice separation method and device | |
CN112634866B (en) | Speech synthesis model training and speech synthesis method, device, equipment and medium | |
CN110390928B (en) | Method and system for training speech synthesis model of automatic expansion corpus | |
CN113053357B (en) | Speech synthesis method, apparatus, device and computer readable storage medium | |
CN112329451B (en) | Sign language action video generation method, device, equipment and storage medium | |
CN111986646B (en) | Dialect synthesis method and system based on small corpus | |
CN112364125B (en) | Text information extraction system and method combining reading course learning mechanism | |
CN111915940A (en) | Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation | |
CN109213970B (en) | Method and device for generating notes | |
CN113450760A (en) | Method and device for converting text into voice and electronic equipment | |
CN107464569A (en) | Vocoder | |
CN116741144A (en) | Voice tone conversion method and system | |
CN115762471A (en) | Voice synthesis method, device, equipment and storage medium | |
CN111785236A (en) | Automatic composition method based on motivational extraction model and neural network | |
CN114783424A (en) | Text corpus screening method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||