CN112002303B - End-to-end speech synthesis training method and system based on knowledge distillation


Info

Publication number
CN112002303B
CN112002303B
Authority
CN
China
Prior art keywords
training
model
acoustic feature
acoustic
gta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010718085.2A
Other languages
Chinese (zh)
Other versions
CN112002303A (en)
Inventor
贺来朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010718085.2A
Publication of CN112002303A
Application granted
Publication of CN112002303B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides an end-to-end speech synthesis training method and system based on knowledge distillation, wherein the method comprises the following steps: Step 1: acquiring original training data; Step 2: training a teacher model using the original training data; Step 3: training a student model using the acoustic feature parameters predicted by the teacher model as training data; Step 4: performing end-to-end speech synthesis using the trained student model. By adopting the knowledge distillation method, a teacher model is first trained, the acoustic feature parameters predicted by the teacher model are then used as input to train a student model, and the trained student model is finally used to perform end-to-end speech synthesis, which effectively mitigates the degraded listening quality of out-of-set synthesized speech caused by the mismatch between training and testing.

Description

End-to-end speech synthesis training method and system based on knowledge distillation
Technical Field
The invention relates to the technical field of speech synthesis, in particular to an end-to-end speech synthesis training method and system based on knowledge distillation.
Background
At present, an end-to-end speech synthesis system generally comprises an acoustic feature parameter prediction module and a synthesizer module. The acoustic feature parameter prediction module typically adopts a sequence-to-sequence modeling approach and comprises submodules such as Embedding, Encoder-Decoder, and Post-Net. The synthesizer module typically employs a vocoder based on acoustic signal processing, or a neural network vocoder. The original training data for training an end-to-end synthesis system comprises audio data and the corresponding pronunciation text, and the acoustic feature parameter prediction module is trained on the pronunciation text data and the acoustic feature parameters extracted from the audio.
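For illustration only, the following minimal sketch shows how such an acoustic feature parameter prediction module might be organized; the patent names the submodules but not their internals, so the Tacotron-like layer choices and sizes below are assumptions.

```python
# Hypothetical skeleton of the acoustic feature parameter prediction module;
# all layer types and dimensions are illustrative assumptions.
import torch.nn as nn

class AcousticFeaturePredictor(nn.Module):
    def __init__(self, vocab_size=100, dim=256, n_mels=80):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)           # Embedding
        self.encoder = nn.LSTM(dim, dim, batch_first=True)       # Encoder
        self.decoder_cell = nn.LSTMCell(dim + n_mels, dim)       # frame-level Decoder
        self.frame_proj = nn.Linear(dim, n_mels)                 # frame projection
        self.post_net = nn.Conv1d(n_mels, n_mels, 5, padding=2)  # Post-Net
```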
During training, the Decoder submodule of the acoustic feature parameter prediction module takes the ground-truth (GT) acoustic feature parameters of the previous frame as the input for the current frame; during testing, it takes the Decoder's predicted output for the previous frame as the input for the current frame. Because model predictions always contain errors, using the GT acoustic feature parameters as input during training but the model-predicted feature parameters as input during testing creates a mismatch, which degrades the prediction accuracy of the acoustic feature parameters on out-of-set (unseen) text during testing and, in turn, the listening quality of the out-of-set synthesized speech.
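This mismatch can be made concrete with a small sketch. `decoder_step` below stands for a hypothetical one-frame decoder function and `gt_frames` for the GT feature sequence; neither name comes from the patent.

```python
# Illustrative sketch of the train/test input mismatch described above.
import torch

def run_decoder(decoder_step, gt_frames, teacher_forcing: bool):
    outputs = []
    prev = torch.zeros_like(gt_frames[0])  # initial "previous frame"
    for t in range(len(gt_frames)):
        pred = decoder_step(prev)          # predict frame t from frame t-1
        outputs.append(pred)
        # Training feeds the GT frame; testing feeds the model's own prediction,
        # so prediction errors accumulate at test time.
        prev = gt_frames[t] if teacher_forcing else pred
    return torch.stack(outputs)
```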
Disclosure of Invention
The invention provides an end-to-end speech synthesis training method and system based on knowledge distillation, which are used to avoid the degraded listening quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention provides an end-to-end speech synthesis training method based on knowledge distillation, which comprises the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model using the original training data;
Step 3: training a student model using the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis using the trained student model.
Further, in step 1, the original training data includes training audio and the pronunciation text corresponding to the training audio.
Further, step 2, training the teacher model using the original training data, comprises the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
Further, in step S22, when the decoding submodule of the teacher model is trained, the GT acoustic features of the current frame are used as the target output, and the GT acoustic features of the previous frame are used as the input.
Further, step 3, training the student model using the acoustic feature parameters predicted by the teacher model as training data, comprises the following steps:
Step S31: inputting the in-set training text into the teacher model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
Further, in step S32, when the decoding submodule of the student model is trained, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output.
Further, step 4, performing end-to-end speech synthesis using the trained student model, comprises the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as inputs;
Step S43: using the neural network vocoder as the speech synthesizer to perform end-to-end speech synthesis.
The end-to-end speech synthesis training method based on knowledge distillation provided by the embodiment of the invention has the following beneficial effect: by adopting the knowledge distillation method, a teacher model is first trained, the acoustic feature parameters predicted by the teacher model are then used as input to train a student model, and the trained student model is finally used to perform end-to-end speech synthesis, which effectively mitigates the degraded listening quality of out-of-set synthesized speech caused by the mismatch between training and testing.
The invention also provides an end-to-end speech synthesis training system based on knowledge distillation, which comprises:
an acquisition module, configured to acquire the original training data;
a teacher model training module, configured to train a teacher model using the original training data;
a student model training module, configured to train the student model using the acoustic feature parameters predicted by the teacher model as training data;
and a speech synthesis module, configured to perform end-to-end speech synthesis using the trained student model.
Further, the teacher model training module includes:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and a teacher model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
Further, the student model training module includes:
a first GTA acoustic feature parameter prediction unit, configured to input the in-set training text into the teacher model and to predict and generate the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and a student model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
The end-to-end speech synthesis training system based on knowledge distillation provided by the embodiment of the invention has the following beneficial effect: using the knowledge distillation technique, the teacher model training module trains a teacher model, the student model training module trains a student model using the acoustic feature parameters predicted by the teacher model as input, and end-to-end speech synthesis is performed with the trained student model, which effectively mitigates the degraded listening quality of out-of-set synthesized speech caused by the mismatch between training and testing.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of an end-to-end speech synthesis training method based on knowledge distillation in an embodiment of the invention;
FIG. 2 is a block diagram of an end-to-end speech synthesis training system based on knowledge distillation in accordance with an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
The embodiment of the invention provides an end-to-end speech synthesis training method based on knowledge distillation which, as shown in FIG. 1, comprises the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model using the original training data;
Step 3: training a student model using the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis using the trained student model.
The working principle of the technical scheme is as follows: the inventor finds that the traditional end-to-end synthesis system has obvious reduction of the synthesis effect outside the set, and one important reason is that the model training is not matched with the test. The decoding submodule (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameter of the previous frame as an input in training, and uses the prediction output of the Decoder of the previous frame as a current input in testing, and the mismatch can cause the prediction precision of the acoustic feature parameter outside the set to be poor in testing, so that the hearing of the synthesized speech outside the set is poor.
The knowledge distillation principle is applied to training of an end-to-end voice synthesis system, after original training data are acquired, a teacher model is trained by utilizing the original training data, and then characteristic parameters predicted by the teacher model are used as training data to train a student model; finally, the trained student model is used for predicting acoustic characteristic parameters so as to perform end-to-end speech synthesis.
In the step 1, the original training data includes training audio and pronunciation text corresponding to the training audio.
The beneficial effects of the technical scheme are as follows: the knowledge distillation method is adopted, a teacher model is trained firstly, then acoustic characteristic parameters predicted by the teacher model are used as input, a student model is trained, and finally the trained student model is used for carrying out end-to-end speech synthesis, so that the problem of poor hearing feeling of the synthesized speech outside the set caused by mismatching between training and testing can be effectively solved.
In one embodiment, step 2, training the teacher model using the original training data, comprises the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
The working principle of the technical scheme is as follows: the acoustic feature parameters extracted from the training audio are called GT (Ground Truth) acoustic feature parameters.
Further, in step S22, when the decoding submodule (Decoder) of the teacher model is trained, the GT acoustic features of the current frame are used as the target output, and the GT acoustic features of the previous frame are used as the input, as sketched below.
The beneficial effect of this technical solution is that it provides the specific steps for training a teacher model using the original training data.
In one embodiment, step 3, training the student model using the acoustic feature parameters predicted by the teacher model as training data, comprises the following steps:
Step S31: inputting the in-set training text into the teacher model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
The working principle of this technical solution is as follows: GTA (Ground Truth Align) means that, during inference in the decoding submodule (Decoder), the GT acoustic feature parameters of the previous frame are used as input to predict the acoustic features of the current frame. The in-set acoustic feature parameters predicted and generated in GTA mode are the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Align mode, the GT acoustic features and the predicted acoustic feature parameters are guaranteed to be aligned in duration, which solves the problem of aligning the data durations.
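Under the same assumed interface, GTA generation might look as follows; the key property is that the predicted sequence has exactly as many frames as the GT sequence.

```python
# Sketch of GTA (Ground Truth Align) generation with the trained teacher:
# every step is conditioned on the GT previous frame. Names are assumptions.
import torch

@torch.no_grad()
def generate_gta(teacher_step, gt_frames):
    preds = []
    prev = torch.zeros_like(gt_frames[0])
    for t in range(len(gt_frames)):
        preds.append(teacher_step(prev))   # prediction for frame t
        prev = gt_frames[t]                # GT input keeps the durations aligned
    return torch.stack(preds)              # first GTA acoustic feature parameters
```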
Further, in step S32, when the decoding submodule of the student model is trained, the first GTA acoustic feature parameters of the previous frame are used as the input, and the GT acoustic feature parameters of the current frame are used as the target output, as sketched below.
The beneficial effect of this technical solution is that it provides the specific steps for training the student model using the feature parameters predicted by the teacher model as training data; the GT acoustic features and the predicted acoustic feature parameters are guaranteed to be aligned in duration, which solves the problem of aligning the data durations.
In one embodiment, step 4, performing end-to-end speech synthesis using the trained student model, comprises the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as inputs;
Step S43: using the neural network vocoder as the speech synthesizer to perform end-to-end speech synthesis.
The working principle of this technical solution is as follows: first, the student model obtained in step 3 is given the in-set training text as input, and the in-set acoustic feature parameters are predicted and generated in GTA mode; then, a neural network vocoder is trained using the training audio in the original training data and the second GTA acoustic feature parameters predicted by the student model as inputs; finally, the student model is used as the acoustic feature parameter prediction model and the neural network vocoder as the synthesizer, yielding the end-to-end speech synthesis system for final use.
The beneficial effect of this technical solution is that it provides the specific steps for performing end-to-end speech synthesis with a trained student model.
As shown in FIG. 2, an embodiment of the present invention provides an end-to-end speech synthesis training system based on knowledge distillation, including:
an acquisition module 201, configured to acquire original training data;
a teacher model training module 202, configured to train a teacher model using the original training data;
a student model training module 203, configured to train the student model using the acoustic feature parameters predicted by the teacher model as training data;
and a speech synthesis module 204, configured to perform end-to-end speech synthesis using the trained student model.
The working principle of the technical scheme is as follows: the inventor finds that the traditional end-to-end synthesis system has obvious reduction of the synthesis effect outside the set, and one important reason is that the model training is not matched with the test. The decoding submodule (Decoder) of the acoustic feature parameter prediction model uses the GT acoustic feature parameter of the previous frame as an input in training, and uses the prediction output of the Decoder of the previous frame as a current input in testing, and the mismatch can cause the prediction precision of the acoustic feature parameter outside the set to be poor in testing, so that the hearing of the synthesized speech outside the set is poor.
In the invention, knowledge distillation principle is applied to training of an end-to-end voice synthesis system, and an acquisition module 201 acquires original training data; the teacher model training module 202 trains the teacher model using the raw training data; the student model training module 203 uses acoustic feature parameters predicted by the teacher model as training data to train the student model; the speech synthesis module 204 is configured to predict acoustic feature parameters using the trained student model for end-to-end speech synthesis.
Wherein, the original training data acquired by the acquisition module 201 includes training audio and pronunciation text corresponding to the training audio.
The beneficial effects of the technical scheme are as follows: the knowledge distillation technology is adopted, a teacher model is trained by using a teacher model training module, the student model is trained by using acoustic characteristic parameters predicted by the teacher model as input, the student model is trained, and the end-to-end speech synthesis is performed by using the trained student model, so that the problem of poor hearing of the synthesized speech outside the set caused by mismatching between training and testing can be effectively solved.
In one embodiment, the teacher model training module 202 includes:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and a teacher model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model.
The working principle of this technical solution is as follows: the acoustic feature parameters extracted from the training audio by the GT acoustic feature parameter extraction unit are called GT (Ground Truth) acoustic feature parameters.
Further, when training the decoding submodule (Decoder) of the teacher model, the teacher model training unit uses the GT acoustic features of the current frame as the target output and the GT acoustic features of the previous frame as the input.
The beneficial effect of this technical solution is that training of the teacher model is realized by means of the GT acoustic feature parameter extraction unit and the teacher model training unit.
In one embodiment, the student model training module 203 includes:
a first GTA acoustic feature parameter prediction unit, configured to input the in-set training text into the teacher model and to predict and generate the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and a student model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model.
The working principle of this technical solution is as follows: GTA (Ground Truth Align) means that, during inference in the decoding submodule (Decoder), the GT acoustic feature parameters of the previous frame are used as input to predict the acoustic features of the current frame. The in-set acoustic feature parameters predicted and generated in GTA mode are the first GTA acoustic feature parameters. Because the teacher model predicts the acoustic feature parameters in Ground Truth Align mode, the GT acoustic features and the predicted acoustic feature parameters are guaranteed to be aligned in duration, which solves the problem of aligning the data durations.
Further, when training the decoding submodule of the student model, the student model training unit uses the first GTA acoustic feature parameters of the previous frame as the input and the GT acoustic feature parameters of the current frame as the target output.
The beneficial effect of this technical solution is that training of the student model is realized by means of the first GTA acoustic feature parameter prediction unit and the student model training unit; the GT acoustic features and the predicted acoustic feature parameters are guaranteed to be aligned in duration, which solves the problem of aligning the data durations.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (4)

1. An end-to-end speech synthesis training method based on knowledge distillation, characterized in that the method performs the following steps:
Step 1: acquiring original training data;
Step 2: training a teacher model using the original training data;
Step 3: training a student model using the acoustic feature parameters predicted by the teacher model as training data;
Step 4: performing end-to-end speech synthesis using the trained student model;
in step 1, the original training data comprises training audio and the pronunciation text corresponding to the training audio;
step 2, training the teacher model using the original training data, comprises the following steps:
Step S21: extracting GT acoustic feature parameters from the training audio in the original training data;
Step S22: training an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model;
step 3, training the student model using the acoustic feature parameters predicted by the teacher model as training data, comprises the following steps:
Step S31: inputting the in-set training text into the teacher model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
Step S32: training an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model;
step 4, performing end-to-end speech synthesis using the trained student model, comprises the following steps:
Step S41: using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
Step S42: training a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as inputs;
Step S43: using the neural network vocoder as the speech synthesizer to perform end-to-end speech synthesis.
2. The method of claim 1, wherein in step S22, when the decoding submodule of the teacher model is trained, the GT acoustic features of the current frame are used as the target output and the GT acoustic features of the previous frame are used as the input.
3. The method of claim 1, wherein in step S32, when the decoding submodule of the student model is trained, the first GTA acoustic feature parameters of the previous frame are used as the input and the GT acoustic feature parameters of the current frame are used as the target output.
4. An end-to-end speech synthesis training system based on knowledge distillation, comprising:
an acquisition module, configured to acquire original training data, the original training data comprising training audio and the pronunciation text corresponding to the training audio;
a teacher model training module, configured to train a teacher model using the original training data;
a student model training module, configured to train the student model using the acoustic feature parameters predicted by the teacher model as training data;
and a speech synthesis module, configured to perform end-to-end speech synthesis using the trained student model;
wherein the teacher model training module comprises:
a GT acoustic feature parameter extraction unit, configured to extract GT acoustic feature parameters from the training audio in the original training data;
and a teacher model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text in the original training data and the extracted GT acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the teacher model;
the student model training module comprises:
a first GTA acoustic feature parameter prediction unit, configured to input the in-set training text into the teacher model and to predict and generate the in-set acoustic feature parameters in GTA mode to obtain the first GTA acoustic feature parameters;
and a student model training unit, configured to train an acoustic feature parameter prediction model using the pronunciation text, the GT acoustic feature parameters, and the first GTA acoustic feature parameters as training data, the trained acoustic feature parameter prediction model serving as the student model;
and wherein performing end-to-end speech synthesis using the trained student model comprises the following steps:
using the student model as the acoustic feature parameter prediction model, inputting the in-set training pronunciation text into the student model, and predicting and generating the in-set acoustic feature parameters in GTA mode to obtain the second GTA acoustic feature parameters;
training a neural network vocoder using the training audio and the second GTA acoustic feature parameters predicted by the student model as inputs;
and using the neural network vocoder as the speech synthesizer to perform end-to-end speech synthesis.
CN202010718085.2A 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation Active CN112002303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010718085.2A CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010718085.2A CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Publications (2)

Publication Number Publication Date
CN112002303A CN112002303A (en) 2020-11-27
CN112002303B true CN112002303B (en) 2023-12-15

Family

ID=73467751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010718085.2A Active CN112002303B (en) 2020-07-23 2020-07-23 End-to-end speech synthesis training method and system based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN112002303B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735389A (en) * 2020-12-29 2021-04-30 平安科技(深圳)有限公司 Voice training method, device and equipment based on deep learning and storage medium
CN113611311A (en) * 2021-08-20 2021-11-05 天津讯飞极智科技有限公司 Voice transcription method, device, recording equipment and storage medium
CN115376484A (en) * 2022-08-18 2022-11-22 天津大学 Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10872596B2 (en) * 2017-10-19 2020-12-22 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Statistical Parametric Speech Synthesis Using Generalized Distillation Framework; Z.-C. Liu et al.; IEEE Signal Processing Letters, vol. 25, no. 5, pp. 695-699 *
Teacher-Student Training For Robust Tacotron-Based TTS; R. Liu et al.; ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6274-6278 *
Research on Speech Generation Methods Combining Articulatory Features and Deep Learning; Liu Zhengchen; China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 10; I136-28 *

Also Published As

Publication number Publication date
CN112002303A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN112002303B (en) End-to-end speech synthesis training method and system based on knowledge distillation
Valin et al. LPCNet: Improving neural speech synthesis through linear prediction
WO2021128256A1 (en) Voice conversion method, apparatus and device, and storage medium
CN107871496B (en) Speech recognition method and device
CN112017644A (en) Sound transformation system, method and application
CN108053823A (en) A kind of speech recognition system and method
CN105654939A (en) Voice synthesis method based on voice vector textual characteristics
Siuzdak et al. WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
CN102436807A (en) Method and system for automatically generating voice with stressed syllables
CN111128211B (en) Voice separation method and device
CN112509563A (en) Model training method and device and electronic equipment
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN112364125B (en) Text information extraction system and method combining reading course learning mechanism
CN111968617A (en) Voice conversion method and system for non-parallel data
CN111986646A (en) Dialect synthesis method and system based on small corpus
CN101887719A (en) Speech synthesis method, system and mobile terminal equipment with speech synthesis function
CN111724809A (en) Vocoder implementation method and device based on variational self-encoder
Yang et al. Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise
CN107274883A (en) Voice signal reconstructing method and device
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
CN113450760A (en) Method and device for converting text into voice and electronic equipment
CN107464569A (en) Vocoder
CN116844522A (en) Phonetic boundary label marking method and speech synthesis method
CN116741144A (en) Voice tone conversion method and system
CN115762471A (en) Voice synthesis method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant