CN115359778A - Adversarial and meta-learning method based on a speaker emotion speech synthesis model - Google Patents

Adversarial and meta-learning method based on a speaker emotion speech synthesis model Download PDF

Info

Publication number
CN115359778A
Authority
CN
China
Prior art keywords
emotion
training
speaker
meta
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211010973.4A
Other languages
Chinese (zh)
Inventor
张句
贡诚
王宇光
关昊天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huiyan Technology Tianjin Co ltd
Original Assignee
Huiyan Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huiyan Technology Tianjin Co ltd filed Critical Huiyan Technology Tianjin Co ltd
Priority to CN202211010973.4A priority Critical patent/CN115359778A/en
Publication of CN115359778A publication Critical patent/CN115359778A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 Architecture of speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of deep learning for speech synthesis, and in particular relates to an adversarial and meta-learning method based on a speaker emotion speech synthesis model. The method comprises data preprocessing, design of an end-to-end speech synthesis base model, and the addition of an adversarial training module for erasing timbre information from the emotion embedding representation; the adversarial training module mainly consists of a gradient reversal layer and a speaker classifier, and is followed by training based on meta-learning. Meta-learning is used to improve the generalization ability of the emotion speech synthesis model, so that it can quickly adapt to a small amount of speech data from new speakers.

Description

Adversarial and meta-learning method based on a speaker emotion speech synthesis model
Technical Field
The invention belongs to the technical field of deep learning, and in particular relates to an adversarial and meta-learning method based on a speaker emotion speech synthesis model.
Background
Speech is one of the most important tools for human communication. Human speech carries not only the symbolic content of the words but also changes in the speaker's emotion. For example, the same sentence spoken with different emotions can convey a different meaning and leave a different impression on the listener, which is what is meant by "listening for the tone behind the words". Most general-purpose speech synthesis systems focus on the naturalness and accuracy of the synthesized speech and ignore the emotional factors contained in the speech signal.
In recent years, driven by deep learning and related methods, speech synthesis technology has advanced greatly, and emotion speech synthesis in particular; because it organically integrates spoken-language analysis and emotion analysis of speech with computing technology, it lays a foundation for human-centered voice interaction systems with personalized characteristics.
At present, most research at home and abroad learns an emotion embedding of a reference audio through an unsupervised style encoder, thereby realizing end-to-end emotion speech synthesis. However, since the emotion is derived from the speech of the reference-audio (source) speaker, the timbre information of the source speaker can also be transferred into the synthesized speech, making it sound like the source speaker, or like something between the source speaker and the target speaker; this is the so-called speaker timbre leakage problem. In addition, because recording emotional data is costly, it is difficult to obtain large amounts of data covering different emotions for an arbitrary speaker; as a result, the emotional expressiveness conveyed in the synthesized speech is insufficient, which restricts the quality of emotion speech synthesis.
In order to synthesize the emotional speech of a target speaker by transferring emotion from the reference audio of a source speaker while keeping the target speaker's timbre in the synthesized speech, an emotion decoupling module based on adversarial training is provided, which erases the speaker information contained in the emotion embedding during model training. In addition, considering that recording corpora covering different emotion classes is very challenging, meta-learning is used to improve the generalization ability of the emotion speech synthesis model, so that it can quickly adapt to a small amount of speech data from new speakers.
Disclosure of Invention
The invention aims to solve the technical problems described in the background and adopts an adversarial and meta-learning method based on a speaker emotion speech synthesis model.
The technical scheme of the invention is an adversarial and meta-learning method based on a speaker emotion speech synthesis model, comprising the following steps:
Step one, data preprocessing: the text is subjected to front-end processing, paired text-audio data are used as training data, and Mel-spectrogram features are extracted;
Step two, designing an end-to-end speech synthesis base model: the design is based on the end-to-end speech synthesis model Tacotron 2, with the following modifications for the emotion speech synthesis task:
1) Speaker information is added: different speakers are encoded as different speaker IDs, the speaker IDs are used as input, and the speaker embedding representation is obtained through a look-up table (LUT);
2) An emotion encoder is added to learn the emotion embedding representation; its input is a reference audio, it consists of five one-dimensional convolution layers and a bidirectional LSTM, and the emotion embedding representation is obtained at its output;
the speaker embedded representation and the emotion embedded representation are combined with the text representation output by the text encoder of the Tacotron2 model to jointly guide the generation of the final Mel spectrum characteristics;
Step three, adding an adversarial training module: an adversarial training module is added to erase timbre information from the emotion embedding representation; it mainly consists of a gradient reversal layer and a speaker classifier;
after the adversarial training module is added, pre-training is first performed with data from a number of different speakers and emotions to obtain a base emotion speech synthesis model, which realizes emotion speech synthesis for the speakers in the training set;
the neural-network parameters at this point are defined as ϕ and are taken as the initial parameters for step four;
Step four, training based on meta-learning: the emotion speech synthesis model obtained in step three is trained again in a meta-learning manner, as follows:
1) First, a series of meta-tasks is constructed from a multi-speaker speech synthesis database; the support set (training set) and the query set (test set) of each meta-task contain K and Q samples, respectively, all from the same speaker;
each sample is defined as (x_i, y_i), where x_i is the text of the sample and y_i is its acoustic features, namely Mel-spectrogram features;
2) The following training process is performed iteratively:
a. Sample any training task m; using the support set of task m and the learning rate of task m, optimize the emotion speech synthesis model obtained in step three once and update its parameters to new values ϕ'_m. One optimization step means performing one backward pass over the trainable neural-network weights and applying a gradient-descent update;
b. Based on the once-optimized parameters ϕ'_m, compute the loss L_m of task m on its query set, and compute the gradient of L_m with respect to ϕ'_m;
where the loss is the loss function of the emotion speech synthesis model obtained in step three, referring here to the error between the acoustic features ŷ_i predicted by the model and the ground-truth features y_i of the sample;
c. Multiply the above gradient by the meta-network learning rate β and update the neural-network parameters as ϕ ← ϕ - β∇_{ϕ'_m}L_m, where ϕ on the right-hand side refers to the parameters of the neural-network model obtained after step three and the left-hand side gives the parameters after one update;
d. repeating the training process of a-c;
and step five, synthesizing audio.
In step one, the front-end processing of the text uses the Python natural language processing toolkit NLTK to perform word segmentation and Chinese character-to-pinyin conversion, and the Mel-spectrogram features are extracted and preprocessed directly with the common Python audio processing toolkit librosa, through framing, windowing and pre-emphasis of the audio.
The fifth step is specifically as follows: after the final training of the model is finished, the following steps are executed in sequence to synthesize emotional speech;
the parameters of the neural-network model obtained in step four are finally fine-tuned with a small dataset of the target speaker;
a text, a reference audio and the ID of the target speaker are input, the Mel-spectrogram features are obtained through model prediction, and the Mel-spectrogram features are finally converted into audio.
Advantageous effects
1. An end-to-end emotion speech synthesis system is built, and an unsupervised style encoder is used to learn the emotion embedding representation of the audio.
2. An emotion decoupling module based on adversarial training is designed, which erases the speaker information contained in the emotion embedding during model training.
3. Meta-learning is used to improve the generalization ability of the emotion speech synthesis model, so that it can quickly adapt to a small amount of speech data from new speakers.
Drawings
FIG. 1 illustrates a speech synthesis base model.
FIG. 2 is a diagram of the model after the adversarial module is added.
FIG. 3 is a schematic representation of the Mel spectra.
Detailed Description
The invention is further described below with reference to the figures and examples.
The adversarial and meta-learning method based on the speaker emotion speech synthesis model comprises the following specific steps:
Step one, data preprocessing. The text needs front-end processing; characters are usually used as input, and paired text-audio data serve as training data. Preprocessing also includes the extraction of Mel-spectrogram features.
For text preprocessing, the Python natural language processing toolkit NLTK is typically used for word segmentation, Chinese character-to-pinyin conversion and similar operations. For Mel-spectrogram extraction, the common Python audio processing toolkit librosa can be used directly to extract Mel-spectrogram features from audio through framing, windowing, pre-emphasis and related processing.
For example, the Chinese sentence meaning "Karl Pu accompanies his grandson playing on the slide" is preprocessed into "ka3 er3 pu3 #1 pei1 wai4 sun1 #1 wan2 hua1 ti1 #3", where #1 and #3 indicate pauses of different lengths.
For the audio corresponding to the text, the mel spectrum obtained after the pre-processing is shown in fig. 3.
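As a non-limiting illustration of this preprocessing, the Mel-spectrogram features could be extracted with librosa roughly as follows; the parameter values (sampling rate, frame length, hop length, number of Mel bands, pre-emphasis coefficient) are assumptions chosen for the sketch rather than values specified by the method:

import librosa
import numpy as np

def extract_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # Load the audio and resample to the target sampling rate.
    y, sr = librosa.load(wav_path, sr=sr)
    # Pre-emphasis boosts high frequencies before framing and windowing.
    y = librosa.effects.preemphasis(y, coef=0.97)
    # Framing and windowing happen inside the STFT; the spectrum is then
    # projected onto a Mel filter bank to obtain Mel-spectrogram features.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Convert power to a log (dB) scale, the form usually fed to Tacotron-style models.
    return librosa.power_to_db(mel, ref=np.max)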
Step two, designing an end-to-end speech synthesis base model. The invention is designed on top of the end-to-end speech synthesis model Tacotron 2, which takes text as input and predicts Mel-spectrogram features. The following modifications are made for the emotion speech synthesis task:
1) Speaker information is added: different speakers are encoded as different speaker IDs, the speaker IDs are used as input, and the speaker embedding representation is obtained through a look-up table (LUT).
For example, with three speakers, each speaker can be represented by the following one-hot input:
Speaker_1 = [1,0,0]
Speaker_2 = [0,1,0]
Speaker_3 = [0,0,1]
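A minimal PyTorch-style sketch of such a speaker look-up table is given below; the embedding dimension and module name are illustrative assumptions, and the one-hot IDs above are treated equivalently as integer indices into an nn.Embedding table:

import torch
import torch.nn as nn

class SpeakerLUT(nn.Module):
    """Look-up table mapping integer speaker IDs to speaker embedding vectors."""
    def __init__(self, num_speakers, embed_dim=64):
        super().__init__()
        self.table = nn.Embedding(num_speakers, embed_dim)

    def forward(self, speaker_ids):
        # speaker_ids: LongTensor of shape (batch,), e.g. tensor([0, 1, 2])
        return self.table(speaker_ids)   # -> (batch, embed_dim)

# Usage: Speaker_1/2/3 from the example above correspond to indices 0/1/2.
lut = SpeakerLUT(num_speakers=3)
spk_embedding = lut(torch.tensor([0, 1, 2]))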
2) An emotion encoder is added to learn the emotion embedding representation. Its input is a reference audio; it consists of five one-dimensional convolution layers and a bidirectional LSTM, and the emotion embedding representation is obtained at its output.
The speaker embedded representation and the emotion embedded representation are combined with the text representation output by the text encoder of the Tacotron2 model to jointly guide the generation of the final Mel-spectral features, as shown in FIG. 1.
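A sketch, under assumed channel sizes and kernel widths, of the emotion encoder described in 2): five one-dimensional convolution layers followed by a bidirectional LSTM, with the final hidden states concatenated as the emotion embedding. The reference audio is assumed to enter as its Mel-spectrogram:

import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Reference-audio encoder: 5 x Conv1d + BiLSTM -> emotion embedding."""
    def __init__(self, n_mels=80, channels=256, emotion_dim=128):
        super().__init__()
        convs = []
        in_ch = n_mels
        for _ in range(5):
            convs += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                      nn.BatchNorm1d(channels), nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*convs)
        self.lstm = nn.LSTM(channels, emotion_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, ref_mel):
        # ref_mel: (batch, n_mels, frames) Mel-spectrogram of the reference audio.
        x = self.convs(ref_mel)                # (batch, channels, frames)
        x = x.transpose(1, 2)                  # (batch, frames, channels)
        _, (h, _) = self.lstm(x)               # h: (2, batch, emotion_dim // 2)
        # Concatenate the forward and backward final states as the emotion embedding.
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, emotion_dim)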
Step three, adding an adversarial training module. Considering that the emotion embedding representation may contain some speaker information and interfere with the timbre of the final synthesized speech, an adversarial training module is added to erase the timbre information from the emotion embedding representation. The adversarial module mainly consists of a gradient reversal layer and a speaker classifier, as shown in the dashed box in FIG. 2.
After the adversarial training module is added, pre-training can be performed with data from a number of different speakers and emotions to obtain a base emotion speech synthesis model, which realizes emotion speech synthesis for the speakers in the training set;
the neural-network parameters at this point are defined as ϕ and are taken as the initial parameters for step four.
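The gradient reversal layer and speaker classifier of the adversarial module can be sketched as follows (an illustrative, non-authoritative example; the layer sizes and the reversal strength lambda are assumptions). In the backward pass the layer negates the gradient flowing into the emotion encoder, so the encoder is pushed to remove speaker-discriminative information from the emotion embedding while the classifier still tries to predict the speaker:

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backwards."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class SpeakerClassifier(nn.Module):
    """Adversarial speaker classifier applied to the emotion embedding."""
    def __init__(self, emotion_dim=128, num_speakers=3, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(nn.Linear(emotion_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_speakers))

    def forward(self, emotion_embedding):
        reversed_emb = GradReverse.apply(emotion_embedding, self.lamb)
        return self.net(reversed_emb)          # speaker logits

# During pre-training, an adversarial speaker loss (assumed weighting) is added
# to the synthesis loss:
# speaker_loss = nn.CrossEntropyLoss()(classifier(emotion_emb), speaker_ids)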
Step four, training based on meta-learning. To improve the generalization ability of the model and reduce its dependence on data, the model obtained in step three is trained again in a meta-learning manner. Specifically, the method comprises the following steps:
First, a series of meta-tasks (a meta-task set) is constructed from a multi-speaker speech synthesis database; the support set and the query set of each meta-task contain K and Q samples, respectively, all from the same speaker;
each sample is defined as (x_i, y_i), where x_i is the text of the sample and y_i is its acoustic features, namely Mel-spectrogram features;
Second, the following training process is performed iteratively:
a. Sample any training task m; using the support set of task m and the learning rate of task m, optimize the network once and update its parameters to new values ϕ'_m. One optimization step means performing one backward pass over the trainable neural-network weights and applying a gradient-descent update.
b. Based on the once-optimized parameters ϕ'_m, compute the loss L_m of task m on its query set, and compute the gradient of L_m with respect to ϕ'_m. Here the loss is the loss function of the emotion speech synthesis model obtained in step three, i.e., the error between the acoustic features ŷ_i predicted by the model and the ground-truth features y_i of the sample.
c. Multiply the above gradient by the meta-network learning rate β and update the parameters of the original network as ϕ ← ϕ - β∇_{ϕ'_m}L_m, where ϕ on the right-hand side refers to the parameters of the neural-network model obtained after step three and the left-hand side gives the parameters after one update.
d. Repeat the training process a-c above. (The number of iterations depends on the size of the training set; for example, if the training set contains only about 100 samples, training can stop after iterating over all meta-tasks roughly 20 times.)
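A simplified sketch of the iterative process a-d above, in the style of first-order meta-learning; the helper names (model, tasks, compute_loss) and the two learning rates are assumptions made for illustration, not part of the claimed method:

import copy
import random
import torch

def meta_train(model, tasks, compute_loss, task_lr=1e-4, meta_lr=1e-5, steps=2000):
    """tasks: list of (support_set, query_set) pairs, one per speaker meta-task."""
    for _ in range(steps):
        support_set, query_set = random.choice(tasks)            # a. sample task m

        # a. one inner optimization step on the support set: phi -> phi'_m
        fast_model = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(fast_model.parameters(), lr=task_lr)
        inner_opt.zero_grad()
        compute_loss(fast_model, support_set).backward()
        inner_opt.step()

        # b. query-set loss L_m and its gradient with respect to phi'_m
        fast_model.zero_grad()
        compute_loss(fast_model, query_set).backward()

        # c. meta-update: phi <- phi - beta * gradient (first-order approximation)
        with torch.no_grad():
            for p, fp in zip(model.parameters(), fast_model.parameters()):
                if fp.grad is not None:
                    p -= meta_lr * fp.grad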
And step five, synthesizing audio. After the final training of the model is completed, the following steps may be performed sequentially to synthesize emotional speech.
1. Take the network parameters obtained in step four and perform a final fine-tuning (finetune) on them with a small dataset of the target speaker.
When the timbre of a given speaker needs to be synthesized, the available audio of that speaker may amount to only 3-5 minutes, so the network is fine-tuned at the end with this small amount of data.
2. Input a text, a reference audio (consistent with the target emotion) and the ID of the target speaker; the Mel-spectrogram features are obtained through model prediction and finally converted into audio.
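A hedged sketch of this final adaptation and synthesis flow; the interface names (model.compute_loss, model.infer, the vocoder callable) are assumptions for illustration, since the patent does not prescribe a specific model API or vocoder:

import torch

def finetune_and_synthesize(model, vocoder, target_data, text, ref_audio_mel,
                            target_speaker_id, lr=1e-5, finetune_steps=200):
    # 1. Fine-tune the meta-trained parameters on the target speaker's few minutes of data.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(finetune_steps):
        for batch in target_data:                  # small dataset of (text, mel) pairs
            opt.zero_grad()
            model.compute_loss(batch).backward()   # assumed training interface
            opt.step()

    # 2. Predict the Mel spectrogram from text + reference audio (target emotion) + speaker ID,
    #    then convert it to a waveform with a neural vocoder.
    model.eval()
    with torch.no_grad():
        mel = model.infer(text, ref_audio_mel, target_speaker_id)   # assumed interface
        waveform = vocoder(mel)
    return waveform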

Claims (3)

1. An adversarial and meta-learning method based on a speaker emotion speech synthesis model, characterized by comprising the following steps:
step one, data preprocessing: the text is subjected to front-end processing, paired text-audio data are used as training data, and Mel-spectrogram features are extracted;
step two, designing an end-to-end speech synthesis base model: the design is based on the end-to-end speech synthesis model Tacotron 2, with the following modifications for the emotion speech synthesis task:
adding speaker information, coding different speakers into different speaker IDs, simultaneously using the speaker IDs as input, and obtaining speaker embedded representation through a look-up table LUT;
adding an emotion encoder for learning emotion embedded representation, wherein the input of the emotion encoder is a reference audio, the emotion encoder comprises a 5-layer one-dimensional convolution and a bidirectional LSTM, and the emotion embedded representation can be obtained after the emotion encoder;
the speaker embedded representation and the emotion embedded representation are combined with the text representation output by the text encoder of the Tacotron2 model to jointly guide the generation of the final Mel spectrum characteristics;
step three, adding an adversarial training module: an adversarial training module is added to erase timbre information from the emotion embedding representation; the adversarial training module mainly consists of a gradient reversal layer and a speaker classifier;
after the adversarial training module is added, pre-training is first performed with data from a number of different speakers and emotions to obtain a base emotion speech synthesis model, which realizes emotion speech synthesis for the speakers in the training set;
the neural-network parameters at this point are defined as ϕ and are taken as the initial parameters for step four;
step four, training based on meta-learning: the emotion speech synthesis model obtained in step three is trained again in a meta-learning manner, as follows:
first, a series of meta-tasks is constructed from a multi-speaker speech synthesis database; the support set (training set) and the query set (test set) of each meta-task contain K and Q samples, respectively, all from the same speaker;
each sample is defined as (x_i, y_i), where x_i is the text of the sample and y_i is its acoustic features, namely Mel-spectrogram features;
the following training process is performed iteratively:
sampling any training task m; using the support set of task m and the learning rate of task m, optimizing ϕ once and updating it to obtain new parameters ϕ'_m, where one optimization step means performing one backward pass over the trainable neural-network weights and applying a gradient-descent update;
based on the once-optimized parameters ϕ'_m, computing the loss L_m of task m on its query set, and computing the gradient of L_m with respect to ϕ'_m;
wherein the loss is the loss function of the emotion speech synthesis model obtained in step three, specifically the error between the acoustic features ŷ_i predicted by the model and the ground-truth features y_i of the sample;
multiplying the above gradient by the meta-network learning rate β and updating the neural-network parameters as ϕ ← ϕ - β∇_{ϕ'_m}L_m, where ϕ on the right-hand side refers to the parameters of the neural-network model obtained after step three and the left-hand side gives the parameters after one update;
repeating the training process a to c above;
and step five, synthesizing audio.
2. The adversarial and meta-learning method based on the speaker emotion speech synthesis model as claimed in claim 1, wherein in step one the Python natural language processing toolkit NLTK is used for the front-end processing of the text to perform word segmentation and Chinese character-to-pinyin conversion, and the common Python audio processing toolkit librosa is used directly to extract Mel-spectrogram features from audio through framing, windowing and pre-emphasis.
3. The adversarial and meta-learning method based on the speaker emotion speech synthesis model as claimed in claim 1, wherein step five specifically comprises: after the final training of the model is finished, executing the following steps in sequence to synthesize emotional speech;
carrying out a final fine-tuning of the parameters of the neural-network model obtained in step four with a small dataset of the target speaker;
inputting a text, a reference audio and the ID of the target speaker, obtaining Mel-spectrogram features through model prediction, and finally converting the Mel-spectrogram features into audio.
CN202211010973.4A 2022-08-23 2022-08-23 Adversarial and meta-learning method based on speaker emotion speech synthesis model Pending CN115359778A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211010973.4A CN115359778A (en) 2022-08-23 2022-08-23 Adversarial and meta-learning method based on speaker emotion speech synthesis model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211010973.4A CN115359778A (en) 2022-08-23 2022-08-23 Adversarial and meta-learning method based on speaker emotion speech synthesis model

Publications (1)

Publication Number Publication Date
CN115359778A true CN115359778A (en) 2022-11-18

Family

ID=84002154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211010973.4A Pending CN115359778A (en) 2022-08-23 2022-08-23 Confrontation and meta-learning method based on speaker emotion voice synthesis model

Country Status (1)

Country Link
CN (1) CN115359778A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496944A (en) * 2024-01-03 2024-02-02 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117496944B (en) * 2024-01-03 2024-03-22 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system

Similar Documents

Publication Publication Date Title
Han et al. Semantic-preserved communication system for highly efficient speech transmission
Mehrish et al. A review of deep learning techniques for speech processing
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN112863483A (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN111433847B (en) Voice conversion method, training method, intelligent device and storage medium
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN112017644A (en) Sound transformation system, method and application
Sheikhan et al. Using DTW neural–based MFCC warping to improve emotional speech recognition
Zhu et al. Phone-to-audio alignment without text: A semi-supervised approach
CN110767210A (en) Method and device for generating personalized voice
CN113539232B (en) Voice synthesis method based on lesson-admiring voice data set
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Wang et al. Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation.
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
CN112786018A (en) Speech conversion and related model training method, electronic equipment and storage device
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
CN115547290A (en) Mixed reading voice synthesis method based on mixed text representation and speaker confrontation
Laurinčiukaitė et al. Lithuanian Speech Corpus Liepa for development of human-computer interfaces working in voice recognition and synthesis mode
CN115359778A (en) 2022-11-18 Adversarial and meta-learning method based on speaker emotion speech synthesis model
Mei et al. A particular character speech synthesis system based on deep learning
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
Daouad et al. An automatic speech recognition system for isolated Amazigh word using 1D & 2D CNN-LSTM architecture
CN116092471A (en) Multi-style personalized Tibetan language speech synthesis model oriented to low-resource condition
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination