WO2020118521A1 - Multi-speaker neural text-to-speech synthesis - Google Patents

Multi-speaker neural text-to-speech synthesis

Info

Publication number
WO2020118521A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
neural
acoustic feature
space information
latent space
Prior art date
Application number
PCT/CN2018/120300
Other languages
English (en)
Inventor
Yan Deng
Lei He
Original Assignee
Microsoft Technology Licensing, Llc
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc filed Critical Microsoft Technology Licensing, Llc
Priority to CN201880091361.8A priority Critical patent/CN111954903B/zh
Priority to PCT/CN2018/120300 priority patent/WO2020118521A1/fr
Priority to US17/293,640 priority patent/US20220013106A1/en
Priority to EP18942805.5A priority patent/EP3895159A4/fr
Publication of WO2020118521A1 publication Critical patent/WO2020118521A1/fr

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers

Definitions

  • Text-to-speech (TTS) synthesis aims at generating a corresponding speech waveform based on a text input.
  • the TTS synthesis is widely applied to, e.g., role playing in a fairy tale, speech-to-speech translation, speech customization for certain users, etc.
  • Neural TTS systems are being adopted more and more for implementing TTS synthesis, and have become one of the most popular directions in the Artificial Intelligence (AI) field in recent years.
  • the neural TTS system may predict acoustic features based on a text input, and further generate a speech waveform based on the predicted acoustic features.
  • the neural TTS system is modeled in an end-to-end structure and may be trained directly based on text-speech data pairs.
  • the neural TTS system may jointly optimize pronunciation, prosody, etc. of speech, which results in more natural synthesized speech than the traditional TTS techniques.
  • Embodiments of the present disclosure propose a method and an apparatus for generating speech through multi-speaker neural TTS synthesis.
  • a text input may be received.
  • Speaker latent space information of a target speaker may be provided through at least one speaker model.
  • At least one acoustic feature may be predicted through an acoustic feature predictor based on the text input and the speaker latent space information.
  • a speech waveform corresponding to the text input may be generated through a neural vocoder based on the at least one acoustic feature and the speaker latent space information.
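The four steps summarized in the bullets above form a simple pipeline. The sketch below illustrates that flow; the class and method names (MultiSpeakerTTS, embed, predict, generate) are hypothetical placeholders for illustration, not part of the disclosure.

```python
# Minimal sketch of the claimed flow, assuming hypothetical component interfaces.
from dataclasses import dataclass
from typing import Any

@dataclass
class MultiSpeakerTTS:
    speaker_model: Any               # provides speaker latent space information
    acoustic_feature_predictor: Any  # predicts acoustic features from text + speaker info
    neural_vocoder: Any              # generates a waveform from features + speaker info

    def synthesize(self, text: str, target_speaker_id: str):
        # 1) provide speaker latent space information, e.g., a speaker embedding vector
        speaker_embedding = self.speaker_model.embed(target_speaker_id)
        # 2) predict at least one acoustic feature from the text input and the embedding
        acoustic_features = self.acoustic_feature_predictor.predict(text, speaker_embedding)
        # 3) generate the speech waveform conditioned on the features and the embedding
        return self.neural_vocoder.generate(acoustic_features, speaker_embedding)
```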
  • FIG. 1 illustrates an exemplary traditional neural TTS system.
  • FIG. 2 illustrates an exemplary architecture of a multi-speaker neural TTS system according to an embodiment.
  • FIG. 3 illustrates exemplary implementations of a speaker model according to an embodiment.
  • FIG. 4 illustrates an exemplary implementation of a speaker encoder according to an embodiment.
  • FIG. 5 illustrates an exemplary implementation of a multi-speaker neural TTS system according to an embodiment.
  • FIG. 6 illustrates an exemplary implementation of a multi-speaker neural TTS system according to an embodiment.
  • FIG. 7 illustrates an exemplary implementation of an acoustic feature predictor according to an embodiment.
  • FIG. 8 illustrates an exemplary implementation of a neural vocoder according to an embodiment.
  • FIG. 9 illustrates an exemplary process for training a multi-speaker neural TTS system according to an embodiment.
  • FIG. 10 illustrates an exemplary process for updating a multi-speaker neural TTS system according to an embodiment.
  • FIG. 11 illustrates an exemplary process for updating a multi-speaker neural TTS system according to an embodiment.
  • FIG. 12 illustrates an exemplary processing flow for generating a speech waveform according to an embodiment.
  • FIG. 13 illustrates an exemplary architecture of a multi-speaker neural TTS system according to an embodiment.
  • FIG. 14 illustrates a flowchart of an exemplary method for generating speech through multi-speaker neural TTS synthesis according to an embodiment.
  • FIG. 15 illustrates an exemplary apparatus for generating speech through multi-speaker neural TTS synthesis according to an embodiment.
  • FIG. 16 illustrates an exemplary apparatus for generating speech through multi-speaker neural TTS synthesis according to an embodiment.
  • although a neural TTS system may generate natural speech with high fidelity, it needs a large amount of text-speech training data pairs due to its end-to-end model nature.
  • a training corpus of around 10+ hours of speech may still not be enough for training a good end-to-end neural TTS system.
  • a corpus may refer to a set of speeches, each speech being attached with a corresponding text, and thus a corpus may provide a plurality of text-speech data pairs.
  • a challenge for the neural TTS system is its generalization ability. Degradation of naturalness often happens when synthesizing an out-of-domain text, especially a long text with a rather complex context.
  • an out-of-domain text refers to a text input which is not included in a training corpus, or for which no relevant text input is included in the training corpus.
  • the limits of the generation model architecture in the neural TTS system may result in various out-of-domain errors, e.g., wrong pronunciation, strange prosody, repeated or skipped words/phonemes, etc.
  • although adding more training data is a brute-force solution, such heavy data requirements cannot be satisfied by using a single-speaker corpus, which always provides limited text-speech data pairs.
  • training data may be augmented by combining corpuses of multiple speakers into a multi-speaker corpus set.
  • the multi-speaker corpus set may be used for training a multi-speaker neural TTS system.
  • the multi-speaker neural TTS system may generate better speech than a single-speaker TTS system, and can be used for creating customized speech using a limited-size corpus.
  • rich content and speaker information in the multi-speaker corpus set are not well modeled, and the generated speech still suffers from unnaturalness and a muffled quality.
  • speaker similarity is also low for a target speaker having only a small corpus. The overall performance of such a system is still far from actual application requirements.
  • Embodiments of the present disclosure propose new approaches to building a multi-speaker neural TTS system with a well-designed multi-speaker corpus set.
  • a high-quality multi-speaker corpus set may be prepared in consideration of content coverage, speaker variety, style variety, etc.
  • the corpus set may have wide content coverage in various knowledge domains, thus the multi-speaker neural TTS system may leverage content from different speakers in various domains and perform better in terms of generalization.
  • speakers in the corpus set may have a balanced distribution in terms of age, gender, accents, etc., which makes it much easier to create speech for a target speaker having only a small corpus. This may facilitate creating high-fidelity customized speech through the multi-speaker neural TTS system.
  • the corpus set as discussed above will help the multi-speaker neural TTS system generate close-to-human speech for out-of-domain text inputs, especially for long sentences with complex context, thus enriching premium voices.
  • the embodiments of the present disclosure propose a new model architecture for the multi-speaker neural TTS system in order to make better use of the multi-speaker corpus set and improve speech generalization ability.
  • the multi-speaker neural TTS system may be built with full utilization of latent space information of speakers in the corpus set.
  • the multi-speaker neural TTS system may be further updated, e.g., retrained by a subset of training data pairs in the corpus set, adapted to a target speaker with a corpus of the target speaker, etc.
  • the multi-speaker neural TTS system may be further retrained or refined by a corpus of at least one speaker in the corpus set. For example, when it is to generate speech for a target speaker, e.g., simulating the target speaker’s voice, the multi-speaker neural TTS system may be adapted to the target speaker through being updated or retrained by a corpus of the target speaker. Accordingly, the multi-speaker neural TTS system may generate high quality speech with high speaker similarity.
  • FIG. 1 illustrates an exemplary traditional neural TTS system 100.
  • the neural TTS system 100 may be configured for receiving a text input 102 and generating a speech waveform 106 corresponding to the text input 102.
  • the text input 102 may be a word, phrase, sentence, etc. It should be appreciated that although FIG. 1 shows the text input 102 being provided directly to the neural TTS system 100, the text input 102 may also first be split into a sequence of elements, e.g., a phoneme sequence, a grapheme sequence, a character sequence, etc., through various existing techniques, e.g., Letter-to-Sound (LTS), and the sequence may then be provided to the neural TTS system 100 as input.
  • a “text input” may also be broadly interpreted as a sequence of elements obtained from the text input, such as a phoneme sequence, a grapheme sequence, a character sequence, etc.
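As a concrete illustration of this front-end step, a text input could be converted to a character sequence trivially, or to a phoneme sequence with an off-the-shelf letter-to-sound tool. The sketch below assumes the open-source g2p_en package, which is not named by the disclosure; it is just one possible choice.

```python
# Hypothetical TTS front-end: turn a text input into a sequence of elements.
def to_character_sequence(text: str) -> list[str]:
    """Character sequence, e.g., "TTS" -> ['t', 't', 's']."""
    return list(text.lower())

def to_phoneme_sequence(text: str) -> list[str]:
    """Phoneme sequence via a letter-to-sound tool (assumes `pip install g2p_en`)."""
    from g2p_en import G2p
    return G2p()(text)  # e.g., "speech" -> ['S', 'P', 'IY1', 'CH']

print(to_character_sequence("Neural TTS"))
```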
  • the neural TTS system 100 may comprise an acoustic feature predictor 110.
  • the acoustic feature predictor 110 may predict acoustic features 104 from the text input 102.
  • the acoustic features 104 may comprise various traditional TTS acoustic features, e.g., mel-spectrum, line spectral pairs (LSP), etc.
  • the acoustic feature predictor 110 may be based on various model architectures, e.g., a sequence-to-sequence model architecture, etc.
  • FIG. 1 shows an exemplary sequence-to-sequence acoustic feature predictor 110, which may comprise an encoder 112, an attention unit 114 and a decoder 116.
  • the encoder 112 may convert information contained in the text input 102 into a space that is more robust and more suitable to learn alignment with acoustic features, e.g., converting the information in the text input 102 into text features in the space.
  • the encoder 112 may be based on various network structures, e.g., a network structure comprising a combination of a plurality of convolutional neural network (CNN) layers and a plurality of recurrent neural network (RNN) layers, a network structure comprising a combination of 1-D convolutional filters, highway networks and bi-directional RNN, and so on.
  • the attention unit 114 may implement an attention mechanism which acts as a bridge connecting the encoder 112 and the decoder 116.
  • the attention mechanism may facilitate alignment between the text features output by the encoder 112 and acoustic features.
  • Various types of attention mechanism may be implemented by the attention unit 114, e.g., soft attention, hard attention, location sensitive attention, Gaussian Mixture Model (GMM) attention, etc.
  • the decoder 116 may map the text features output by the encoder 112 to the acoustic features 104 under impacts by the attention mechanism in the attention unit 114.
  • the decoder 116 may be based on various network structures, e.g., a network structure comprising a combination of feed-forward layers, Long Short Term Memory (LSTM) layers and CNN layers, and so on.
  • the neural TTS system 100 may comprise a neural vocoder 120.
  • the neural vocoder 120 may generate the speech waveform 106 based on the predicted acoustic features 104 output by the acoustic feature predictor 110.
  • the neural vocoder 120 may be based on various network structures, e.g., a network structure which is based on neural generative models, and so on.
  • FIG. 2 illustrates an exemplary architecture of a multi-speaker neural TTS system 200 according to an embodiment.
  • the multi-speaker neural TTS system 200 may generate speech for a variety of speakers involved in a multi-speaker corpus set used for training the multi-speaker neural TTS system 200, and may also generate speech for a new speaker.
  • the multi-speaker neural TTS system 200 may consider discriminative information in a speaker latent space, e.g., speaker latent space information.
  • the speaker latent space information may be used for controlling to generate speech with characteristics of the target speaker.
  • the multi-speaker neural TTS system 200 may be configured for receiving a text input 202, and generating a speech waveform 206 corresponding to the text input 202 in a target speaker’s voice.
  • the multi-speaker neural TTS system 200 may comprise an acoustic feature predictor 210, a neural vocoder 220 and a speaker model 230.
  • the speaker model 230 may provide speaker latent space information 232 of a target speaker.
  • the speaker latent space information 232 may be representations of speaker characteristics in the speaker latent space, e.g., a speaker embedding vector of the target speaker.
  • the speaker latent space information 232 may be used as additional information, e.g., a condition, for the acoustic feature predictor 210 and the neural vocoder 220. Accordingly, the speaker latent space information 232 may be considered during the processing by the acoustic feature predictor 210 and the neural vocoder 220.
  • although, in the following discussion, a speaker embedding vector is provided by a speaker model, the speaker embedding vector is merely an exemplary instance of speaker latent space information provided by the speaker model, and those operations or processes discussed in connection with the speaker embedding vector may also be applied to any other instances of the speaker latent space information in a similar way.
  • Basic functions and structures of the acoustic feature predictor 210 may be similar to those of the acoustic feature predictor 110 in FIG. 1, except that the acoustic feature predictor 210 further takes the speaker latent space information 232 into consideration.
  • the acoustic feature predictor 210 may predict acoustic features 204 based on the text input 202 and the speaker latent space information 232.
  • the acoustic feature predictor 210 may comprise an encoder 212, an attention unit 214 and a decoder 216.
  • the speaker latent space information 232 may be combined with an output from the encoder 212, and then passed to the attention unit 214.
  • the attention mechanism in the attention unit 214 may utilize the combination of the latent space information 232 and the output from the encoder 212 for impacting the processing by the decoder 216. Accordingly, acoustic features output by the decoder 216 may be associated with the target speaker.
  • the neural vocoder 220 may generate the speech waveform 206 based on the acoustic features 204 and the speaker latent space information 232. In an implementation, the neural vocoder 220 may generate the speech waveform 206 sample by sample, wherein a collection of the samples forms the speech waveform 206.
  • the multi-speaker neural TTS system 200 may generate highly natural output speech which sounds very similar to the target speaker.
  • FIG. 3 illustrates exemplary implementations of a speaker model 300 according to an embodiment.
  • the speaker model 300 may correspond to the speaker model 230 in FIG. 2.
  • the speaker model 300 may be implemented in various approaches.
  • the speaker model 300 may be implemented through a speaker embedding selector 310.
  • the speaker embedding selector 310 may obtain identity information 302 of a target speaker, which may be any type of information capable of distinguishing the target speaker from other speakers, e.g., a random or assigned number for the target speaker, the name of the target speaker, description information of the target speaker, etc.; the identity information is briefly denoted as “target speaker ID” hereinafter.
  • the speaker embedding selector 310 may try to retrieve a speaker embedding vector corresponding to the target speaker ID 302 from a speaker embedding vector database 312.
  • the speaker embedding vector database 312 may comprise a plurality of speaker embedding vectors corresponding to a plurality of speakers respectively.
  • the speaker embedding vector database 312 may be established by collecting speaker embedding vectors of the speakers in a multi-speaker corpus set during training of the multi-speaker neural TTS system, or by collecting speaker embedding vectors of previous target speakers while applying the multi-speaker neural TTS system.
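A minimal sketch of such a selector follows; an in-memory dictionary stands in for the speaker embedding vector database 312 (an assumption made for illustration, as the disclosure does not prescribe a storage mechanism).

```python
import numpy as np

class SpeakerEmbeddingSelector:
    """Looks up a stored speaker embedding vector by target speaker ID (sketch)."""

    def __init__(self):
        self._db: dict[str, np.ndarray] = {}  # stands in for the database 312

    def add(self, speaker_id: str, embedding: np.ndarray) -> None:
        self._db[speaker_id] = embedding

    def select(self, speaker_id: str) -> np.ndarray:
        # An unknown ID would instead be handled by the speaker encoder path (FIG. 3 / FIG. 12).
        return self._db[speaker_id]

# usage
selector = SpeakerEmbeddingSelector()
selector.add("speaker_1", np.random.randn(256).astype(np.float32))
embedding = selector.select("speaker_1")
```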
  • the speaker model 300 may be implemented through a speaker encoder 320.
  • the speaker encoder 320 may generate a speaker embedding vector corresponding to the target speaker based on a corpus 304 of the target speaker.
  • the corpus 304 of the target speaker may be obtained, which includes a plurality of speech waveforms of the target speaker. Acoustic features may be extracted from the speech waveforms in the corpus 304 through various traditional techniques, and provided to the speaker encoder 320.
  • the speaker encoder 320 may generate the speaker embedding vector corresponding to the target speaker based on the acoustic features of the target speaker.
  • the speaker encoder 320 may be implemented by various techniques.
  • the speaker encoder may be a neural network for generating an embedding vector based on acoustic features.
  • FIG. 4 illustrates an exemplary implementation of a speaker encoder 400 according to an embodiment.
  • the speaker encoder 400 may correspond to the speaker encoder 320 in FIG. 3.
  • the speaker encoder 400 may be based on a neural network which is used for generating a speaker embedding vector 404 based on acoustic features 402.
  • the speaker encoder 400 may sequentially comprise a plurality of convolutional layers 410, average pooling 420, a plurality of fully connected (FC) layers 430 and an affine projection 440.
  • the speaker embedding vector 404 may be formed by L2-normalization of projection output.
  • the speaker encoder 400 may be trained on a corpus set of multiple speakers and is designed for text-independent speaker recognition. Thus, the speaker encoder 400 may provide a better estimation of speaker embedding vectors independently of content.
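A minimal PyTorch sketch of an encoder with this layer sequence is shown below; the number of layers, channel widths and embedding size are assumptions for illustration, not values given by the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps acoustic features (batch, frames, n_mels) to an L2-normalized speaker embedding."""

    def __init__(self, n_mels: int = 80, channels: int = 512, embed_dim: int = 256):
        super().__init__()
        self.convs = nn.Sequential(                       # convolutional layers 410
            nn.Conv1d(n_mels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(                          # fully connected layers 430
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.ReLU(),
        )
        self.proj = nn.Linear(channels, embed_dim)        # affine projection 440

    def forward(self, acoustic_features: torch.Tensor) -> torch.Tensor:
        x = self.convs(acoustic_features.transpose(1, 2))  # (batch, channels, frames)
        x = x.mean(dim=2)                                  # average pooling 420 over time
        x = self.fc(x)
        return F.normalize(self.proj(x), p=2, dim=-1)      # L2-normalized embedding 404

# usage: 100 frames of 80-dim acoustic features -> one 256-dim speaker embedding
embedding = SpeakerEncoder()(torch.randn(1, 100, 80))
```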
  • FIG. 5 illustrates an exemplary implementation 500 of a multi-speaker neural TTS system according to an embodiment.
  • the implementation 500 shows an exemplary structure of the multi-speaker neural TTS system 200 in FIG. 2, wherein the speaker model 230 in FIG. 2 is implemented as a single speaker model 510.
  • the speaker model 510 may provide a speaker embedding vector 512 of a target speaker.
  • the speaker embedding vector 512 may be provided to an acoustic feature predictor 520 and a neural vocoder 530 respectively.
  • the acoustic feature predictor 520 may receive a text input 502 and predict acoustic features 504 based on the text input 502 and the speaker embedding vector 512.
  • the neural vocoder 530 may generate a speech waveform 506 corresponding to the text input 502 based on the acoustic features 504 and the speaker embedding vector 512.
  • FIG. 6 illustrates an exemplary implementation 600 of a multi-speaker neural TTS system according to an embodiment.
  • the implementation 600 shows an exemplary structure of the multi-speaker neural TTS system 200 in FIG. 2, wherein the speaker model 230 in FIG. 2 is implemented as two different speaker models 610 and 630.
  • the speaker model 610 and the speaker model 630 are established for providing speaker embedding vectors of the target speaker to an acoustic feature predictor 620 and a neural vocoder 640, respectively.
  • the speaker model 610 may provide a speaker embedding vector 612 of the target speaker to the acoustic feature predictor 620.
  • the acoustic feature predictor 620 may receive a text input 602 and predict acoustic features 604 based on the text input 602 and the speaker embedding vector 612.
  • the speaker model 630 may provide a speaker embedding vector 632 of the target speaker to the neural vocoder 640.
  • the neural vocoder 640 may generate a speech waveform 606 corresponding to the text input 602 based on the acoustic features 604 and the speaker embedding vector 632.
  • FIG. 7 illustrates an exemplary implementation of an acoustic feature predictor 700 according to an embodiment.
  • the acoustic feature predictor 700 may correspond to the acoustic feature predictor 210 in FIG. 2, the acoustic feature predictor 520 in FIG. 5, or the acoustic feature predictor 620 in FIG. 6.
  • the acoustic feature predictor 700 may comprise an encoder 710, an attention unit 720 and a decoder 730.
  • a text input 702 may be provided to the encoder 710 which may correspond to the encoder 212 in FIG. 2.
  • a text embedding unit 712 in the encoder 710 may convert the text input 702 into a text embedding vector, and the text embedding vector may be further processed through a plurality of convolutional layers 714 and a bi-directional LSTM (BLSTM) 716 in the encoder 710.
  • the encoder 710 may output text features corresponding to the text input 702, which are further combined with a speaker embedding vector 704.
  • a concatenating unit 718 may be used for providing a combination of the speaker embedding vector 704 and the text features, wherein the speaker embedding vector 704 may correspond to the speaker latent space information 232 in FIG. 2, the speaker embedding vector 512 in FIG. 5, or the speaker embedding vector 612 in FIG. 6.
  • the combination of the speaker embedding vector 704 and the text features may be provided to the attention unit 720 which may correspond to the attention unit 214 in FIG. 2.
  • An attention mechanism implemented in the attention unit 720 may utilize the combination of the speaker embedding vector 704 and the text features to impact the processing by the decoder 730, wherein the decoder 730 may correspond to the decoder 216 in FIG. 2.
  • the decoder 730 may comprise a pre-net 732 consisting of feed-forward layers, a uni-directional LSTM (ULSTM) 734, a linear projection 736 and a post-net 738 consisting of convolutional layers.
  • the ULSTM 734 may receive an input from the pre-net 732 and provide its output to the linear projection 736, and meanwhile the processing by the ULSTM 734 is impacted by the attention unit 720.
  • the linear projection 736 may provide its output to the pre-net 732 and the post-net 738 respectively. Finally, an output from the post-net 738 and the output from the linear projection 736 may be combined so as to produce acoustic features 706.
  • the acoustic features 706 may correspond to the acoustic features 204 in FIG. 2, the acoustic features 504 in FIG. 5, or the acoustic features 604 in FIG. 6.
  • the linear projection 736 may also be used for generating stop tokens.
  • the structure of the acoustic feature predictor 700 in FIG. 7 is exemplary, and depending on specific application designs and requirements, the acoustic feature predictor 700 may be implemented in any other approaches.
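To make the encoder side of FIG. 7 more concrete, the PyTorch sketch below follows the described sequence of text embedding, convolutional layers and a BLSTM, and then concatenates a speaker embedding with the text features; the attention unit and decoder loop are omitted, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of encoder 710 plus concatenating unit 718 (attention/decoder omitted)."""

    def __init__(self, vocab_size: int = 100, dim: int = 512, speaker_dim: int = 256):
        super().__init__()
        self.text_embedding = nn.Embedding(vocab_size, dim)   # text embedding unit 712
        self.convs = nn.Sequential(                           # convolutional layers 714
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)  # BLSTM 716

    def forward(self, element_ids: torch.Tensor, speaker_embedding: torch.Tensor) -> torch.Tensor:
        x = self.text_embedding(element_ids)                  # (batch, time, dim)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)     # convolve over the time axis
        text_features, _ = self.blstm(x)                      # (batch, time, dim)
        # concatenating unit 718: broadcast the speaker embedding across all time steps
        spk = speaker_embedding.unsqueeze(1).expand(-1, text_features.size(1), -1)
        return torch.cat([text_features, spk], dim=-1)        # passed on to the attention unit 720

# usage: 20 phoneme IDs plus a 256-dim speaker embedding -> (1, 20, 768) conditioned features
features = TextEncoder()(torch.randint(0, 100, (1, 20)), torch.randn(1, 256))
```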
  • FIG. 8 illustrates an exemplary implementation of a neural vocoder 800 according to an embodiment.
  • the neural vocoder 800 may correspond to the neural vocoder 220 in FIG. 2, the neural vocoder 530 in FIG. 5, or the neural vocoder 640 in FIG. 6.
  • speaker characteristics may be further considered by the neural vocoder, such that the neural vocoder may get more information of the target speaker in a speaker latent space. Since a speaker embedding vector which reflects speaker characteristics may have different dimensions and different value ranges from acoustic features, the speaker embedding vector and the acoustic features may first be transformed into the same dimension with a similar dynamic range of values through, e.g., neural networks.
  • a speaker embedding vector 804 may be input to a neural network 810.
  • the speaker embedding vector 804 may correspond to the speaker latent space information 232 in FIG. 2, the speaker embedding vector 512 in FIG. 5, or the speaker embedding vector 612 in FIG. 6.
  • the neural network 810 may be based on various structures, e.g., a 1×1 convolutional layer 812. Through the neural network 810, a transformed speaker embedding vector may be obtained.
  • acoustic features 802 which may correspond to the acoustic features 706 in FIG. 7, may be input to a neural network 820.
  • the neural network 820 may be based on various structures, e.g., a Quasi-Recurrent Neural Network (QRNN) 822 followed by a 1×1 convolutional layer 824.
  • transformed acoustic features may be obtained, which may have the same dimension and a similar dynamic range of values as the transformed speaker embedding vector.
  • the transformed acoustic features and the transformed speaker embedding vector may be combined together, and further provided to the neural vocoder 800.
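A minimal sketch of this dimension-matching step is given below. A GRU is substituted for the QRNN named above (QRNN is not part of standard PyTorch), the two transformed tensors are combined by addition (the disclosure only says they are combined, so this is one possible choice), and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class VocoderCondition(nn.Module):
    """Brings acoustic features and a speaker embedding to a shared dimension and combines them."""

    def __init__(self, n_mels: int = 80, speaker_dim: int = 256, cond_dim: int = 128):
        super().__init__()
        # stand-in for neural network 820: recurrent layer (GRU instead of QRNN 822) + 1x1 conv 824
        self.rnn = nn.GRU(n_mels, cond_dim, batch_first=True, bidirectional=True)
        self.feat_proj = nn.Conv1d(2 * cond_dim, cond_dim, kernel_size=1)
        # stand-in for neural network 810: 1x1 convolution 812 over the speaker embedding
        self.spk_proj = nn.Conv1d(speaker_dim, cond_dim, kernel_size=1)

    def forward(self, acoustic_features: torch.Tensor, speaker_embedding: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(acoustic_features)                    # (batch, frames, 2*cond_dim)
        feats = self.feat_proj(h.transpose(1, 2))             # (batch, cond_dim, frames)
        spk = self.spk_proj(speaker_embedding.unsqueeze(-1))  # (batch, cond_dim, 1)
        return feats + spk                                    # combined condition for the vocoder

# usage: 200 frames of 80-dim features + a 256-dim embedding -> (1, 128, 200) conditioning tensor
condition = VocoderCondition()(torch.randn(1, 200, 80), torch.randn(1, 256))
```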
  • the neural vocoder 800 may be based on a neural generative model, and may generate a speech waveform 806 based on the combination of the transformed acoustic features and the transformed speaker embedding vector.
  • the neural vocoder 800 may comprise a plurality of dilated convolutional layers 830 which are grouped into a certain number of cycles.
  • the plurality of dilated convolutional layers 830 may take the combination of the transformed acoustic features and the transformed speaker embedding vector as a condition.
  • Skip connection 832 may be performed on outputs by the plurality of dilated convolutional layers 830.
  • the neural vocoder 800 may further sequentially comprise a Rectified Linear Unit (ReLU) 834, a 1×1 convolutional layer 836, a ReLU 838, a 1×1 convolutional layer 840, a plurality of feed-forward layers 842 and a mixture-of-logistics (MoL) unit 844.
  • the structure of the neural vocoder 800 in FIG. 8 is exemplary, and depending on specific application designs and requirements, the neural vocoder 800 may be implemented in any other approaches.
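A compressed PyTorch sketch of a vocoder body along these lines is shown below: gated dilated convolutions conditioned on the combined input, residual and skip connections, then the ReLU/1×1 layers. The mixture-of-logistics output head, causal padding and the autoregressive sampling loop are omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """One conditioned, gated dilated convolution with residual and skip outputs."""

    def __init__(self, channels: int, cond_dim: int, dilation: int):
        super().__init__()
        self.filter_gate = nn.Conv1d(channels, 2 * channels, kernel_size=2,
                                     dilation=dilation, padding=dilation)
        self.cond = nn.Conv1d(cond_dim, 2 * channels, kernel_size=1)
        self.residual = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor, condition: torch.Tensor):
        h = self.filter_gate(x)[..., : x.size(-1)] + self.cond(condition)
        f, g = h.chunk(2, dim=1)
        out = torch.tanh(f) * torch.sigmoid(g)           # gated activation
        return x + self.residual(out), self.skip(out)    # residual path, skip path

class NeuralVocoderBody(nn.Module):
    """Dilated conv layers 830 + skip connection 832 + ReLU/1x1 layers 834-840 (output head omitted)."""

    def __init__(self, channels: int = 64, cond_dim: int = 128, cycles: int = 2, layers_per_cycle: int = 6):
        super().__init__()
        self.input_conv = nn.Conv1d(1, channels, kernel_size=1)
        self.blocks = nn.ModuleList([
            DilatedBlock(channels, cond_dim, dilation=2 ** i)
            for _ in range(cycles) for i in range(layers_per_cycle)
        ])
        self.post = nn.Sequential(nn.ReLU(), nn.Conv1d(channels, channels, 1),
                                  nn.ReLU(), nn.Conv1d(channels, channels, 1))

    def forward(self, waveform: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        x = self.input_conv(waveform.unsqueeze(1))        # (batch, channels, samples)
        skips = torch.zeros_like(x)
        for block in self.blocks:
            x, s = block(x, condition)
            skips = skips + s
        return self.post(skips)                           # would feed feed-forward layers 842 / MoL 844

# usage: 16000 samples with a per-sample (upsampled) conditioning tensor
out = NeuralVocoderBody()(torch.randn(1, 16000), torch.randn(1, 128, 16000))
```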
  • FIG. 9 illustrates an exemplary process for training a multi-speaker neural TTS system according to an embodiment.
  • a multi-speaker corpus set 920 for training may be prepared.
  • the corpus set 920 may comprise a plurality of corpuses of a plurality of speakers, e.g., corpus 1 of speaker 1, corpus 2 of speaker 2, etc.
  • the corpus set is prepared in consideration of content coverage, speaker variety, style variety, etc.
  • the corpus set 920 has wide content coverage in various knowledge domains.
  • content may refer to information expressed by speech waveforms in a corpus, and content coverage may be assessed with different linguistic contexts, e.g., phoneme, triphone, syllable, etc.
  • speech waveforms in the corpus set 920 may come from various speech sources, e.g., news reporting, lectures, movie dubbings, daily conversations, etc. Accordingly, the collection of corpuses of different speakers in the corpus set 920 may provide rich content or linguistic coverage.
  • speakers in the corpus set 920 may have a balanced distribution in terms of age, gender, accents, etc.
  • the speakers may cover different age ranges, e.g., middle-aged people, old people, young people, etc.
  • the speakers may cover different genders, e.g., male and female.
  • the speakers may have different accents, e.g., American accent, British accent, Australian accent, etc.
  • the speaker variety in the corpus set 920 may help to capture characteristics of different kinds of speakers and generate expressive speeches.
  • styles of speech may be included in the corpus set 920.
  • style of speech may refer to expression manner of a speaker, e.g., telling a story, giving a lecture, daily chatting, etc.
  • the styles of speech may be also associated with requirements of products or users.
  • the corpus set 920 may also be prepared in consideration of any other aspects for improving its variety and richness.
  • the training corpus set 920 prepared as discussed above may help enable the multi-speaker neural TTS system 910 to leverage content from different speakers, generate close-to-human speech for out-of-domain text inputs, enrich premium voices, create new speech for a target speaker having only a small corpus, etc.
  • the multi-speaker neural TTS system 910 may comprise at least one speaker model 912, an acoustic feature predictor 914 and a neural vocoder 916.
  • the architecture of the multi-speaker neural TTS system 910 may be the same as the multi-speaker neural TTS system 200 in FIG. 2, and may be specifically implemented in either the structure in FIG. 5 or the structure in FIG. 6.
  • if the at least one speaker model 912 is a single speaker model, it may connect to both the acoustic feature predictor 914 and the neural vocoder 916, while if the at least one speaker model 912 comprises two separate speaker models, one speaker model may connect to the acoustic feature predictor 914, and another speaker model may connect to the neural vocoder 916.
  • Training data for any one or any combination of the at least one speaker model 912, the acoustic feature predictor 914 and the neural vocoder 916 may be obtained based on the speech waveforms in the corpus set 920.
  • various derived information may be obtained from the speech waveforms, e.g., text information obtained through applying any existing speech recognition techniques, acoustic features obtained through applying any existing acoustic feature extracting techniques, speaker embedding vectors obtained through applying any existing speaker recognition techniques, etc.
  • the derived information together with the speech waveforms in the corpus set 920 may form various training data for any one or any combination of the at least one speaker model 912, the acoustic feature predictor 914 and the neural vocoder 916.
  • training data consisting of pairs of text input and acoustic features may be used for training the acoustic feature predictor 914
  • training data consisting of pairs of acoustic features and speech waveform may be used for training the neural vocoder 916
  • training data consisting of collections of text input, acoustic features and speech waveform may be used for training the at least one speaker model 912, the acoustic feature predictor 914 and the neural vocoder 916 jointly, and so on.
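As an illustration of forming such pairs, the sketch below derives an (acoustic feature, waveform) training pair from one speech file using mel-spectrograms extracted with librosa; the choice of feature, library and parameter values is an assumption, since the disclosure names none of them.

```python
import librosa
import numpy as np

def make_vocoder_training_pair(wav_path: str, sr: int = 16000, n_mels: int = 80):
    """Returns (acoustic_features, waveform) for one utterance (sketch)."""
    waveform, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None)).T  # (frames, n_mels)
    return log_mel, waveform

# Pairs of (text, log_mel) built from the transcripts of the same corpus would train the
# acoustic feature predictor, and collections of (text, log_mel, waveform) support joint training.
```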
  • the speaker model, the acoustic feature predictor 914 and the neural vocoder 916 may be trained separately based on the corpus set 920.
  • the speaker model, the acoustic feature predictor 914 and the neural vocoder 916 may also be trained jointly based on the corpus set 920.
  • the acoustic feature predictor 914 and its corresponding speaker model may be trained jointly based on the corpus set 920, and the neural vocoder 916 and its corresponding speaker model may be trained jointly based on the corpus set 920.
  • all of the two speaker models, the acoustic feature predictor 914 and the neural vocoder 916 may also be trained jointly based on the corpus set 920.
  • FIG. 10 illustrates an exemplary process for updating a multi-speaker neural TTS system according to an embodiment.
  • the multi-speaker neural TTS system may be further updated, e.g., retrained, by a subset of training data pairs in a multi-speaker corpus set which has been used for training the multi-speaker neural TTS system.
  • the subset for updating may be one or more corpuses of one or more speakers in the corpus set.
  • the multi-speaker neural TTS system may be further refined by the corpus set.
  • a multi-speaker neural TTS system 1010 may be updated by a corpus set 1020 which has been used for training the multi-speaker neural TTS system 1010.
  • the multi-speaker neural TTS system 1010 may correspond to the multi- speaker neural TTS system 910 in FIG. 9, and the corpus set 1020 may correspond to the corpus set 920 in FIG. 9.
  • a corpus m of a speaker m existing in the corpus set 1020 may be extracted from the corpus set 1020. Speech waveforms in the corpus m may be used for forming training data for further updating the multi-speaker neural TTS system 1010.
  • the training data may be formed from the corpus m in a similar approach as discussed above in connection with FIG. 9, and at least one speaker model 1012, an acoustic feature predictor 1014 and a neural vocoder 1016 in the multi-speaker neural TTS system 1010 may be updated in a similar way with the training procedure of the multi-speaker neural TTS system 910 as discussed above in connection with FIG. 9.
  • any one or any combination of the at least one speaker model 1012, the acoustic feature predictor 1014 and the neural vocoder 1016 may be retrained based on the training data.
  • the generalization ability of the multi-speaker neural TTS system 1010 may be further improved.
  • the speaker m may also be deemed as a target speaker.
  • a different number of corpuses in the corpus set may be used for updating or retraining the multi-speaker neural TTS system.
  • FIG. 11 illustrates an exemplary process for updating a multi-speaker neural TTS system according to an embodiment.
  • the multi-speaker neural TTS system may be updated by a corpus of the target speaker, so as to adapt the multi-speaker neural TTS system to the target speaker.
  • a multi-speaker neural TTS system 1110 may be used for generating speech for a target speaker, e.g., a new speaker 1102 which does not exist in the corpus set used for training the multi-speaker neural TTS system 1110.
  • the multi-speaker neural TTS system 1110 may correspond to the multi-speaker neural TTS system 910 in FIG. 9 or the multi-speaker neural TTS system 1010 in FIG. 10.
  • a corpus 1104 of the new speaker 1102 may be obtained and used for forming training data for further updating the multi-speaker neural TTS system 1110.
  • the training data may be formed from the corpus 1104 of the new speaker 1102 in a similar approach as discussed above in connection with FIG. 9, and at least one speaker model 1112, an acoustic feature predictor 1114 and a neural vocoder 1116 in the multi-speaker neural TTS system 1110 may be updated in a similar way with the training procedure of the multi-speaker neural TTS system 910 as discussed above in connection with FIG. 9.
  • any one or any combination of the at least one speaker model 1112, the acoustic feature predictor 1114 and the neural vocoder 1116 may be updated or retrained based on the training data.
  • the multi-speaker neural TTS system 1110 may be better adapted to the new speaker, and accordingly generate high quality speech waveforms with high speaker similarity, e.g., better simulating the new speaker’s voice.
  • FIG. 12 illustrates an exemplary processing flow 1200 for generating a speech waveform according to an embodiment.
  • the processing flow 1200 may be performed by a multi-speaker neural TTS system for generating a speech waveform corresponding to a text input in a target speaker’s voice.
  • a text input 1204 may be received.
  • the processing flow 1200 may further determine whether a target speaker 1202 is a new speaker.
  • if the target speaker 1202 is not a new speaker, a speaker embedding vector corresponding to the target speaker may be selected at 1208 from a speaker embedding vector database through, e.g., a speaker embedding selector in a speaker model in the multi-speaker neural TTS system.
  • if the target speaker 1202 is a new speaker, a corpus of the target speaker 1202 may be obtained at 1210.
  • the corpus of the target speaker 1202 may be used for updating the multi-speaker neural TTS system according to, e.g., the updating process in FIG. 11.
  • at 1214, a speaker embedding vector of the target speaker 1202 may be generated through, e.g., a speaker encoder in the speaker model in the multi-speaker neural TTS system.
  • at 1216, acoustic features may be predicted through, e.g., an acoustic feature predictor in the multi-speaker neural TTS system based on the text input 1204 and the speaker embedding vector provided by the step 1208 or 1214.
  • at 1218, a speech waveform corresponding to the text input 1204 may be generated through, e.g., a neural vocoder in the multi-speaker neural TTS system based on the acoustic features and the speaker embedding vector.
  • the predicting step at 1216 and the generating step at 1218 may utilize the same speaker embedding vector or different speaker embedding vectors. For example, if a single speaker model connects to both the acoustic feature predictor and the neural vocoder, the speaker embedding vector utilized by the predicting step at 1216 and the generating step at 1218 may be the same. While if two different speaker models connect to the acoustic feature predictor and the neural vocoder respectively, the speaker embedding vector utilized by the predicting step at 1216 may be different from that utilized by the generating step at 1218.
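A minimal sketch of the branching in this flow follows; the helper objects (selector, speaker_encoder, tts_system, corpus_loader) are hypothetical placeholders for the components described above, not interfaces defined by the disclosure.

```python
import numpy as np

def get_speaker_embedding(target_speaker_id: str,
                          selector,         # database lookup, as in step 1208
                          speaker_encoder,  # embedding from acoustic features, as in step 1214
                          tts_system,       # exposes an adapt() hook, as in step 1212
                          corpus_loader) -> np.ndarray:
    """Sketch of the new-speaker decision (steps 1206-1214)."""
    try:
        # existing speaker: select its stored embedding from the database
        return selector.select(target_speaker_id)
    except KeyError:
        # new speaker: obtain a corpus, adapt/update the system, then generate an embedding
        corpus_features = corpus_loader(target_speaker_id)
        tts_system.adapt(corpus_features)
        return speaker_encoder(corpus_features)
```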
  • any steps and step orders in the processing flow 1200 may be adjusted, omitted or replaced according to the embodiments of the present disclosure. Any additional steps may also be added into the processing flow 1200.
  • if the target speaker is not a new speaker, e.g., the target speaker 1202 is an existing speaker in the training corpus set, the multi-speaker neural TTS system may also be updated based on a corpus of the target speaker in the corpus set according to, e.g., the updating process in FIG. 10.
  • FIG. 13 illustrates an exemplary architecture of a multi-speaker neural TTS system 1300 according to an embodiment.
  • the multi-speaker neural TTS system 1300 may comprise a speaker information extractor 1310, an acoustic feature predictor 1320 and a neural vocoder 1330, wherein the speaker information extractor 1310 may comprise or implement at least one speaker model 1312.
  • the speaker model 1312, the acoustic feature predictor 1320 and the neural vocoder 1330 may correspond to the speaker model 230, the acoustic feature predictor 210 and the neural vocoder 220 in FIG. 2 respectively.
  • the speaker information extractor 1310, the acoustic feature predictor 1320 and the neural vocoder 1330 may be implemented in a hardware level, e.g., implemented by respective hardware units.
  • a hardware unit may refer to one or more processors, one or more chips, one or more central processing units (CPUs), one or more graphics processing units (GPUs), etc., or any combination thereof.
  • the speaker information extractor 1310, the acoustic feature predictor 1320 and the neural vocoder 1330 may also be implemented in a software level, e.g., implemented by computer programs, processor instructions, etc.
  • the multi-speaker neural TTS system 1300 is not limited to any specific implementation approaches.
  • the speaker information extractor 1310 may be configured for providing speaker latent space information of a target speaker through the at least one speaker model 1312.
  • the acoustic feature predictor 1320 may be configured for predicting at least one acoustic feature based on a text input and the speaker latent space information.
  • the neural vocoder 1330 may be configured for generating a speech waveform corresponding to the text input based on the at least one acoustic feature and the speaker latent space information.
  • the at least one speaker model 1312 may comprise: a first speaker model, configured for providing first speaker latent space information in the speaker latent space information; and a second speaker model, configured for providing second speaker latent space information in the speaker latent space information.
  • the acoustic feature predictor 1320 may be configured for predicting the at least one acoustic feature based on the text input and the first speaker latent space information.
  • the neural vocoder 1330 may be configured for generating the speech waveform based on the at least one acoustic feature and the second speaker latent space information.
  • At least one of the at least one speaker model 1312, the acoustic feature predictor 1320 and the neural vocoder 1330 was pre-trained separately based on a plurality of corpuses of a plurality of speakers in a corpus set, and/or any two or more of the at least one speaker model 1312, the acoustic feature predictor 1320 and the neural vocoder 1330 were pre-trained jointly based on the plurality of corpuses of the plurality of speakers.
  • the at least one speaker model 1312 comprises a first speaker model and a second speaker model
  • the first speaker model and the acoustic feature predictor were pre-trained jointly based on a plurality of corpuses of a plurality of speakers in a corpus set
  • the second speaker model and the neural vocoder were pre-trained jointly based on the plurality of corpuses of the plurality of speakers.
  • At least one of the at least one speaker model 1312, the acoustic feature predictor 1320 and the neural vocoder 1330 may be updated separately based on a corpus of the target speaker, and/or any two or more of the at least one speaker model 1312, the acoustic feature predictor 1320 and the neural vocoder 1330 may be updated jointly based on the corpus of the target speaker.
  • the at least one speaker model 1312 comprises a first speaker model and a second speaker model
  • the first speaker model and the acoustic feature predictor may be updated jointly based on a corpus of the target speaker
  • the second speaker model and the neural vocoder may be updated jointly based on the corpus of the target speaker.
  • the speaker information extractor 1310, the acoustic feature predictor 1320 and the neural vocoder 1330 may also be configured for performing any other processes or operations according to the embodiments of the present disclosure as mentioned above.
  • FIG. 14 illustrates a flowchart of an exemplary method 1400 for generating speech through multi-speaker neural TTS synthesis according to an embodiment.
  • At 1410, a text input may be received.
  • At 1420, speaker latent space information of a target speaker may be provided through at least one speaker model.
  • At 1430, at least one acoustic feature may be predicted through an acoustic feature predictor based on the text input and the speaker latent space information.
  • At 1440, a speech waveform corresponding to the text input may be generated through a neural vocoder based on the at least one acoustic feature and the speaker latent space information.
  • the at least one speaker model may comprise a single speaker model.
  • the at least one speaker model may comprise a first speaker model and a second speaker model.
  • the providing may comprise: providing first speaker latent space information through the first speaker model; and providing second speaker latent space information through the second speaker model.
  • the predicting may comprise: predicting the at least one acoustic feature based on the text input and the first speaker latent space information.
  • the generating may comprise: generating the speech waveform based on the at least one acoustic feature and the second speaker latent space information.
  • the providing may comprise: generating a speaker embedding vector of the target speaker based on a corpus of the target speaker; or selecting a speaker embedding vector of the target speaker from a speaker embedding vector database.
  • the method 1400 may further comprise: generating at least one transformed acoustic feature based on the at least one acoustic feature through a first neural network; and generating transformed speaker latent space information based on the speaker latent space information through a second neural network.
  • the generating the speech waveform may comprise: generating the speech waveform based on a combination of the at least one transformed acoustic feature and the transformed speaker latent space information.
  • the method 1400 may further comprise: updating at least one of the at least one speaker model, the acoustic feature predictor and the neural vocoder separately based on a corpus of the target speaker; and/or updating any two or more of the at least one speaker model, the acoustic feature predictor and the neural vocoder jointly based on the corpus of the target speaker.
  • the method 1400 may further comprise: updating the first speaker model and the acoustic feature predictor jointly based on a corpus of the target speaker; and/or updating the second speaker model and the neural vocoder jointly based on the corpus of the target speaker.
  • At least one of the at least one speaker model, the acoustic feature predictor and the neural vocoder was pre-trained separately based on a plurality of corpuses of a plurality of speakers, and/or any two or more of the at least one speaker model, the acoustic feature predictor and the neural vocoder were pre-trained jointly based on the plurality of corpuses of the plurality of speakers.
  • the at least one speaker model comprises a first speaker model and a second speaker model
  • the first speaker model and the acoustic feature predictor were pre-trained jointly based on a plurality of corpuses of a plurality of speakers
  • the second speaker model and the neural vocoder were pre-trained jointly based on the plurality of corpuses of the plurality of speakers.
  • the plurality of corpuses are prepared based on at least one of content coverage, speaker variety and style variety.
  • the method 1400 may further comprise any steps/processes for generating speech through multi-speaker neural TTS synthesis according to the embodiments of the present disclosure as mentioned above.
  • FIG. 15 illustrates an exemplary apparatus 1500 for generating speech through multi-speaker neural TTS synthesis according to an embodiment.
  • the apparatus 1500 may comprise: a text input receiving module 1510, for receiving a text input; a speaker latent space information providing module 1520, for providing, through at least one speaker model, speaker latent space information of a target speaker; an acoustic feature predicting module 1530, for predicting, through an acoustic feature predictor, at least one acoustic feature based on the text input and the speaker latent space information; and a speech waveform generating module 1540, for generating, through a neural vocoder, a speech waveform corresponding to the text input based on the at least one acoustic feature and the speaker latent space information.
  • the apparatus 1500 may also comprise any other modules configured for generating speech through multi-speaker neural TTS synthesis according to the embodiments of the present disclosure as mentioned above.
  • FIG. 16 illustrates an exemplary apparatus 1600 for generating speech through multi-speaker neural TTS synthesis according to an embodiment.
  • the apparatus 1600 may comprise at least one processor 1610 and a memory 1620 storing computer-executable instructions.
  • the at least one processor 1610 may: receive a text input; provide, through at least one speaker model, speaker latent space information of a target speaker; predict, through an acoustic feature predictor, at least one acoustic feature based on the text input and the speaker latent space information; and generate, through a neural vocoder, a speech waveform corresponding to the text input based on the at least one acoustic feature and the speaker latent space information.
  • the at least one processor 1610 may be further configured for performing any operations of the methods for generating speech through multi-speaker neural TTS synthesis according to the embodiments of the present disclosure as mentioned above.
  • the embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for generating speech through multi-speaker neural TTS synthesis according to the embodiments of the present disclosure as mentioned above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip) , an optical disk, a smart card, a flash memory device, random access memory (RAM) , read only memory (ROM) , programmable ROM (PROM) , erasable PROM (EPROM) , electrically erasable PROM (EEPROM) , a register, or a removable disk.

Abstract

The present invention relates to a method for generating speech through multi-speaker neural text-to-speech (TTS) synthesis. A text input may be received (1410). Speaker latent space information of a target speaker may be provided through at least one speaker model (1420). At least one acoustic feature may be predicted through an acoustic feature predictor based on the text input and the speaker latent space information (1430). A speech waveform corresponding to the text input may be generated through a neural vocoder based on the at least one acoustic feature and the speaker latent space information (1440).
PCT/CN2018/120300 2018-12-11 2018-12-11 Multi-speaker neural text-to-speech synthesis WO2020118521A1 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201880091361.8A CN111954903B (zh) 2018-12-11 2018-12-11 Multi-speaker neural text-to-speech synthesis
PCT/CN2018/120300 WO2020118521A1 (fr) 2018-12-11 2018-12-11 Multi-speaker neural text-to-speech synthesis
US17/293,640 US20220013106A1 (en) 2018-12-11 2018-12-11 Multi-speaker neural text-to-speech synthesis
EP18942805.5A EP3895159A4 (fr) 2018-12-11 2018-12-11 Multi-speaker neural text-to-speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/120300 WO2020118521A1 (fr) 2018-12-11 2018-12-11 Multi-speaker neural text-to-speech synthesis

Publications (1)

Publication Number Publication Date
WO2020118521A1 true WO2020118521A1 (fr) 2020-06-18

Family

ID=71075391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/120300 WO2020118521A1 (fr) 2018-12-11 2018-12-11 Multi-speaker neural text-to-speech synthesis

Country Status (4)

Country Link
US (1) US20220013106A1 (fr)
EP (1) EP3895159A4 (fr)
CN (1) CN111954903B (fr)
WO (1) WO2020118521A1 (fr)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10199051B2 (en) 2013-02-07 2019-02-05 Apple Inc. Voice trigger for a digital assistant
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
DK201770428A1 (en) 2017-05-12 2019-02-18 Apple Inc. LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11468890B2 (en) 2019-06-01 2022-10-11 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11687829B2 (en) * 2020-04-28 2023-06-27 Optum Services (Ireland) Limited Artificial intelligence recommendation system
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11508380B2 (en) * 2020-05-26 2022-11-22 Apple Inc. Personalized voices for text messaging
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
US11335324B2 (en) * 2020-08-31 2022-05-17 Google Llc Synthesized data augmentation using voice conversion and speech recognition models
US20220180886A1 (en) * 2020-12-08 2022-06-09 Fuliang Weng Methods for clear call under noisy conditions
CN112349269A (zh) * 2020-12-11 2021-02-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN112786006A (zh) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium and device
CN112967728B (zh) * 2021-05-19 2021-07-30 北京世纪好未来教育科技有限公司 End-to-end speech synthesis method and apparatus combining acoustic transfer functions

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102203853B (zh) * 2010-01-04 2013-02-27 株式会社东芝 Method and apparatus for synthesizing speech
US10438581B2 (en) * 2013-07-31 2019-10-08 Google Llc Speech recognition using neural networks
US9824681B2 (en) * 2014-09-11 2017-11-21 Microsoft Technology Licensing, Llc Text-to-speech with emotional content
EP3151239A1 (fr) * 2015-09-29 2017-04-05 Yandex Europe AG Procedes et systemes pour la synthese de texte en discours
CN105206258B (zh) * 2015-10-19 2018-05-04 百度在线网络技术(北京)有限公司 Acoustic model generation method and apparatus, and speech synthesis method and apparatus
US10249289B2 (en) * 2017-03-14 2019-04-02 Google Llc Text-to-speech synthesis using an autoencoder
CN107103900B (zh) * 2017-06-06 2020-03-31 西北师范大学 Cross-lingual emotional speech synthesis method and system
US10796686B2 (en) * 2017-10-19 2020-10-06 Baidu Usa Llc Systems and methods for neural text-to-speech using convolutional sequence learning
CN108986793A (zh) * 2018-09-28 2018-12-11 北京百度网讯科技有限公司 Translation processing method, apparatus and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1835074A (zh) * 2006-04-07 2006-09-20 安徽中科大讯飞信息科技有限公司 Speaker conversion method combining high-level description information and model adaptation
CN103021418A (zh) * 2012-12-13 2013-04-03 南京邮电大学 Voice conversion method oriented to multi-time-scale prosodic features
US9384728B2 (en) * 2014-09-30 2016-07-05 International Business Machines Corporation Synthesizing an aggregate voice
KR101665882B1 (ko) * 2015-08-20 2016-10-13 한국과학기술원 Speech synthesis technique and apparatus using timbre conversion and voice DNA
US20180336880A1 (en) 2017-05-19 2018-11-22 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3895159A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2775821C2 (ru) * 2020-09-15 2022-07-11 Общество С Ограниченной Ответственностью «Яндекс» Method and server for converting text into speech
CN112652292A (zh) * 2020-11-13 2021-04-13 北京有竹居网络技术有限公司 用于生成音频的方法、装置、设备和介质

Also Published As

Publication number Publication date
EP3895159A1 (fr) 2021-10-20
US20220013106A1 (en) 2022-01-13
CN111954903B (zh) 2024-03-15
EP3895159A4 (fr) 2022-06-29
CN111954903A (zh) 2020-11-17

Similar Documents

Publication Publication Date Title
WO2020118521A1 (fr) Multi-speaker neural text-to-speech synthesis
US11769483B2 (en) Multilingual text-to-speech synthesis
US11837216B2 (en) Speech recognition using unspoken text and speech synthesis
US20230064749A1 (en) Two-Level Speech Prosody Transfer
EP3994683B1 (fr) Multilingual neural text-to-speech synthesis
US20220277728A1 (en) Paragraph synthesis with cross utterance features for neural TTS
US20230169953A1 (en) Phrase-based end-to-end text-to-speech (tts) synthesis
JP2024505076A (ja) Generating diverse and natural text-to-speech samples
CN113761841B (zh) Method for converting text data into acoustic features
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Chen et al. The USTC System for Voice Conversion Challenge 2016: Neural Network Based Approaches for Spectrum, Aperiodicity and F0 Conversion.
Li et al. End-to-end mongolian text-to-speech system
Liu et al. Controllable accented text-to-speech synthesis
Zhang et al. Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis
US20230018384A1 (en) Two-Level Text-To-Speech Systems Using Synthetic Training Data
US20220068256A1 (en) Building a Text-to-Speech System from a Small Amount of Speech Data
Zhang et al. Zero-shot multi-speaker accent TTS with limited accent data
Kim et al. SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18942805

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018942805

Country of ref document: EP

Effective date: 20210712