WO2021231050A1 - Automatic audio content generation - Google Patents

Automatic audio content generation

Info

Publication number
WO2021231050A1
WO2021231050A1 (PCT/US2021/028297)
Authority
WO
WIPO (PCT)
Prior art keywords
text
context
character
model
tts
Prior art date
Application number
PCT/US2021/028297
Other languages
French (fr)
Inventor
Xi Wang
Shaofei ZHANG
Yujia XIAO
Yueying Liu
Lei He
Original Assignee
Microsoft Technology Licensing, Llc
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc filed Critical Microsoft Technology Licensing, Llc
Publication of WO2021231050A1 publication Critical patent/WO2021231050A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation

Definitions

  • Text-to-Speech (TTS) synthesis aims at generating a corresponding speech waveform based on a text input.
  • a conventional TTS model or system may predict acoustic features based on a textual input and further generate a speech waveform based on the predicted acoustic features.
  • the TTS model may be applied to convert various types of text content into audio content, e.g., to convert books in text format into audiobooks, etc.
  • Embodiments of the present disclosure propose a method and apparatus for automatic audio content generation.
  • a text may be obtained.
  • Context corresponding to the text may be constructed.
  • Reference factors may be determined based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text.
  • a speech waveform corresponding to the text may be generated based at least on the text and the reference factors.
  • FIG.1 illustrates an exemplary conventional TTS model.
  • FIG.2 illustrates an exemplary process for automatic audio content generation according to an embodiment.
  • FIG.3 illustrates an exemplary process for automatic audio content generation according to an embodiment.
  • FIG.4 illustrates an exemplary process for preparing training data according to an embodiment.
  • FIG.5 illustrates an exemplary process for predicting a character category and a style according to an embodiment.
  • FIG.6 illustrates an exemplary implementation of speech synthesis employing a linguistic feature-based TTS model according to an embodiment.
  • FIG.7 illustrates an exemplary implementation of an encoder in a linguistic feature-based TTS model according to an embodiment.
  • FIG.8 illustrates an exemplary implementation of speech synthesis employing a context-based TTS model according to an embodiment.
  • FIG.9 illustrates an exemplary implementation of a context encoder in a context-based TTS model.
  • FIG.10 illustrates an exemplary process for predicting a character and selecting a TTS model according to an embodiment.
  • FIG.11 illustrates an exemplary implementation of speech synthesis employing a linguistic feature-based TTS model according to an embodiment.
  • FIG.12 illustrates an exemplary implementation of speech synthesis employing a context-based TTS model according to an embodiment.
  • FIG.13 illustrates an exemplary process for updating audio content according to an embodiment.
  • FIG.14 illustrates a flowchart of an exemplary method for automatic audio content generation according to an embodiment.
  • FIG.15 illustrates an exemplary apparatus for automatic audio content generation according to an embodiment.
  • FIG.16 illustrates an exemplary apparatus for automatic audio content generation according to an embodiment.
  • Audiobooks are increasingly used for entertainment and education. Traditional audiobooks are recorded manually. For example, a professional narrator or voice actor reads previously prepared text content aloud, and an audiobook corresponding to the text content is obtained by recording the narrator's narration. This approach of recording an audiobook is very time-consuming and costly, and further results in corresponding audiobooks not being available in time for a large number of text books.
  • TTS synthesis may improve the efficiency of audiobook generation and reduce costs.
  • Most TTS models synthesize speech separately for each text sentence.
  • the speech synthesized in this way usually has, e.g., a single utterance prosody, and thus sounds monotonous and boring.
  • If this single utterance prosody is repeatedly applied to entire chapters and paragraphs of an audiobook, it will significantly reduce the quality of the audiobook.
  • the monotonous speech expression will further reduce the appeal of the audiobook.
  • Embodiments of the present disclosure propose to perform automatic and high-quality audio content generation for text content.
  • text content may broadly refer to any content in text form, e.g., a book, a script, an article, etc.
  • audio content may broadly refer to any content in audio form, e.g., an audiobook, a video dubbing, news broadcast, etc.
  • Although conversion of a text story book into an audiobook is taken as an example in the various sections discussed below, it should be appreciated that the embodiments of the present disclosure may also be applied to convert text content in any other form into audio content in any other form.
  • the embodiments of the present disclosure may construct a context for a text sentence in text content, and use the context for TTS synthesis of the text sentence, instead of only considering the text sentence itself in the TTS synthesis.
  • context of a text sentence may provide rich expression information about the text sentence, which may be used as reference factors for TTS synthesis, so that the synthesized speech will be more expressive, more vivid, and more diverse.
  • Various reference factors may be determined based on the context, e.g., character, character category, style, character personality, etc. corresponding to the text sentence.
  • a character may refer to a specific person, an anthropomorphic animal, an anthropomorphic object, etc., that appears in text content and has the ability to talk.
  • a character category may refer to a category attribute of a character, e.g., gender, age, etc.
  • a style may refer to an emotion type corresponding to a text sentence, e.g., happy, sad, etc.
  • a character personality may refer to a personality created for a character in text content, e.g., gentle, cheerful, evil, etc.
  • speech with different speech characteristics may be synthesized for different characters, character categories, styles, etc., respectively, e.g., speech with different timbres, voice styles, etc.
  • the expressiveness, vividness, and diversity of the synthesized speech may be enhanced, thereby significantly improving the quality of the synthesized speech.
  • the embodiments of the present disclosure may be applied to a scenario where audio content is synthesized with a voice of a single speaker, which may also be referred to as single-narrator audio content generation.
  • the single speaker may be a pre-designated target speaker, and the voice of the target speaker may be used to simulate or play different types of characters in text content.
  • a character category may be considered in the speech synthesis, so that speech corresponding to different character categories may be generated with the voice of the target speaker, e.g., speech corresponding to young men, speech corresponding to old women, etc.
  • a style may also be considered in the speech synthesis, so that different styles of speech may be generated with the voice of the target speaker, e.g., speech corresponding to an emotion type "happy", speech corresponding to an emotion type "sad", etc.
  • the embodiments of the present disclosure may be applied to a scenario where audio content is synthesized with voices of multiple speakers, which may also be referred to as multi-narrator audio content generation.
  • the voices of different speakers may be used for different characters in text content.
  • These speakers may be predetermined candidate speakers with different attributes.
  • For a specific character, which speaker's voice to use may be determined with reference to at least a character category, a character personality, etc. For example, assuming that the character Mike is a cheerful young male, the voice of a speaker with attributes such as <young>, <male>, and <cheerful>, etc., may be selected to generate Mike's speech.
  • a style may also be considered in the speech synthesis, so that different styles of speech may be generated with a voice of a speaker corresponding to a character, thereby enhancing the expressiveness and vividness, etc., of the synthesized speech.
  • a linguistic feature-based TTS model may be employed.
  • a linguistic feature-based TTS model may be trained with a voice corpus of a target speaker, wherein the model may at least consider a character category, an optional style, etc. to generate speech.
  • different versions of a linguistic feature-based TTS model may be trained with voice corpuses of different candidate speakers, and a corresponding version of the model may be selected for a specific character to generate speech for the character, or furthermore, different styles of speech for the character may be generated through considering styles.
  • a context-based TTS model may be employed.
  • a context-based TTS model may be trained with a voice corpus of a target speaker, wherein the model may at least consider a context of a text sentence, character category, optional style, etc. to generate speech.
  • different versions of a context-based TTS model may be trained with voice corpuses of different candidate speakers, and a corresponding version of the model may be selected for a specific character to generate speech for the character, or furthermore, different styles of speech for the character may be generated through considering styles.
  • the embodiments of the present disclosure also provide a flexible customization mechanism for audio content. For example, users may adjust or customize audio content through a visual customization platform. Various parameters involved in speech synthesis may be modified or set to adjust any part of audio content, so that a specific utterance in the audio content may have a desired character category, desired style, etc. Since the linguistic feature-based TTS model has explicit feature input, it may be used to update the audio content in response to an adjustment indication of a user.
  • the embodiments of the present disclosure may flexibly use a linguistic feature-based TTS model and/or a context-based TTS model to automatically generate high-quality audio content.
  • the linguistic feature-based TTS model may generate high-quality speech through considering reference factors determined based on a context, and may be used to adjust or update the generated audio content.
  • the context-based TTS model considers not only the reference factors determined based on the context in the speech synthesis, but also context features extracted from the context itself, thereby resulting in more coordinated speech synthesis for long text.
  • utterances of a character will have stronger expressiveness, vividness and diversity, so that appeal and interestingness, etc., of the audio content may be significantly improved.
  • Automatic audio content generation is fast and low-cost.
  • Since the embodiments of the present disclosure convert text content into high-quality audio content in a fully automatic manner, the barriers to audio content creation will be further reduced, so that not only professional voice actors but also ordinary users may easily and quickly create their own unique audio content.
  • FIG.1 illustrates an exemplary conventional TTS model 100.
  • the TTS model 100 may be configured to receive a text 102, and generate a speech waveform 108 corresponding to the text 102.
  • the text 102 may also be referred to as a text sentence, which may comprise one or more words, phrases, sentences, paragraphs, etc., and herein, the terms "text" and "text sentence" may be used interchangeably.
  • the text 102 may be first converted into an element sequence, such as a phone sequence, a grapheme sequence, a character sequence, etc. which is provided to the TTS model 100 as input.
  • an input "text” may broadly refer to a word, a phrase, a sentence, etc. included in the text, or an element sequence obtained from the text, e.g., a phone sequence, a grapheme sequence, a character sequence, etc.
  • the TTS model 100 may comprise an acoustic model 110.
  • the acoustic model 110 may predict or generate acoustic features 106 according to the text 102.
  • the acoustic features 106 may comprise various TTS acoustic features, e.g., mel-spectrum, line spectral pairs (LSP), etc.
  • the acoustic model 110 may be based on various model architectures, e.g., a sequence-to-sequence model architecture, etc.
  • FIG.1 illustrates an exemplary sequence-to-sequence acoustic model 110, which may comprise an encoder 112, an attention module 114, and a decoder 116.
  • the encoder 112 may convert information contained in the text 102 into a space that is more robust and more suitable for learning alignment with acoustic features.
  • the encoder 112 may convert the information in the text 102 into a state sequence in the space, which may also be referred to as an encoder state sequence.
  • Each state in the state sequence corresponds to a phone, grapheme, character, etc. in the text 102.
  • the attention module 114 may implement an attention mechanism. The attention mechanism establishes a connection between the encoder 112 and the decoder 116, to facilitate aligning between text features output by the encoder 112 and the acoustic features.
  • a connection between each decoding step and the encoder state may be established, and the connection may indicate which encoder state each decoding step should correspond to with what weight.
  • the attention module 114 may take the encoder state sequence and the output of the previous step of the decoder as input, and generate an attention vector that represents weights for the next decoding step to align with various encoder states.
  • the decoder 116 may map the state sequence output by the encoder 112 to the acoustic features 106 under the influence of the attention mechanism in the attention module 114. In each decoding step, the decoder 116 may take the attention vector output by the attention module 114 and the output of the previous step of the decoder as input, and output the acoustic features of one or more frames, e.g., the mel-spectrum.
  • the TTS model 100 may comprise a vocoder 120.
  • the vocoder 120 may generate a speech waveform 108 based on the acoustic features 106 predicted by the acoustic model 110.
  • FIG.2 illustrates an exemplary process 200 for automatic audio content generation according to an embodiment.
  • the process 200 may be applied to a scenario of single-narrator audio content generation.
  • Text content 210, e.g., a textual story book, is the object processed by automatic audio content generation according to an embodiment, and it is intended to generate audio content, e.g., an audiobook, through performing the process 200 on a plurality of texts included in the text content 210 respectively.
  • a text 212 is currently extracted from the text content 210, and it is intended to generate a speech waveform corresponding to the text 212 through performing the process 200.
  • a context 222 corresponding to the text 212 may be constructed.
  • the context 222 may comprise one or more texts adjacent to the text 212 in the text content 210.
  • the context 222 may comprise at least one sentence before the text 212 and/or at least one sentence after the text 212. Therefore, the context 222 is actually a sentence sequence corresponding to the text 212.
  • the context 222 may also include more text or all text in the text content 210.
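  • As a minimal, non-limiting sketch (in Python), a context window of this kind may be built by collecting a few adjacent sentences on each side of the current text; the helper name construct_context and the window sizes are illustrative assumptions, not part of the disclosure:

        def construct_context(sentences, index, k_before=2, k_after=2):
            """Collect a window of adjacent sentences around the current text.

            sentences: the text sentences of the text content, in reading order.
            index: position of the current text in `sentences`.
            Returns (history, current, future); fewer sentences are returned
            at the edges of the text content.
            """
            history = sentences[max(0, index - k_before):index]
            future = sentences[index + 1:index + 1 + k_after]
            return history, sentences[index], future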
  • reference factors 232 may be determined based at least on the context 222.
  • the reference factors 232 may influence characteristics of synthesized speech in subsequent TTS speech synthesis.
  • the reference factors 232 may comprise a character category corresponding to the text 212, which indicates attributes such as age, gender, etc. of the character corresponding to the text 212. For example, if the text 212 is an utterance spoken by a young male, it may be determined that the character category corresponding to the text 212 is <youth>, <male>, etc. Generally, different character categories may correspond to different speech characteristics.
  • the reference factors 232 may also include a style corresponding to the text 212, which indicates, e.g., in which emotion type the text 212 is spoken.
  • the style corresponding to the text 212 is <angry>.
  • different styles may correspond to different speech characteristics.
  • the character category and the style may individually or jointly influence the characteristics of the synthesized speech.
  • the character category and the style, etc. may be predicted at 230 based on the context 222 through a previously-trained prediction model.
  • the TTS model 240 previously trained for a target speaker may be employed to generate the speech waveform.
  • the target speaker may be a narrator previously automatically determined or a narrator designated by the user.
  • the TTS model 240 may synthesize speech with a voice of the target speaker.
  • the TTS model 240 may be a linguistic feature-based TTS model, and the linguistic feature-based TTS model 240 here may consider at least reference factors to synthesize speech, which is different from a conventional linguistic feature-based TTS model.
  • the linguistic feature-based TTS model 240 may generate a speech waveform 250 corresponding to the text 212 based at least on the text 212 and a character category in the case that the reference factors 232 include the character category, or may generate a speech waveform 250 corresponding to the text 212 based at least on the text 212, a character category, and a style in the case that the reference factors 232 include both the character category and the style.
  • the TTS model 240 may be a context-based TTS model, and the context-based TTS model 240 here may consider at least reference factors to synthesize speech, which is different from a conventional context-based TTS model.
  • the context-based TTS model 240 may generate a speech waveform 250 corresponding to the text 212 based at least on the text 212, the context 222 and a character category in the case that the reference factors 232 include the character category, or may generate a speech waveform 250 corresponding to the text 212 based at least on the text 212, the context 222, a character category, and a style in the case that the reference factors 232 include both the character category and the style.
  • a plurality of speech waveforms corresponding to a plurality of texts included in the text content 210 may be generated through the process 200. All these speech waveforms may together form audio content corresponding to the text content 210.
  • the audio content may comprise speech of different character categories and/or different styles synthesized with the voice of the target speaker.
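  • The single-narrator flow of process 200 may be summarized by the following Python sketch; predict_reference_factors and tts_synthesize are hypothetical callables standing in for the prediction at 230 and the TTS model 240, and are not interfaces defined by the disclosure:

        def generate_audio_content(sentences, predict_reference_factors, tts_synthesize, k=2):
            """Single-narrator sketch: synthesize one waveform per sentence with
            the target speaker's voice, conditioned on reference factors
            determined from the sentence's context."""
            waveforms = []
            for i, text in enumerate(sentences):
                # Context 222: k adjacent sentences before and after the current text.
                context = sentences[max(0, i - k):i + 1 + k]
                # Hypothetical predictor (step 230), returning e.g.
                # {"character_category": "<youth>, <male>", "style": "happy"}.
                factors = predict_reference_factors(context)
                # Hypothetical TTS call (model 240); a context-based model would
                # also take the context itself as an input.
                waveforms.append(tts_synthesize(text, factors))
            return waveforms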
  • FIG.3 illustrates an exemplary process 300 for automatic audio content generation according to an embodiment.
  • the process 300 may be applied to a scenario of multi-narrator audio content generation.
  • a context 322 corresponding to the text 312 may be constructed.
  • the context 322 may comprise one or more texts adjacent to the text 312 in the text content 310.
  • the context 322 may also include more text or all text in the text content 310.
  • reference factors 332 may be determined based at least on the context 322, which is used to influence characteristics of synthesized speech in subsequent TTS speech synthesis.
  • the reference factors 332 may comprise a character category corresponding to the text 312.
  • the reference factors 332 may also comprise a character personality corresponding to the text 312, which indicates the personality of the character that speaks the text 312. For example, if the text 312 is an utterance spoken by an evil old witch, it may be determined that the character personality corresponding to the text 312 is <evil>. Generally, different character personalities may correspond to different speech characteristics.
  • the reference factors 332 may also comprise a character corresponding to the text 312, which indicates which character in the text content 310 speaks the text 312.
  • the reference factors 332 may also comprise a style corresponding to the text 312.
  • different reference factors may be predicted based on the context 322 through different previously-trained prediction models. These prediction models may comprise, e.g., a prediction model for predicting a character category and a style, a prediction model for predicting a character personality, a prediction model for predicting a character, etc.
  • a TTS model to be used may be selected from a candidate TTS model library 350 which is previously prepared.
  • the candidate TTS model library 350 may comprise a plurality of candidate TTS models previously-trained for different candidate speakers.
  • Each candidate speaker may have attributes in at least one of character category, character personality, character, etc.
  • attributes of candidate speaker 1 may comprise <old-aged>, <female>, <evil> and <witch>
  • attributes of candidate speaker 2 may comprise <middle-aged>, <male> and <cheerful>, etc.
  • a candidate speaker corresponding to the text 312 may be determined with at least one of the character category, the character personality, and the character in the reference factors 332, and a TTS model corresponding to the determined candidate speaker may be selected accordingly.
  • a TTS model 360 is selected from the candidate TTS model library 350 at 340 to be used to generate a speech waveform for the text 312.
  • the TTS model 360 may synthesize speech with a voice of a speaker corresponding to the model.
  • the TTS model 360 may be a linguistic feature-based TTS model, which may generate a speech waveform 370 corresponding to the text 312 based at least on the text 312.
  • the speech waveform 370 may be generated further based on the style through the linguistic feature-based TTS model 360.
  • the TTS model 360 may be a context-based TTS model, which may generate a speech waveform 370 corresponding to the text 312 based at least on the text 312 and the context 322.
  • the speech waveform 370 may be generated further based on the style through the context-based TTS model 360.
  • a plurality of speech waveforms corresponding to a plurality of texts included in the text content 310 may be generated through the process 300. All these speech waveforms may together form audio content corresponding to the text content 310.
  • the audio content may comprise speech synthesized with voices of different speakers automatically assigned to different characters, and optionally, the speech may have different styles.
  • FIG.4 illustrates an exemplary process 400 for preparing training data according to an embodiment.
  • a plurality of sets of matching audio content 402 and text content 404 may be previously obtained, e.g., audiobooks and corresponding text story books.
  • automatic segmentation may be performed on the audio content 402.
  • the audio content 402 may be automatically segmented into a plurality of audio segments, and each audio segment may correspond to one or more speech utterances.
  • the automatic segmentation at 410 may be performed through any known audio segmentation technique.
  • post-processing may be performed on the plurality of segmented audio segments with the text content 404.
  • post-processing at 420 may comprise utterance completion re-segmentation with the text content 404.
  • the text content 404 may be easily segmented into a plurality of text sentences through any known text segmentation technique, and then an audio segment corresponding to each text sentence may be determined with reference to the text sentence.
  • one or more audio segments obtained at 410 may be segmented or combined to match the text sentence.
  • an audio segment and a text sentence may be aligned in time, and a plurality of pairs of <text sentence, audio segment> may be formed.
  • post-processing at 420 may comprise classifying audio segments into narration and conversation. For example, through performing classification processing for identifying narration and conversation on a text sentence, an audio segment corresponding to the text sentence may be classified as narration or conversation.
  • a label may be added to a <text sentence, audio segment> pair involving a conversation.
  • the label may comprise, e.g., character category, style, etc.
  • a character category, style, etc. of each <text sentence, audio segment> pair may be determined by means of automatic clustering.
  • a character category, style, etc. of each <text sentence, audio segment> pair may be labeled manually.
  • Each piece of training data may have a form of, e.g., <text sentence, audio segment, character category, style>.
  • the training data 406 may be further applied to train a prediction model for predicting a character category and a style corresponding to a text, a TTS model for generating a speech waveform, etc.
  • any other form of training data may also be prepared in a similar manner.
  • the label added at 430 may also include character, character personality, etc.
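  • A training record of the form described above might be represented as follows (a Python sketch; the field names and the optional character and personality extensions are illustrative assumptions):

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class TrainingExample:
            """One <text sentence, audio segment, character category, style> record,
            optionally extended with character and character personality labels."""
            text: str
            audio_path: str                      # aligned audio segment
            character_category: str              # e.g. "<youth>, <male>"
            style: str                           # e.g. "happy"
            character: Optional[str] = None      # optional label, e.g. "Mike"
            personality: Optional[str] = None    # optional label, e.g. "<cheerful>"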
  • FIG.5 illustrates an exemplary process 500 for predicting a character category and a style according to an embodiment.
  • the process 500 may be performed by a prediction model for predicting a character category and a style.
  • the prediction model may automatically assign a character category and a style to a text.
  • a context corresponding to the text 502 may be constructed at 510.
  • the processing at 510 may be similar to the processing at 220 in FIG.2.
  • the constructed context may be provided to a previously-trained language model 520.
  • the language model 520 is used to model and represent text information, and it may be trained to generate a latent space representation for input text, e.g., an embedding representation.
  • the language model 520 may be based on any appropriate technology, e.g., a Bidirectional Encoder Representations from Transformers (BERT) model, etc.
  • the embedding representation output by the language model 520 may be sequentially provided to a projection layer 530 and a softmax layer 540.
  • the projection layer 530 may convert the embedding representation into a projected representation, and the softmax layer 540 may calculate probabilities of different character categories and probabilities of different styles based on the projected representation, thereby finally determining a character category 504 and a style 506 corresponding to the text 502.
  • a prediction model for performing the process 500 may be trained with the training data obtained through the process 400 of FIG.4. For example, training data in the form of <text, character category, style> may be used to train the prediction model. In a stage of applying the trained prediction model, the prediction model may predict a character category and a style corresponding to an input text based on the input text.
  • the prediction model may jointly predict the character category and the style; alternatively, the character category and the style may be separately predicted by employing separate prediction models through a similar process.
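  • One way to realize the prediction model of FIG.5 is sketched below (PyTorch is used here purely for illustration); the language model is passed in as a module that maps token ids to a context embedding (e.g., a BERT encoder), and the embedding size, projection size, and label-set sizes are assumed hyperparameters rather than values given in the disclosure:

        import torch
        import torch.nn as nn

        class CategoryStylePredictor(nn.Module):
            """Context embedding -> projection -> softmax heads for character
            category and style, following the layout of FIG.5."""

            def __init__(self, language_model, embed_dim, num_categories, num_styles, proj_dim=256):
                super().__init__()
                self.language_model = language_model        # maps token ids to [batch, embed_dim]
                self.projection = nn.Linear(embed_dim, proj_dim)
                self.category_head = nn.Linear(proj_dim, num_categories)
                self.style_head = nn.Linear(proj_dim, num_styles)

            def forward(self, context_token_ids):
                context_embedding = self.language_model(context_token_ids)   # [batch, embed_dim]
                projected = torch.tanh(self.projection(context_embedding))
                category_probs = torch.softmax(self.category_head(projected), dim=-1)
                style_probs = torch.softmax(self.style_head(projected), dim=-1)
                return category_probs, style_probs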
  • FIG.6 illustrates an exemplary implementation 600 of speech synthesis employing a linguistic feature-based TTS model according to an embodiment.
  • the implementation 600 may be applied to a scenario of single-narrator audio content generation, and it may be regarded as an exemplary specific implementation of the process 200 in FIG.2. Assume that it is desired to generate a speech waveform for a text 602 in FIG.6.
  • a style 612 and a character category 614 corresponding to the text 602 may be predicted through a prediction model 610.
  • the prediction model 610 may perform prediction based on, e.g., the process 500 of FIG.5.
  • a character category embedding representation 616 corresponding to the character category 614 may be obtained through, e.g., a character category embedding look-up table (LUT).
  • the style 612 and the character category embedding representation 616 may be cascaded to obtain a cascaded representation.
  • a corresponding latent representation may be generated based on the cascaded representation.
  • the latent representation generation at 620 may be performed in various ways, e.g., Gaussian Mixture Variational Auto Encoders (GMVAE), Vector Quantization VAE (VQ-VAE), VAE, Global Style Representation (GST), etc.
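  • As one illustrative possibility, the latent representation generation at 620 could be realized with a simple VAE-style module over the cascaded style and character category representation; GMVAE, VQ-VAE, or GST variants would differ, and the dimensions below are assumptions rather than values from the disclosure:

        import torch
        import torch.nn as nn

        class ConditionLatent(nn.Module):
            """VAE-style latent over the cascaded <style, character category> vector."""

            def __init__(self, condition_dim, latent_dim=64):
                super().__init__()
                self.to_mean = nn.Linear(condition_dim, latent_dim)
                self.to_logvar = nn.Linear(condition_dim, latent_dim)

            def forward(self, condition):
                mean = self.to_mean(condition)
                logvar = self.to_logvar(condition)
                if self.training:
                    # Reparameterization trick: sample around the mean during training.
                    latent = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
                else:
                    latent = mean
                return latent, mean, logvar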
  • front-end analysis may be performed on the text 602 to extract a phone feature 632 and a prosody feature 634. Any known TTS front-end analysis technique may be used to perform the front-end analysis at 630.
  • the phone feature 632 may refer to a phone sequence extracted from the text 602.
  • the prosody feature 634 may refer to prosody information corresponding to the text 602, e.g., break, accent, intonation, rate, etc.
  • the phone feature 632 and the prosody feature 634 may be encoded with an encoder 640.
  • the encoder 640 may be based on any architecture. As an example, an instance of the encoder 640 is presented in FIG.7.
  • FIG.7 illustrates an exemplary implementation of an encoder 710 in a linguistic feature-based TTS model according to an embodiment.
  • the encoder 710 may correspond to the encoder 640 in FIG.6.
  • the encoder 710 may encode a phone feature 702 and a prosody feature 704, wherein the phone feature 702 and the prosody feature 704 may correspond to the phone feature 632 and the prosody feature 634 in FIG.6, respectively.
  • Feature extraction may be performed on the phone feature 702 and the prosody feature 704 through the processing of a 1-D convolution filter 712, a max pooling layer 714, a 1-D convolution projection 716, etc., respectively.
  • the output of the 1-D convolution projection 716 may be added to the phone feature 702 and the prosody feature 704.
  • the added output at 718 may then be processed through a highway network layer 722 and a Bidirectional Long Short Term Memory (BLSTM) layer 724 to obtain the encoder output.
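  • A PyTorch sketch of an encoder along the lines of FIG.7 is given below; for brevity the phone feature and the prosody feature are assumed to be concatenated into a single input sequence, and all layer sizes are illustrative assumptions:

        import torch
        import torch.nn as nn

        class LinguisticEncoder(nn.Module):
            """1-D convolution, max pooling, 1-D convolution projection, a residual
            add with the input features, a highway layer, and a BLSTM."""

            def __init__(self, in_dim, conv_dim=256, lstm_dim=128):
                super().__init__()
                self.conv = nn.Conv1d(in_dim, conv_dim, kernel_size=3, padding=1)
                self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
                self.proj = nn.Conv1d(conv_dim, in_dim, kernel_size=3, padding=1)
                self.highway_transform = nn.Linear(in_dim, in_dim)
                self.highway_gate = nn.Linear(in_dim, in_dim)
                self.blstm = nn.LSTM(in_dim, lstm_dim, batch_first=True, bidirectional=True)

            def forward(self, features):            # features: [batch, time, in_dim]
                x = features.transpose(1, 2)        # Conv1d expects [batch, channels, time]
                x = torch.relu(self.conv(x))
                x = self.pool(x)[:, :, :features.size(1)]     # keep the original length
                x = self.proj(x).transpose(1, 2)
                x = x + features                    # residual add with the input features
                gate = torch.sigmoid(self.highway_gate(x))    # highway layer
                x = gate * torch.relu(self.highway_transform(x)) + (1 - gate) * x
                outputs, _ = self.blstm(x)          # [batch, time, 2 * lstm_dim]
                return outputs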
  • the output of the encoder 640 and the latent representation obtained at 620 may be added.
  • the linguistic feature-based TTS model in FIG.6 may be used for single-narrator audio content generation. Therefore, the TTS model may be trained for a target speaker 604.
  • a speaker embedding representation 606 corresponding to the target speaker 604 may be obtained through, e.g., a speaker embedding LUT, and the speaker embedding representation 606 may be used to influence the TTS model to perform speech synthesis with a voice of the target speaker 604.
  • the added output at 642 may be cascaded with the speaker embedding representation 606, and the cascaded output may be provided to an attention module 650.
  • the decoder 660 may generate acoustic features under the influence of the attention module 650, e.g., mel-spectrum features, etc.
  • a vocoder 670 may generate a speech waveform 608 corresponding to the text 602 based on the acoustic features.
  • the implementation 600 in FIG.6 intends to illustrate an exemplary architecture for speech synthesis employing a linguistic feature-based TTS model.
  • At least the training data obtained through the process 400 of FIG.4 may be used for model training.
  • text and speech waveform pairs, as well as labels such as the corresponding character category, style, etc., may be obtained.
  • the input of the speaker embedding representation may be omitted.
  • any component and processing in the implementation 600 are exemplary, and any form of change may be made to the implementation 600 depending on specific requirements and designs.
  • FIG.8 illustrates an exemplary implementation 800 of speech synthesis employing a context-based TTS model according to an embodiment.
  • the implementation 800 may be applied to a scenario of single-narrator audio content generation, and it may be regarded as an exemplary specific implementation of the process 200 in FIG.2.
  • the implementation 800 is similar to the implementation 600 in FIG.6, except that a TTS model uses context encoding. Assume that it is desired to generate a speech waveform for a text 802 in FIG.8.
  • a style 812 and a character category 814 corresponding to the text 802 may be predicted through a prediction model 810.
  • the prediction model 810 may be similar to the prediction model 610 in FIG.6.
  • a character category embedding representation 816 corresponding to the character category 814 may be obtained, and at 818, the style 812 and the character category embedding representation 816 may be cascaded to obtain a cascaded representation.
  • a corresponding latent representation may be generated based on the cascaded representation.
  • the latent representation generation at 820 may be similar to the latent representation generation at 620 in FIG.6.
  • a phone feature 832 may be extracted through performing front-end analysis (not shown) on the text 802.
  • the phone feature 832 may be encoded with a phone encoder 830.
  • the phone encoder 830 may be similar to the encoder 640 in FIG.6, except that only the phone feature is used as input.
  • Context information 842 may be extracted from the text 802, and the context information 842 may be encoded with a context encoder 840.
  • the context information 842 may correspond to, e.g., the context 222 in FIG.2, or various information suitable for the context encoder 840 further extracted from the context 222.
  • the context encoder 840 may be any known context encoder that may be used in a TTS model. As an example, an instance of the context encoder 840 is presented in FIG.9.
  • FIG.9 illustrates an exemplary implementation of a context encoder 900 in a context-based TTS model.
  • the context encoder 900 may correspond to the context encoder 840 in FIG.8.
  • the context encoder 900 may perform encoding on the context information 902, which may correspond to the context information 842 in FIG.8.
  • the context encoder 900 may comprise a word encoder 910, which is used to perform encoding on a current text, such as the text 802, to obtain a current semantic feature.
  • the word encoder 910 may comprise, e.g., an embedding layer, an up-sampling layer, an encoding layer, etc., wherein the embedding layer is used to generate a word embedding sequence for a word sequence in the current text, the up- sampling layer is used to up-sample the word embedding sequence to align with a phone sequence of the current text, and the encoding layer is used to encode the up-sampled word embedding sequence into the current semantic feature through, e.g., a convolutional layer, a BLSTM layer, etc.
  • a history text, a future text, a paragraph text, etc. may be extracted from the context information 902.
  • the history text may comprise one or more sentences before the current text
  • the future text may comprise one or more sentences after the current text
  • the paragraph text may comprise all sentences in a paragraph where the current text is located.
  • the context encoder may comprise a history and future encoder 920 for performing encoding on the history text, the future text, and the paragraph text to obtain a history semantic feature, a future semantic feature, and a paragraph semantic feature, respectively.
  • the history and future encoder 920 may comprise, e.g., an embedding layer, an up-sampling layer, a dense layer, an encoding layer, etc., wherein the embedding layer is used to generate a word embedding sequence for a word sequence in an input text, the up-sampling layer is used to up-sample the word embedding sequence to align with a phone sequence of a current text, the dense layer is used to generate a compressed representation for the up-sampled word embedding sequence, and the encoding layer is used to encode the compressed word embedding sequence into the semantic feature corresponding to the input text.
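  • The word encoder 910 described above might be sketched as follows in PyTorch; the alignment input phones_per_word (how many phones each word spans) is an assumed output of the front end, and a single utterance per batch is assumed for simplicity:

        import torch
        import torch.nn as nn

        class WordEncoder(nn.Module):
            """Word embeddings are up-sampled to phone resolution and encoded by a
            convolution + BLSTM stack into the current semantic feature."""

            def __init__(self, vocab_size, embed_dim=128, out_dim=128):
                super().__init__()
                self.embedding = nn.Embedding(vocab_size, embed_dim)
                self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
                self.blstm = nn.LSTM(embed_dim, out_dim // 2, batch_first=True, bidirectional=True)

            def forward(self, word_ids, phones_per_word):
                # word_ids: [1, words]; phones_per_word: LongTensor of shape [words].
                embedded = self.embedding(word_ids)                      # [1, words, embed_dim]
                # Up-sample: repeat each word embedding once per phone it covers,
                # so the sequence aligns with the phone sequence of the current text.
                upsampled = torch.repeat_interleave(embedded, phones_per_word, dim=1)
                x = torch.relu(self.conv(upsampled.transpose(1, 2))).transpose(1, 2)
                outputs, _ = self.blstm(x)                               # [1, phones, out_dim]
                return outputs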
  • the output of the phone encoder 830, the output of the context encoder 840, and the latent representation obtained at 820 may be added.
  • the output of the context encoder 840 may also be provided to an attention module 844.
  • the context-based TTS model in FIG.8 may be used for single-narrator audio content generation. Therefore, the TTS model may be trained for a target speaker 804. A speaker embedding representation 806 corresponding to the target speaker 804 may be obtained, which is used to influence the TTS model to perform speech synthesis with a voice of the target speaker 804. At 854, the added output at 852 may be cascaded with the speaker embedding representation 806, and the cascaded output may be provided to the attention module 860.
  • the output of the attention module 844 and the output of the attention module 860 may be cascaded to influence generation of acoustic features at a decoder 880.
  • a vocoder 890 may generate a speech waveform 808 corresponding to the text 802 based on the acoustic features.
  • At least the training data obtained through the process 400 of FIG.4 may be used for model training in FIG.8. It should be appreciated that, optionally, in the actual application stage, since the TTS model has been trained to synthesize speech based on the voice of the target speaker, the input of the speaker embedding representation may be omitted. In addition, it should be appreciated that any component and processing in the implementation 800 are exemplary, and any form of change may be made to the implementation 800 depending on specific requirements and designs.
  • FIG.10 illustrates an exemplary process 1000 for predicting a character and selecting a TTS model according to an embodiment.
  • the process 1000 may be performed in a scenario of multi-narrator audio content generation, and it is an exemplary implementation of at least a part of the process 300 in FIG.3.
  • the process 1000 may be used to determine a specific character corresponding to a text, and to select a TTS model trained based on a voice of a speaker corresponding to the character.
  • a text 1002 is from text content 1004.
  • a context corresponding to the text 1002 may be constructed.
  • an embedding representation of the context may be generated through, e.g., a previously-trained language model.
  • a plurality of candidate characters may be extracted from the text content 1004. Assuming that the text content 1004 is a text story book, all characters involved in the text story book may be extracted at 1030 to form a list of candidate characters.
  • the candidate character extraction at 1030 may be performed through any known technique.
  • context-based candidate feature extraction may be performed. For example, for the current text 1002, one or more candidate features may be extracted from the context for each candidate character. Assuming that there are N candidate characters in total, N candidate feature vectors may be obtained at 1040, wherein each candidate feature vector includes candidate features extracted for one candidate character. Various types of features may be extracted at 1040. In an implementation, the extracted features may comprise the number of words spaced between the current text and the candidate character. Since the character's name usually appears near the character's utterance, this feature helps determine whether the current text is spoken by a certain candidate character. In an implementation, the extracted features may comprise the number of occurrences of a candidate character in the context.
  • the extracted feature may comprise a binary feature indicating whether a name of a candidate character appears in the current text. Generally, a character is unlikely to mention his/her name in an utterance he/she speaks. In an implementation, the extracted feature may comprise a binary feature indicating whether a name of a candidate character appears in the closest previous text or the closest subsequent text. Since an alternating speaker mode is often used in a conversation between two characters, a character that speaks a current text is likely to appear in the closest previous text or the closest subsequent text. It should be appreciated that the extracted features may also include any other features that help determine, e.g., the character corresponding to the text 1002.
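  • The candidate features listed above can be illustrated with the following Python sketch; sentence-level distance is used here as a simplification of the word distance in the description, and the function signature is an illustrative assumption:

        def candidate_features(current_idx, sentences, candidate, context_indices):
            """Context-based features for one candidate character.

            sentences: all sentences of the text content; current_idx: position of
            the current text; context_indices: positions forming its context.
            """
            current = sentences[current_idx]

            # Distance to the nearest mention of the candidate (sentence-level here,
            # as a simplification of the word distance in the description).
            distance = min(
                (abs(current_idx - i) for i in context_indices if candidate in sentences[i]),
                default=len(sentences),
            )
            # Number of occurrences of the candidate in the context.
            occurrences = sum(sentences[i].count(candidate) for i in context_indices)
            # Binary feature: does the candidate's name appear in the current text?
            in_current = int(candidate in current)
            # Binary features: closest previous / closest subsequent text.
            in_previous = int(current_idx > 0 and candidate in sentences[current_idx - 1])
            in_next = int(current_idx + 1 < len(sentences) and candidate in sentences[current_idx + 1])

            return [distance, occurrences, in_current, in_previous, in_next]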
  • the context embedding representation generated at 1020 may be combined with all candidate feature vectors extracted at 1040 to form a candidate feature matrix corresponding to all candidate characters.
  • a character 1062 corresponding to the text 1002 may be determined from a plurality of candidate characters based at least on the context through a learning-to-rank (LTR) model 1060.
  • the LTR model 1060 may rank multiple candidate characters based on the candidate feature matrix obtained from the context, and determine a highest-ranked candidate character as a character 1062 corresponding to the text 1002.
  • the LTR model 1060 may be constructed using various technologies, e.g., a ranking support vector machine (SVM), a RankNet, an Ordinal classification, etc.
  • the LTR model 1060 may be regarded as a prediction model for predicting characters based on a context, or more broadly, the combination of the LTR model 1060 and the steps 1010, 1020, 1030, 1040, and 1050 may be regarded as a prediction model for predicting characters based on a context.
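  • A minimal PyTorch sketch of such an LTR step is given below; only inference-time scoring is shown, a pairwise (e.g., RankNet-style) loss would be used for training, and the scorer architecture and sizes are assumptions:

        import torch
        import torch.nn as nn

        class CharacterRanker(nn.Module):
            """Score each candidate character's feature vector and pick the
            highest-ranked candidate."""

            def __init__(self, feature_dim, hidden_dim=64):
                super().__init__()
                self.scorer = nn.Sequential(
                    nn.Linear(feature_dim, hidden_dim),
                    nn.ReLU(),
                    nn.Linear(hidden_dim, 1),
                )

            def forward(self, candidate_matrix):     # [num_candidates, feature_dim]
                scores = self.scorer(candidate_matrix).squeeze(-1)
                return torch.argmax(scores), scores  # index of the predicted character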
  • a character personality 1072 of the character corresponding to the text 1002 may be predicted based at least on the context through a personality prediction model 1070.
  • the personality prediction model 1070 may predict the character personality 1072 based on the candidate feature matrix obtained from the context.
  • the personality prediction model 1070 may be constructed based on a process similar to the process 500 of FIG.5, except that it is trained for a character personality classification task with text and character personality training data pairs.
  • a TTS model 1090 to be used may be selected from a candidate TTS model library 1082 which is previously prepared.
  • the candidate TTS model library 1082 may comprise a plurality of candidate TTS models previously-trained for different candidate speakers. Each candidate speaker may have attributes in at least one of character category, character personality, character, etc.
  • a candidate speaker corresponding to the text 1002 may be determined with at least one of the character 1062, the character personality 1072, and the character category 1006, and a TTS model 1090 corresponding to the determined candidate speaker may be selected accordingly.
  • the character category 1006 may be determined through, e.g., the process 500 of FIG.5.
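  • The selection of a candidate TTS model from the library may be pictured as a simple attribute-matching step, as in the following Python sketch; the library layout, a mapping from each model to a set of attribute strings, is an illustrative assumption:

        # Example library layout (illustrative only):
        # model_library = {"speaker1_tts": {"old-aged", "female", "evil", "witch"},
        #                  "speaker2_tts": {"middle-aged", "male", "cheerful"}}

        def select_tts_model(model_library, character=None, personality=None, category=None):
            """Pick the candidate TTS model whose speaker attributes best match the
            predicted character, character personality, and character category."""
            wanted = {a for a in (character, personality, category) if a}
            best_model, best_overlap = None, -1
            for model, attributes in model_library.items():
                overlap = len(wanted & set(attributes))
                if overlap > best_overlap:
                    best_model, best_overlap = model, overlap
            return best_model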
  • FIG.11 illustrates an exemplary implementation 1100 of speech synthesis employing a linguistic feature-based TTS model according to an embodiment.
  • the implementation 1100 may be applied to a scenario of multi-narrator audio content generation, and it may be regarded as an exemplary specific implementation of the process 300 in FIG.3. Assume that it is desired to generate a speech waveform for a text 1102 in FIG.11.
  • a character personality 1112 corresponding to the text 1102 may be predicted through a prediction model 1110.
  • the prediction model 1110 may correspond to, e.g., the personality prediction model 1070 in FIG.10.
  • a character 1122 corresponding to the text 1102 may be predicted through a prediction model 1120.
  • the prediction model 1120 may correspond to, e.g., the prediction model for predicting a character described above in conjunction with FIG.10.
  • a character category 1132 and a style 1134 corresponding to the text 1102 may be predicted through a prediction model 1130.
  • the prediction model 1130 may be constructed based on, e.g., the process 500 of FIG.5.
  • a latent representation corresponding to the style 1134 may be generated.
  • front-end analysis may be performed on the text 1102 to extract a phone feature 1142 and a prosody feature 1144.
  • the phone feature 1142 and the prosody feature 1144 may be encoded with an encoder 1150.
  • the output of the encoder 1150 and the latent representation obtained at 1136 may be added.
  • the linguistic feature-based TTS model in FIG.11 may be used for multi-narrator audio content generation.
  • a candidate speaker 1104 corresponding to the character 1122 may be determined based on at least one of the character 1122, the character personality 1112, and a character category 1132 in a manner similar to the process 1000 of FIG.10.
  • the TTS model may be trained for the candidate speaker 1104.
  • a speaker embedding representation 1106 corresponding to the candidate speaker 1104 may be obtained through, e.g., a speaker embedding LUT, and the speaker embedding representation 1106 may be used to influence the TTS model to perform speech synthesis with a voice of the candidate speaker 1104.
  • the added output at 1152 may be cascaded with the speaker embedding representation 1106, and the cascaded output may be provided to an attention module 1160.
  • a decoder 1170 may generate acoustic features under the influence of the attention module 1160.
  • a vocoder 1180 may generate a speech waveform 1108 corresponding to the text 1102 based on the acoustic features.
  • the implementation 1100 in FIG.11 intends to illustrate an exemplary architecture for speech synthesis employing a linguistic feature-based TTS model.
  • a plurality of candidate TTS models may be obtained.
  • a candidate speaker corresponding to a character may be determined based on at least one of the character, a character personality, and a character category, and a TTS model trained for the candidate speaker may be further selected to generate a speech waveform.
  • any component and processing in the implementation 1100 are exemplary, and any form of change may be made to the implementation 1100 depending on specific requirements and designs.
  • FIG.12 illustrates an exemplary implementation 1200 of speech synthesis employing a context-based TTS model according to an embodiment.
  • the implementation 1200 may be applied to a scenario of multi-narrator audio content generation, and it may be regarded as an exemplary specific implementation of the process 300 in FIG.3.
  • the implementation 1200 is similar to the implementation 1100 in FIG.11, except that the TTS model uses context encoding. Assume that it is desired to generate a speech waveform for a text 1202 in FIG.12.
  • a character personality 1212 corresponding to the text 1202 may be predicted through a prediction model 1210.
  • the prediction model 1210 may be similar to the prediction model 1110 in FIG.11.
  • a character 1222 corresponding to the text 1202 may be predicted through a prediction model 1220.
  • the prediction model 1220 may be similar to the prediction model 1120 in FIG.11.
  • a character category 1232 and a style 1234 corresponding to the text 1202 may be predicted through a prediction model 1230.
  • the prediction model 1230 may be similar to the prediction model 1130 in FIG.11.
  • a latent representation corresponding to the style 1234 may be generated.
  • a phone feature 1242 may be extracted through performing front-end analysis (not shown) on the text 1202.
  • the phone feature 1242 may be encoded with a phone encoder 1240.
  • the phone encoder 1240 may be similar to the phone encoder in FIG.8.
  • Context information 1252 may be extracted from the text 1202, and the context information 1252 may be encoded with a context encoder 1250.
  • the context encoder 1250 may be similar to the context encoder 840 in FIG.8.
  • the output of the phone encoder 1240, the output of the context encoder 1250, and the latent representation obtained at 1236 may be added.
  • the output of the context encoder 1250 may also be provided to an attention module 1254.
  • the context-based TTS model in FIG.12 may be used for multi-narrator audio content generation.
  • a candidate speaker 1204 corresponding to a character 1222 may be determined based on at least one of the character 1222, a character personality 1212, and a character category 1232 in a manner similar to the process 1000 of FIG.10.
  • the TTS model may be trained for the candidate speaker 1204.
  • a speaker embedding representation 1206 corresponding to the candidate speaker 1204 may be used to influence the TTS model to perform speech synthesis with a voice of the candidate speaker 1204.
  • the added output at 1262 may be cascaded with the speaker embedding representation 1206, and the cascaded output may be provided to an attention module 1270.
  • the output of the attention module 1254 and the output of the attention module 1270 may be cascaded to influence generation of acoustic features at a decoder 1280.
  • a vocoder 1290 may generate a speech waveform 1208 corresponding to the text 1202 based on the acoustic features.
  • the implementation 1200 in FIG.12 intends to illustrate an exemplary architecture of speech synthesis employing a context-based TTS model.
  • a candidate speaker corresponding to a character may be determined based on at least one of the character, a character personality, and a character category, and a TTS model trained for the candidate speaker may be further selected to generate a speech waveform.
  • any component and processing in the implementation 1200 are exemplary, and any form of change may be made to the implementation 1200 depending on specific requirements and designs.
  • audio content may also be customized.
  • a speech waveform in generated audio content may be adjusted to update the audio content.
  • FIG.13 illustrates an exemplary process 1300 for updating audio content according to an embodiment.
  • Audio content 1304 corresponding to the text content 1302 may be created through performing audio content generation at 1310.
  • the audio content generation at 1310 may be based on any implementation of the automatic audio content generation according to the embodiments of the present disclosure described above in conjunction with FIGs. 2 to 12.
  • the audio content 1304 may be provided to a customization platform 1320.
  • the customization platform 1320 may comprise a user interface for interacting with a user. Through the user interface, the audio content 1304 may be provided and presented to the user, and adjustment indication 1306 from the user for at least a part of the audio content may be received. For example, if the user is not satisfied with a certain utterance in the audio content 1304 or desires to modify the utterance to be a desired character category, desired style, etc., the user may input the adjustment indication 1306 through the user interface.
  • the adjustment indication 1306 may comprise modification or setting for various parameters involved in speech synthesis.
  • the adjustment indication may comprise adjustment information about prosody information.
  • the prosody information may comprise, e.g., at least one of break, accent, intonation and rate.
  • the user may specify a break before or after a certain word, specify an accent of a certain utterance, change an intonation of a certain word, adjust a rate of a certain utterance, etc.
  • the adjustment indication may comprise adjustment information about pronunciation.
  • the user may specify the correct pronunciation of a certain polyphonic word in current audio content, etc.
  • the adjustment indication may comprise adjustment information about character category.
  • the user may specify a desired character category of "old-aged man" for utterances with the timbre of "middle-aged man".
  • the adjustment indication may comprise adjustment information about style.
  • the user may specify the desired emotion of "happy" for utterances with an emotion of "sad".
  • the adjustment indication may comprise adjustment information about acoustic parameters.
  • the user may specify a specific acoustic parameter for a certain utterance. It should be appreciated that only a few examples of the adjustment indication 1306 are listed above, and the adjustment indication 1306 may also include modification or setting for any other parameter that can influence speech synthesis.
  • the customization platform 1320 may call a TTS model 1330 to regenerate a speech waveform.
  • a text corresponding to the speech waveform may be provided to the TTS model 1330 along with the adjustment information in the adjustment indication.
  • the TTS model 1330 may then regenerate the speech waveform 1332 of the text conditioned on the adjustment information.
  • a character category specified in the adjustment indication 1306 may be used to replace the character category previously determined, e.g., as in FIG.2, and the speech waveform is then regenerated by the TTS model.
  • the TTS model 1330 may employ the linguistic feature-based TTS model.
  • the previous speech waveform in the audio content 1304 may be replaced with the regenerated speech waveform 1332 to form an updated audio content 1308.
  • the process 1300 may be performed iteratively, thereby realizing continuous adjustment and optimization for the generated audio content. It should be appreciated that any step and processing in the process 1300 are exemplary, and any form of change may be made to the process 1300 depending on specific requirements and designs.
  • FIG.14 illustrates a flowchart of an exemplary method 1400 for automatic audio content generation according to an embodiment.
  • a text may be obtained.
• context corresponding to the text may be constructed.
• reference factors may be determined based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text.
  • a speech waveform corresponding to the text may be generated based at least on the text and the reference factors.
  • the reference factors may further comprise a style corresponding to the text.
  • the determining reference factors may comprise: predicting the character category based at least on the context through a prediction model.
• the generating a speech waveform may comprise: generating the speech waveform based at least on the text and the character category through a linguistic feature-based TTS model.
  • the linguistic feature-based TTS model may be previously trained for a target speaker.
  • the generating a speech waveform may comprise: generating the speech waveform based at least on the text, the context and the character category through a context-based TTS model.
  • the context-based TTS model may be previously trained for a target speaker.
  • the determining reference factors may comprise: extracting a plurality of candidate characters from a text content containing the text; and determining the character from the plurality of candidate characters based at least on the context through a LTR model.
  • the generating a speech waveform comprises: selecting a TTS model corresponding to the character from a plurality of previously-trained candidate TTS models, the plurality of candidate TTS models being previously trained for different target speakers respectively; and generating the speech waveform through the selected TTS model.
  • the determining reference factors may comprise: predicting the character category based at least on the context through a first prediction model; predicting the character based at least on the context through a second prediction model; and predicting a character personality based at least on the context through a third prediction model.
  • the selecting a TTS model may comprise: selecting the TTS model from the plurality of candidate TTS models based on at least one of the character, the character category and the character personality.
• the selected TTS model may be a linguistic feature-based TTS model.
  • the generating a speech waveform may comprise: generating the speech waveform based at least on the text through the linguistic feature-based TTS model.
• the selected TTS model may be a context-based TTS model.
  • the generating a speech waveform may comprise: generating the speech waveform based at least on the text and the context through the context-based TTS model.
  • the speech waveform may be generated further based on a style corresponding to the text.
  • the method 1400 may further comprise: receiving an adjustment indication for the speech waveform; and in response to the adjustment indication, regenerating a speech waveform corresponding to the text through a linguistic feature-based TTS model.
  • the adjustment indication comprises at least one of: adjustment information about prosody information, the prosody information comprising at least one of break, accent, intonation and rate; adjustment information about pronunciation; adjustment information about character category; adjustment information about style; and adjustment information about acoustic parameters.
  • the method 1400 may further comprise any step/process for automatic audio content generation according to the embodiments of the present disclosure described above.
  • FIG.15 illustrates an exemplary apparatus 1500 for automatic audio content generation according to an embodiment.
  • the apparatus 1500 may comprise: a text obtaining module 1510, for obtaining a text; a context constructing module 1520, for constructing context corresponding to the text; a reference factor determining module 1530, for determining reference factors based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text; and a speech waveform generating module 1540, for generating a speech waveform corresponding to the text based at least on the text and the reference factors.
  • the reference factor determining module 1530 may be for: predicting the character category based at least on the context through a prediction model.
• the speech waveform generating module 1540 may be for: generating the speech waveform based at least on the text and the character category through a linguistic feature-based TTS model.
  • the linguistic feature-based TTS model may be previously trained for a target speaker.
  • the speech waveform generating module 1540 may be for: generating the speech waveform based at least on the text, the context and the character category through a context-based TTS model.
  • the context-based TTS model may be previously trained for a target speaker.
  • the reference factor determining module 1530 may be for: extracting a plurality of candidate characters from a text content containing the text; and determining the character from the plurality of candidate characters based at least on the context through a LTR model.
• the speech waveform generating module 1540 may be for: selecting a TTS model corresponding to the character from a plurality of previously-trained candidate TTS models, the plurality of candidate TTS models being previously trained for different target speakers respectively; and generating the speech waveform through the selected TTS model.
  • the apparatus 1500 may further comprise any other module that performs steps of the method for automatic audio content generation according to the embodiments of the present disclosure described above.
  • FIG.16 illustrates an exemplary apparatus 1600 for automatic audio content generation according to an embodiment.
• The apparatus 1600 may comprise: at least one processor 1610; and a memory 1620 storing computer-executable instructions that, when executed, cause the at least one processor 1610 to: obtain a text; construct context corresponding to the text; determine reference factors based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text; and generate a speech waveform corresponding to the text based at least on the text and the reference factors.
  • the processor 1610 may further perform any other step/process of the method for automatic audio content generation according to the embodiments of the present disclosure described above.
• the embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for automatic audio content generation according to the embodiments of the present disclosure as mentioned above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.
  • Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc.
  • the software may reside on a computer-readable medium.
  • a computer-readable medium may comprise, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.
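As a non-limiting illustration of the adjustment indication 1306 discussed in the list above, the following Python sketch bundles the optional overrides (prosody, pronunciation, character category, style, acoustic parameters) into one structure and passes them to a TTS model. The field names and the synthesize() call are assumptions introduced only for illustration and are not defined by the present disclosure.

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class AdjustmentIndication:
        # Prosody overrides: break, accent, intonation and rate.
        breaks: Optional[Dict[int, float]] = None        # word index -> pause length in seconds
        accents: Optional[Dict[int, bool]] = None        # word index -> accented or not
        intonation: Optional[str] = None                 # e.g. "rising" or "falling"
        rate: Optional[float] = None                     # relative speaking rate, 1.0 = unchanged
        pronunciations: Optional[Dict[int, str]] = None  # word index -> phone sequence for polyphonic words
        character_category: Optional[str] = None         # e.g. "old-aged man"
        style: Optional[str] = None                      # e.g. "happy"
        acoustic_params: Dict[str, float] = field(default_factory=dict)  # e.g. {"pitch_shift": 2.0}

    def regenerate(tts_model, text, adjustment):
        """Re-synthesize one utterance conditioned on the user's adjustments (synthesize() is assumed)."""
        overrides = {k: v for k, v in vars(adjustment).items() if v is not None}
        return tts_model.synthesize(text, **overrides)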

Abstract

The present disclosure provides a method and apparatus for automatic audio content generation. A text may be obtained. Context corresponding to the text may be constructed. Reference factors may be determined based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text. A speech waveform corresponding to the text may be generated based at least on the text and the reference factors.

Description

AUTOMATIC AUDIO CONTENT GENERATION
BACKGROUND
[0001] Text-to- Speech (TTS) synthesis aims at generating a corresponding speech waveform based on a text input. A conventional TTS model or system may predict acoustic features based on a textual input and further generate a speech waveform based on the predicted acoustic features. The TTS model may be applied to convert various types of text content into audio content, e.g., to convert books in text format into audiobooks, etc.
SUMMARY
[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0003] Embodiments of the present disclosure propose a method and apparatus for automatic audio content generation. A text may be obtained. Context corresponding to the text may be constructed. Reference factors may be determined based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text. A speech waveform corresponding to the text may be generated based at least on the text and the reference factors.
[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
[0006] FIG.1 illustrates an exemplary conventional TTS model.
[0007] FIG.2 illustrates an exemplary process for automatic audio content generation according to an embodiment.
[0008] FIG.3 illustrates an exemplary process for automatic audio content generation according to an embodiment.
[0009] FIG.4 illustrates an exemplary process for preparing training data according to an embodiment.
[0010] FIG.5 illustrates an exemplary process for predicting a character category and a style according to an embodiment.
[0011] FIG.6 illustrates an exemplary implementation of speech synthesis employing a linguistic feature-based TTS model according to an embodiment.
[0012] FIG.7 illustrates an exemplary implementation of an encoder in a linguistic feature-based TTS model according to an embodiment.
[0013] FIG.8 illustrates an exemplary implementation of speech synthesis employing a context-based TTS model according to an embodiment.
[0014] FIG.9 illustrates an exemplary implementation of a context encoder in a context-based TTS model.
[0015] FIG.10 illustrates an exemplary process for predicting a character and selecting a TTS model according to an embodiment.
[0016] FIG.11 illustrates an exemplary implementation of speech synthesis employing a linguistic feature-based TTS model according to an embodiment.
[0017] FIG.12 illustrates an exemplary implementation of speech synthesis employing a context-based TTS model according to an embodiment.
[0018] FIG.13 illustrates an exemplary process for updating audio content according to an embodiment.
[0019] FIG.14 illustrates a flowchart of an exemplary method for automatic audio content generation according to an embodiment.
[0020] FIG.15 illustrates an exemplary apparatus for automatic audio content generation according to an embodiment.
[0021] FIG.16 illustrates an exemplary apparatus for automatic audio content generation according to an embodiment.
DETAILED DESCRIPTION
[0022] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
[0023] Audiobooks are increasingly used for entertainment and education. Traditional audiobooks are recorded manually. For example, a professional narrator or voice actor narrates previously prepared text content, and an audiobook corresponding to the text content is obtained through recording the narrator's narration. This approach of recording an audiobook is very time-consuming and costly, and further results in corresponding audiobooks not being obtainable in time for a large number of text books.
[0024] TTS synthesis may improve the efficiency of audiobook generation and reduce costs. Most TTS models synthesize speech separately for each text sentence. The speech synthesized in this way usually has, e.g., a single utterance prosody, and thus sounds monotonous and boring. When this single utterance prosody is repeatedly applied to the entire chapters and paragraphs of an audiobook, it will significantly reduce the quality of the audiobook. In particular, if TTS synthesis only applies the voice of a single narrator to an entire audiobook, the monotonous speech expression will further reduce the appeal of the audiobook.
[0025] Embodiments of the present disclosure propose to perform automatic and high- quality audio content generation for text content. Herein, text content may broadly refer to any content in text form, e.g., a book, a script, an article, etc., while audio content may broadly refer to any content in audio form, e.g., an audiobook, a video dubbing, news broadcast, etc. Although conversion of a text story book into an audiobook is taken as an example in the various sections discussed below, it should be appreciated that the embodiments of the present disclosure may also be applied to convert text content in any other form into audio content in any other form.
[0026] The embodiments of the present disclosure may construct a context for a text sentence in text content, and use the context for TTS synthesis of the text sentence, instead of only considering the text sentence itself in the TTS synthesis. Generally, context of a text sentence may provide rich expression information about the text sentence, which may be used as reference factors for TTS synthesis, so that the synthesized speech will be more expressive, more vivid, and more diverse. Various reference factors may be determined based on the context, e.g., character, character category, style, character personality, etc. corresponding to the text sentence. Herein, a character may refer to a specific person, an anthropomorphic animal, an anthropomorphic object, etc., that appears in text content and has the ability to talk. For example, if text content involves a story between two people named "Mike" and "Mary", it may be considered that "Mike" and "Mary" are two characters in the text content. For example, if text content involves a story among a queen, a princess, and a witch, it may be considered that "queen", "princess", and "witch" are the characters in the text content. A character category may refer to a category attribute of a character, e.g., gender, age, etc. A style may refer to an emotion type corresponding to a text sentence, e.g., happy, sad, etc. A character personality may refer to a personality created for a character in text content, e.g., gentle, cheerful, evil, etc. Through considering these reference factors in TTS synthesis, speech with different speech characteristics may be synthesized for different characters, character categories, styles, etc., respectively, e.g., speech with different timbres, voice styles, etc. Thus, the expressiveness, vividness, and diversity of the synthesized speech may be enhanced, thereby significantly improving the quality of the synthesized speech.
[0027] In an aspect, the embodiments of the present disclosure may be applied to a scenario where audio content is synthesized with a voice of a single speaker, which may also be referred to as single-narrator audio content generation. The single speaker may be a pre-designated target speaker, and the voice of the target speaker may be used to simulate or play different types of characters in text content. A character category may be considered in the speech synthesis, so that speech corresponding to different character categories may be generated with the voice of the target speaker, e.g., speech corresponding to young men, speech corresponding to old women, etc. Optionally, a style may also be considered in the speech synthesis, so that different styles of speech may be generated with the voice of the target speaker, e.g., speech corresponding to an emotion type "happy", speech corresponding to an emotion type "sad", etc. Through considering the character category and the style in the single-narrator audio content generation, the expressiveness and vividness of the synthesized speech may be enhanced.
[0028] In an aspect, the embodiments of the present disclosure may be applied to a scenario where audio content is synthesized with voices of multiple speakers, which may also be referred to as multi -narrator audio content generation. The voices of different speakers may be used for different characters in text content. These speakers may be predetermined candidate speakers with different attributes. For a specific character, which speaker's voice to use may be determined with reference to at least a character category, a character personality, etc. For example, assuming that the character Mike is a cheerful young male, voice of a speaker with attributes such as <young>, <male>, and <cheerful>, etc., may be selected to generate Mike's speech. Through automatically assigning voices of different speakers to different characters in the speech synthesis, the diversity of audio content may be enhanced. Optionally, a style may also be considered in the speech synthesis, so that different styles of speech may be generated with a voice of a speaker corresponding to a character, thereby enhancing the expressiveness and vividness, etc., of the synthesized speech.
[0029] The embodiments of the present disclosure may employ various TTS models to synthesize speech with considering the reference factors described above. In an aspect, a linguistic feature-based TTS model may be employed. In the scenario of single-narrator audio content generation, a linguistic feature-based TTS model may be trained with a voice corpus of a target speaker, wherein the model may at least consider a character category, an optional style, etc. to generate speech. In the scenario of multi -narrator audio content generation, different versions of a linguistic feature-based TTS model may be trained with voice corpuses of different candidate speakers, and a corresponding version of the model may be selected for a specific character to generate speech for the character, or furthermore, different styles of speech for the character may be generated through considering styles. In another aspect, a context-based TTS model may be employed. In the scenario of single-narrator audio content generation, a context-based TTS model may be trained with a voice corpus of a target speaker, wherein the model may at least consider a context of a text sentence, character category, optional style, etc. to generate speech. In the scenario of multi -narrator audio content generation, different versions of a context-based TTS model may be trained with voice corpuses of different candidate speakers, and a corresponding version of the model may be selected for a specific character to generate speech for the character, or furthermore, different styles of speech for the character may be generated through considering styles.
[0030] The embodiments of the present disclosure also provide a flexible customization mechanism for audio content. For example, users may adjust or customize audio content through a visual customization platform. Various parameters involved in speech synthesis may be modified or set to adjust any part of audio content, so that a specific utterance in the audio content may have a desired character category, desired style, etc. Since the linguistic feature-based TTS model has explicit feature input, it may be used to update the audio content in response to an adjustment indication of a user.
[0031] The embodiments of the present disclosure may flexibly use a linguistic feature-based TTS model and/or a context-based TTS model to automatically generate high-quality audio content. The linguistic feature-based TTS model may generate high-quality speech through considering reference factors determined based on a context, and may be used to adjust or update the generated audio content. The context-based TTS model considers not only the reference factors determined based on the context in the speech synthesis, but also context features extracted from the context itself, thereby resulting in more coordinated speech synthesis for long text. In the audio content generated according to the embodiments of the present disclosure, utterances of a character will have stronger expressiveness, vividness and diversity, so that appeal and interestingness, etc., of the audio content may be significantly improved. Automatic audio content generation according to the embodiments of the present disclosure is fast and low-cost. In addition, since the embodiments of the present disclosure convert text content into high-quality audio content in a fully automatic manner, the barriers to audio content creation will be further reduced, so that not only professional voice actors but also ordinary users may easily and quickly create their own unique audio content.
[0032] FIG.1 illustrates an exemplary conventional TTS model 100.
[0033] The TTS model 100 may be configured to receive a text 102, and generate a speech waveform 108 corresponding to the text 102. The text 102 may also be referred to as a text sentence, which may comprise one or more words, phrases, sentences, paragraphs, etc., and herein, the terms "text" and "text sentence" may be used interchangeably. It should be appreciated that although the text 102 is shown provided to the TTS model 100 in FIG.1, the text 102 may be first converted into an element sequence, such as a phone sequence, a grapheme sequence, a character sequence, etc. which is provided to the TTS model 100 as input. Herein, an input "text" may broadly refer to a word, a phrase, a sentence, etc. included in the text, or an element sequence obtained from the text, e.g., a phone sequence, a grapheme sequence, a character sequence, etc.
[0034] The TTS model 100 may comprise an acoustic model 110. The acoustic model 110 may predict or generate acoustic features 106 according to the text 102. The acoustic features 106 may comprise various TTS acoustic features, e.g., mel-spectrum, linear spectrum pairs (LSP), etc. The acoustic model 110 may be based on various model architectures, e.g., a sequence-to-sequence model architecture, etc. FIG.1 illustrates an exemplary sequence-to-sequence acoustic model 110, which may comprise an encoder 112, an attention module 114, and a decoder 116.
[0035] The encoder 112 may convert information contained in the text 102 into a space that is more robust and more suitable for learning alignment with acoustic features. For example, the encoder 112 may convert the information in the text 102 into a state sequence in the space, which may also be referred to as an encoder state sequence. Each state in the state sequence corresponds to a phone, grapheme, character, etc. in the text 102.
[0036] The attention module 114 may implement an attention mechanism. The attention mechanism establishes a connection between the encoder 112 and the decoder 116, to facilitate alignment between text features output by the encoder 112 and the acoustic features. For example, a connection between each decoding step and the encoder state may be established, and the connection may indicate which encoder state each decoding step should correspond to with what weight. The attention module 114 may take the encoder state sequence and the output of the previous step of the decoder as input, and generate an attention vector that represents the weights with which the next decoding step aligns with the various encoder states.
[0037] The decoder 116 may map the state sequence output by the encoder 112 to the acoustic features 106 under the influence of the attention mechanism in the attention module 114. In each decoding step, the decoder 116 may take the attention vector output by the attention module 114 and the output of the previous step of the decoder as input, and output the acoustic features of one or more frames, e.g., the mel-spectrum.
[0038] The TTS model 100 may comprise a vocoder 120. The vocoder 120 may generate a speech waveform 108 based on the acoustic features 106 predicted by the acoustic model 110.
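As a rough, non-limiting sketch of the data flow of FIG.1 (encoder, attention, decoder, then vocoder), the following Python/PyTorch code wires simple stand-in modules together. The layer choices (GRU, multi-head attention) and sizes are assumptions introduced only for illustration, and the step-by-step decoding loop is collapsed into a single parallel attention pass for brevity; this does not represent the exact networks of any embodiment.

    import torch
    import torch.nn as nn

    class ConventionalTTS(nn.Module):
        """Illustrative skeleton of the FIG.1 pipeline: encoder, attention, decoder, then a vocoder."""
        def __init__(self, num_phones=100, hidden=256, n_mels=80):
            super().__init__()
            self.embed = nn.Embedding(num_phones, hidden)            # phone ids -> embeddings
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)  # encoder state sequence
            self.attention = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.to_mel = nn.Linear(hidden, n_mels)                  # acoustic features (mel-spectrum)

        def forward(self, phone_ids, num_frames=200):
            states, _ = self.encoder(self.embed(phone_ids))
            # One query per output frame; a real model decodes step by step with a stop criterion.
            queries = torch.zeros(phone_ids.size(0), num_frames, states.size(-1))
            aligned, _ = self.attention(queries, states, states)     # align decoding steps with encoder states
            decoded, _ = self.decoder(aligned)
            return self.to_mel(decoded)                              # a vocoder then turns this into a waveform

    # mel = ConventionalTTS()(torch.randint(0, 100, (1, 30)))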
[0039] FIG.2 illustrates an exemplary process 200 for automatic audio content generation according to an embodiment. The process 200 may be applied to a scenario of single-narrator audio content generation.
[0040] Text content 210 is a processed object of automatic audio content generation according to an embodiment, e.g., a textual story book, and it is intended to generate audio content, e.g., an audiobook, through performing the process 200 on a plurality of texts included in the text content 210 respectively. Assume that a text 212 is currently extracted from the text content 210, and it is intended to generate a speech waveform corresponding to the text 212 through performing the process 200.
[0041] At 220, a context 222 corresponding to the text 212 may be constructed. In an implementation, the context 222 may comprise one or more texts adjacent to the text 212 in the text content 210. For example, the context 222 may comprise at least one sentence before the text 212 and/or at least one sentence after the text 212. Therefore, the context 222 is actually a sentence sequence corresponding to the text 212. Optionally, the context 222 may also include more text or all text in the text content 210.
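As a non-limiting illustration of the context construction at 220, the following Python sketch returns a sentence window around the current text; the helper name construct_context and the window sizes are assumptions for illustration only and are not defined by the present disclosure.

    def construct_context(sentences, index, before=2, after=2):
        """Return the sentence sequence around the current text: a few sentences before and after it."""
        start = max(0, index - before)
        end = min(len(sentences), index + after + 1)
        return sentences[start:end]

    # Example: context = construct_context(book_sentences, i) yields the current sentence and its neighbors.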
[0042] At 230, reference factors 232 may be determined based at least on the context 222. The reference factors 232 may influence characteristics of synthesized speech in subsequent TTS speech synthesis. The reference factors 232 may comprise a character category corresponding to the text 212, which indicates attributes such as age, gender, etc. of the character corresponding to the text 212. For example, if the text 212 is an utterance spoken by a young male, it may be determined that the character category corresponding to the text 212 is <youth>, <male>, etc. Generally, different character categories may correspond to different speech characteristics. Optionally, the reference factors 232 may also include a style corresponding to the text 212, which indicates, e.g., in which emotion type the text 212 is spoken. For example, if the text 212 is an utterance spoken by a character with angry emotion, it may be determined that the style corresponding to the text 212 is <angry>. Generally, different styles may correspond to different speech characteristics. The character category and the style may individually or jointly influence the characteristics of the synthesized speech. In an implementation, the character category and the style, etc., may be predicted at 230 based on the context 222 through a previously-trained prediction model.
[0043] According to the process 200, the TTS model 240 previously trained for a target speaker may be employed to generate the speech waveform. The target speaker may be a narrator previously automatically determined or a narrator designated by the user. The TTS model 240 may synthesize speech with a voice of the target speaker. In an implementation, the TTS model 240 may be a linguistic feature-based TTS model, and the linguistic feature-based TTS model 240 here may consider at least reference factors to synthesize speech, which is different from a conventional linguistic feature-based TTS model. The linguistic feature-based TTS model 240 may generate a speech waveform 250 corresponding to the text 212 based at least on the text 212 and a character category in the case that the reference factors 232 includes the character category, or may generate a speech waveform 250 corresponding to the text 212 based at least on the text 212, a character category, and a style in the case that the reference factors 232 includes both the character category and the style. In an implementation, the TTS model 240 may be a context-based TTS model, and the context-based TTS model 240 here may consider at least reference factors to synthesize speech, which is different from a conventional context-based TTS model. The context-based TTS model 240 may generate a speech waveform 250 corresponding to the text 212 based at least on the text 212, the context 222 and a character category in the case that the reference factors 232 includes the character category, or may generate a speech waveform 250 corresponding to the text 212 based at least on the text 212, the context 222, a character category, and a style in the case that the reference factors 232 includes both the character category and the style.
[0044] In a similar manner, a plurality of speech waveforms corresponding to a plurality of texts included in the text content 210 may be generated through the process 200. All these speech waveforms may together form audio content corresponding to the text content 210. The audio content may comprise speech of different character categories and/or different styles synthesized with the voice of the target speaker.
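Putting the steps of the process 200 together, single-narrator generation may be sketched as the loop below. It should be appreciated that predict_reference_factors and tts_model.synthesize are assumed interfaces used only for illustration, and construct_context refers to the sentence-window sketch given earlier.

    def generate_single_narrator_audio(sentences, predict_reference_factors, tts_model):
        """Synthesize every sentence with one target-speaker TTS model and collect the waveforms."""
        waveforms = []
        for i, text in enumerate(sentences):
            context = construct_context(sentences, i)      # the sentence-window sketch given earlier
            factors = predict_reference_factors(context)   # e.g. {"character_category": ..., "style": ...}
            waveforms.append(tts_model.synthesize(text, context=context, **factors))
        return waveforms                                    # ordered waveforms forming the audio content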
[0045] FIG.3 illustrates an exemplary process 300 for automatic audio content generation according to an embodiment. The process 300 may be applied to a scenario of multi-narrator audio content generation.
[0046] Assume that a current text 312 is extracted from text content 310, and it is intended to generate a speech waveform corresponding to the text 312 through performing the process 300.
[0047] At 320, a context 322 corresponding to the text 312 may be constructed. In an implementation, the context 322 may comprise one or more texts adjacent to the text 312 in the text content 310. Optionally, the context 322 may also include more text or all text in the text content 310.
[0048] At 330, reference factors 332 may be determined based at least on the context 322, which are used to influence characteristics of synthesized speech in subsequent TTS speech synthesis. The reference factors 332 may comprise a character category corresponding to the text 312. The reference factors 332 may also comprise a character personality corresponding to the text 312, which indicates the personality of the character that speaks the text 312. For example, if the text 312 is an utterance spoken by an evil old witch, it may be determined that the character personality corresponding to the text 312 is <evil>. Generally, different character personalities may correspond to different speech characteristics. The reference factors 332 may also comprise a character corresponding to the text 312, which indicates which character in the text content 310 speaks the text 312. Generally, different characters may use different voices. Optionally, the reference factors 332 may also comprise a style corresponding to the text 312. In an implementation, at 330, different reference factors may be predicted based on the context 322 through different previously-trained prediction models. These prediction models may comprise, e.g., a prediction model for predicting a character category and a style, a prediction model for predicting a character personality, a prediction model for predicting a character, etc.
[0049] According to the process 300, at 340, a TTS model to be used may be selected from a candidate TTS model library 350 which is previously prepared. The candidate TTS model library 350 may comprise a plurality of candidate TTS models previously-trained for different candidate speakers. Each candidate speaker may have attributes in at least one of character category, character personality, character, etc. For example, attributes of candidate speaker 1 may comprise <old-aged>, <female>, <evil> and <witch>, attributes of candidate speaker 2 may comprise <middle-aged>, <male> and <cheerful>, etc. A candidate speaker corresponding to the text 312 may be determined with at least one of the character category, the character personality, and the character in the reference factors 332, and a TTS model corresponding to the determined candidate speaker may be selected accordingly.
[0050] Assume that a TTS model 360 is selected from the candidate TTS model library 350 at 340 to be used to generate a speech waveform for the text 312. The TTS model 360 may synthesize speech with a voice of a speaker corresponding to the model. In an implementation, the TTS model 360 may be a linguistic feature-based TTS model, which may generate a speech waveform 370 corresponding to the text 312 based at least on the text 312. In the case that the reference factors 332 include a style, the speech waveform 370 may be generated further based on the style through the linguistic feature- based TTS model 360. In an implementation, the TTS model 360 may be a context-based TTS model, which may generate a speech waveform 370 corresponding to the text 312 based at least on the text 312 and the context 322. In the case that the reference factors 332 include a style, the speech waveform 370 may be generated further based on the style through the context-based TTS model 360.
[0051] In a similar manner, a plurality of speech waveforms corresponding to a plurality of texts included in the text content 310 may be generated through the process 300. All these speech waveforms may together form audio content corresponding to the text content 310. The audio content may comprise speech synthesized with voices of different speakers automatically assigned to different characters, and optionally, the speech may have different styles.
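By analogy, the multi-narrator process 300 may be sketched as follows. The attribute-matching rule shown is only one simple possible selection scheme, and the attributes field on each candidate model, predict_factors and synthesize are assumed interfaces for illustration.

    def generate_multi_narrator_audio(sentences, predict_factors, candidate_tts_models):
        """Synthesize each sentence with the candidate speaker model that best matches its character."""
        waveforms = []
        for i, text in enumerate(sentences):
            context = construct_context(sentences, i)
            factors = predict_factors(context)   # e.g. character, character category, personality, style
            # Pick the candidate model whose speaker attributes best match the predicted factors.
            tts_model = max(candidate_tts_models,
                            key=lambda model: len(model.attributes & set(factors.values())))
            waveforms.append(tts_model.synthesize(text, context=context, style=factors.get("style")))
        return waveforms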
[0052] FIG.4 illustrates an exemplary process 400 for preparing training data according to an embodiment.
[0053] A plurality of sets of matching audio content 402 and text content 404 may be previously obtained, e.g., audiobooks and corresponding text story books.
[0054] At 410, automatic segmentation may be performed on the audio content 402. For example, the audio content 402 may be automatically segmented into a plurality of audio segments, and each audio segment may correspond to one or more speech utterances. The automatic segmentation at 410 may be performed through any known audio segmentation technique.
[0055] At 420, post-processing may be performed on the plurality of segmented audio segments with the text content 404. In an aspect, post-processing at 420 may comprise utterance completion re-segmentation with the text content 404. For example, the text content 404 may be easily segmented into a plurality of text sentences through any known text segmentation technique, and then an audio segment corresponding to each text sentence may be determined with reference to the text sentence. For each text sentence, one or more audio segments obtained at 410 may be segmented or combined to match the text sentence. Accordingly, an audio segment and a text sentence may be aligned in time, and a plurality of pairs of <text sentence, audio segment> may be formed. In another aspect, post-processing at 420 may comprise classifying audio segments into narration and conversation. For example, through performing classification processing for identifying narration and conversation on a text sentence, an audio segment corresponding to the text sentence may be classified into narration and conversation.
[0056] At 430, a label may be added to a <text sentence, audio segment> pair involving a conversation. The label may comprise, e.g., character category, style, etc. In one case, a character category, style, etc. of each <text sentence, audio segment> pair may be determined by means of automatic clustering. In another case, a character category, style, etc. of each <text sentence, audio segment> pair may be labeled by humans.
[0057] Through the process 400, a set of labeled training data 406 may be finally obtained. Each piece of training data may have a form of, e.g., <text sentence, audio segment, character category, style>. The training data 406 may be further applied to train a prediction model for predicting a character category and a style corresponding to a text, a TTS model for generating a speech waveform, etc.
[0058] It should be appreciated that the above process 400 is only exemplary, and depending on the specific application scenario and design, any other form of training data may also be prepared in a similar manner. For example, the label added at 430 may also include character, character personality, etc.
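The preparation pipeline of FIG.4 may be summarized by the sketch below. Every helper passed in (segment_audio, split_sentences, align, classify, label) is a placeholder for whichever concrete segmentation, alignment, classification or labeling tool an implementation chooses; none of these names is defined by the present disclosure.

    def prepare_training_data(audio, text, segment_audio, split_sentences, align, classify, label):
        """Produce <text sentence, audio segment, character category, style> tuples, as in process 400."""
        segments = segment_audio(audio)                     # automatic segmentation at 410
        data = []
        for sentence in split_sentences(text):
            segment = align(sentence, segments)             # utterance-completion re-segmentation at 420
            if classify(sentence) == "conversation":        # narration vs. conversation classification
                category, style = label(sentence, segment)  # automatic clustering or human labeling at 430
            else:
                category, style = "narration", None
            data.append((sentence, segment, category, style))
        return data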
[0059] FIG.5 illustrates an exemplary process 500 for predicting a character category and a style according to an embodiment. The process 500 may be performed by a prediction model for predicting a character category and a style. The prediction model may automatically assign a character category and a style to a text.
[0060] For the text 502, a context corresponding to the text 502 may be constructed at 510. The processing at 510 may be similar to the processing at 220 in FIG.2.
[0061] The constructed context may be provided to a previously-trained language model 520. The language model 520 is used to model and represent text information, and it may be trained to generate a latent space representation for input text, e.g., an embedding representation. The language model 520 may be based on any appropriate technology, e.g., Bidirectional Encoder Representations from Transformers (BERT), etc.
[0062] The embedding representation output by the language model 520 may be sequentially provided to a projection layer 530 and a softmax layer 540. The projection layer 530 may convert the embedding representation into a projected representation, and the softmax layer 540 may calculate probabilities of different character categories and probabilities of different styles based on the projected representation, thereby finally determining a character category 504 and a style 506 corresponding to the text 502.
[0063] A prediction model for performing the process 500 may be trained with the training data obtained through the process 400 of FIG.4. For example, training data in the form of <text, character category, style> may be used to train the prediction model. In a stage of applying the trained prediction model, the prediction model may predict a character category and a style corresponding to an input text based on the input text.
[0064] Although the above described in connection with the process 500 that the prediction model may jointly predict the character category and the style, it should be appreciated that the character category and the style may be separately predicted by employing separate prediction models through a similar process.
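One plausible, non-limiting realization of the prediction model of FIG.5 is a shared projection with two softmax heads over a context embedding produced by a language model such as BERT. The dimensions below are arbitrary assumptions, and the input is assumed to already be a fixed-size embedding of the context.

    import torch
    import torch.nn as nn

    class CategoryStylePredictor(nn.Module):
        """Projection and softmax heads over a context embedding, as in FIG.5 (illustrative sizes)."""
        def __init__(self, embed_dim=768, num_categories=8, num_styles=6, hidden=256):
            super().__init__()
            self.projection = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU())
            self.category_head = nn.Linear(hidden, num_categories)
            self.style_head = nn.Linear(hidden, num_styles)

        def forward(self, context_embedding):
            h = self.projection(context_embedding)
            category_probs = torch.softmax(self.category_head(h), dim=-1)
            style_probs = torch.softmax(self.style_head(h), dim=-1)
            return category_probs, style_probs

    # category_probs, style_probs = CategoryStylePredictor()(language_model_embedding)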
[0065] FIG.6 illustrates an exemplary implementation 600 of speech synthesis employing a linguistic feature-based TTS model according to an embodiment. The implementation 600 may be applied to a scenario of single-narrator audio content generation, and it may be regarded as an exemplary specific implementation of the process 200 in FIG.2. Assume that it is desired to generate a speech waveform for a text 602 in FIG.6.
[0066] A style 612 and a character category 614 corresponding to the text 602 may be predicted through a prediction model 610. The prediction model 610 may perform prediction based on, e.g., the process 500 of FIG.5. A character category embedding representation 616 corresponding to the character category 614 may be obtained through, e.g., a character category embedding look-up table (LUT). At 618, the style 612 and the character category embedding representation 616 may be cascaded to obtain a cascaded representation. At 620, a corresponding latent representation may be generated based on the cascaded representation. The latent representation generation at 620 may be performed in various ways, e.g., Gaussian Mixture Variational Auto Encoders (GMVAE), Vector Quantization VAE (VQ-VAE), VAE, Global Style Representation (GST), etc. Taking GMVAE as an example, it may learn a distribution of latent variables, and in a model application stage, it may sample a latent representation from a posterior probability, or directly use a prior mean as a latent representation.
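The cascading at 618 and the latent representation generation at 620 may be sketched as follows, with the GMVAE simplified to a single diagonal-Gaussian posterior for brevity; the dimensions and the module name are illustrative assumptions only.

    import torch
    import torch.nn as nn

    class LatentFromCategoryAndStyle(nn.Module):
        """Cascade a style vector with a character-category embedding and sample a Gaussian latent."""
        def __init__(self, style_dim=16, num_categories=8, category_dim=16, latent_dim=32):
            super().__init__()
            self.category_embedding = nn.Embedding(num_categories, category_dim)  # category embedding LUT
            self.to_mean = nn.Linear(style_dim + category_dim, latent_dim)
            self.to_logvar = nn.Linear(style_dim + category_dim, latent_dim)

        def forward(self, style_vector, category_id, sample=True):
            cascaded = torch.cat([style_vector, self.category_embedding(category_id)], dim=-1)
            mean, logvar = self.to_mean(cascaded), self.to_logvar(cascaded)
            if not sample:                     # at inference the mean may be used directly
                return mean
            return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)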
[0067] At 630, front-end analysis may be performed on the text 602 to extract a phone feature 632 and a prosody feature 634. Any known TTS front-end analysis technique may be used to perform the front-end analysis at 630. The phone feature 632 may refer to a phone sequence extracted from the text 602. The prosody feature 634 may refer to prosody information corresponding to the text 602, e.g., break, accent, intonation, rate, etc.
[0068] The phone feature 632 and the prosody feature 634 may be encoded with an encoder 640. The encoder 640 may be based on any architecture. As an example, an instance of the encoder 640 is presented in FIG.7. FIG.7 illustrates an exemplary implementation of an encoder 710 in a linguistic feature-based TTS model according to an embodiment. The encoder 710 may correspond to the encoder 640 in FIG.6. The encoder 710 may encode a phone feature 702 and a prosody feature 704, wherein the phone feature 702 and the prosody feature 704 may correspond to the phone feature 632 and the prosody feature 634 in FIG.6, respectively. Feature extraction may be performed on the phone feature 702 and the prosody feature 704 through the processing of a 1-D convolution filter 712, a max pooling layer 714, a 1-D convolution projection 716, etc., respectively. At 718, the output of the 1-D convolution projection 716 may be added with the phone feature 702 and the prosody feature 704. The added output at 718 may then be processed through a highway network layer 722 and a Bidirectional Long Short Term Memory (BLSTM) layer 724 to obtain an encoder output 712. It should be appreciated that the architecture and all components in FIG.7 are exemplary, and the encoder 710 may also have any other implementation depending on specific requirements and designs.
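A possible, non-limiting rendering of the encoder of FIG.7 in PyTorch is shown below. For brevity the phone and prosody streams are summed before the convolutional stack rather than processed separately, and the highway layer uses a common single-gate formulation, so this is a simplified sketch rather than the exact structure of FIG.7; all sizes are assumptions.

    import torch
    import torch.nn as nn

    class Highway(nn.Module):
        """Single-gate highway layer."""
        def __init__(self, dim):
            super().__init__()
            self.transform = nn.Linear(dim, dim)
            self.gate = nn.Linear(dim, dim)
        def forward(self, x):
            g = torch.sigmoid(self.gate(x))
            return g * torch.relu(self.transform(x)) + (1 - g) * x

    class LinguisticEncoder(nn.Module):
        """Conv filter, max pooling, conv projection, residual add, highway and BLSTM, as in FIG.7."""
        def __init__(self, dim=256):
            super().__init__()
            self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)      # 1-D convolution filter
            self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)   # max pooling layer
            self.proj = nn.Conv1d(dim, dim, kernel_size=3, padding=1)      # 1-D convolution projection
            self.highway = Highway(dim)
            self.blstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

        def forward(self, phone_feature, prosody_feature):
            x = phone_feature + prosody_feature                            # (batch, time, dim)
            y = self.proj(self.pool(torch.relu(self.conv(x.transpose(1, 2)))))
            y = y.transpose(1, 2)[:, : x.size(1), :] + x                   # add back the input features
            out, _ = self.blstm(self.highway(y))
            return out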
[0069] According to the process 600, at 642, the output of the encoder 640 and the latent representation obtained at 620 may be added.
[0070] The linguistic feature-based TTS model in FIG.6 may be used for single-narrator audio content generation. Therefore, the TTS model may be trained for a target speaker 604. A speaker embedding representation 606 corresponding to the target speaker 604 may be obtained through, e.g., a speaker embedding LUT, and the speaker embedding representation 606 may be used to influence the TTS model to perform speech synthesis with a voice of the target speaker 604. At 644, the added output at 642 may be cascaded with the speaker embedding representation 606, and the cascaded output may be provided to an attention module 650.
[0071] The decoder 660 may generate acoustic features under the influence of the attention module 650, e.g., mel-spectrum features, etc. A vocoder 670 may generate a speech waveform 608 corresponding to the text 602 based on the acoustic features.
[0072] The implementation 600 in FIG.6 intends to illustrate an exemplary architecture for speech synthesis employing a linguistic feature-based TTS model. At least the training data obtained through the process 400 of FIG.4 may be used for model training. For example, through the process 400, text and speech waveform pairs, as well as labels such as the corresponding character category, style, etc., may be obtained. It should be appreciated that, optionally, in the actual application stage, since the TTS model has been trained to synthesize speech based on the voice of the target speaker, the input of the speaker embedding representation may be omitted. In addition, it should be appreciated that any component and processing in the implementation 600 are exemplary, and any form of change may be made to the implementation 600 depending on specific requirements and designs.
[0073] FIG.8 illustrates an exemplary implementation 800 of speech synthesis employing a context-based TTS model according to an embodiment. The implementation 800 may be applied to a scenario of single-narrator audio content generation, and it may be regarded as an exemplary specific implementation of the process 200 in FIG.2. The implementation 800 is similar to the implementation 600 in FIG.6, except that a TTS model uses context encoding. Assume that it is desired to generate a speech waveform for a text 802 in FIG.8.
[0074] A style 812 and a character category 814 corresponding to the text 802 may be predicted through a prediction model 810. The prediction model 810 may be similar to the prediction model 610 in FIG.6. A character category embedding representation 816 corresponding to the character category 814 may be obtained, and at 818, the style 812 and the character category embedding representation 816 may be cascaded to obtain a cascaded representation. At 820, a corresponding latent representation may be generated based on the cascaded representation. The latent representation generation at 820 may be similar to the latent representation generation at 620 in FIG.6.
[0075] A phone feature 832 may be extracted through performing front-end analysis (not shown) on the text 802. The phone feature 832 may be encoded with a phone encoder 830. The phone encoder 830 may be similar to the encoder 640 in FIG.6, except that only the phone feature is used as input.
[0076] Context information 842 may be extracted from the text 802, and the context information 842 may be encoded with a context encoder 840. The context information 842 may correspond to, e.g., the context 222 in FIG.2, or various information suitable for the context encoder 840 further extracted from the context 222. The context encoder 840 may be any known context encoder that may be used in a TTS model. As an example, an instance of the context encoder 840 is presented in FIG.9. FIG.9 illustrates an exemplary implementation of a context encoder 900 in a context-based TTS model. The context encoder 900 may correspond to the context encoder 840 in FIG.8. The context encoder 900 may perform encoding on the context information 902, which may correspond to the context information 842 in FIG.8. The context encoder 900 may comprise a word encoder 910, which is used to perform encoding on a current text, such as the text 802, to obtain a current semantic feature. The word encoder 910 may comprise, e.g., an embedding layer, an up-sampling layer, an encoding layer, etc., wherein the embedding layer is used to generate a word embedding sequence for a word sequence in the current text, the up- sampling layer is used to up-sample the word embedding sequence to align with a phone sequence of the current text, and the encoding layer is used to encode the up-sampled word embedding sequence into the current semantic feature through, e.g., a convolutional layer, a BLSTM layer, etc. A history text, a future text, a paragraph text, etc. may be extracted from the context information 902. The history text may comprise one or more sentences before the current text, the future text may comprise one or more sentences after the current text, and the paragraph text may comprise all sentences in a paragraph where the current text is located. The context encoder may comprise a history and future encoder 920 for performing encoding on the history text, the future text, and the paragraph text to obtain a history semantic feature, a future semantic feature, and a paragraph semantic feature, respectively. The history and future encoder 920 may comprise, e.g., an embedding layer, an up-sampling layer, a dense layer, an encoding layer, etc., wherein the embedding layer is used to generate a word embedding sequence for a word sequence in an input text, the up-sampling layer is used to up-sample the word embedding sequence to align with a phone sequence of a current text, the dense layer is used to generate a compressed representation for the up-sampled word embedding sequence, and the encoding layer is used to encode the compressed word embedding sequence into the semantic feature corresponding to the input text. It should be appreciated that although a single history and future encoder is shown in FIG.9, separate encoders may also be provided for the history text, the future text, and the paragraph text to independently generate respective semantic feature. The current semantic feature, the history semantic feature, the future semantic feature, the paragraph semantic feature, etc. may be added at 930 to output a context feature 904.
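The context encoder of FIG.9 may be approximated by the sketch below, which, for compactness, shares one embedding layer and one BLSTM across the current, history, future and paragraph texts instead of using separate word and history-and-future encoders; the up-sampling is done by nearest-neighbor interpolation to the phone length, and all sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ContextEncoder(nn.Module):
        """Sum of current, history, future and paragraph semantic features, loosely following FIG.9."""
        def __init__(self, vocab_size=30000, word_dim=128, out_dim=256):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, word_dim)
            self.dense = nn.Linear(word_dim, word_dim)     # compression for history/future/paragraph texts
            self.encode = nn.LSTM(word_dim, out_dim // 2, batch_first=True, bidirectional=True)

        def _semantic(self, word_ids, phone_len, compress=False):
            x = self.embedding(word_ids)
            # Resample word embeddings along time so they align with the phone sequence length.
            x = nn.functional.interpolate(x.transpose(1, 2), size=phone_len).transpose(1, 2)
            if compress:
                x = torch.relu(self.dense(x))
            out, _ = self.encode(x)
            return out

        def forward(self, current, history, future, paragraph, phone_len):
            return (self._semantic(current, phone_len)
                    + self._semantic(history, phone_len, compress=True)
                    + self._semantic(future, phone_len, compress=True)
                    + self._semantic(paragraph, phone_len, compress=True))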
[0077] According to the process 800, at 852, the output of the phone encoder 830, the output of the context encoder 840, and the latent representation obtained at 820 may be added. In addition, the output of the context encoder 840 may also be provided to an attention module 844.
[0078] The context-based TTS model in FIG.8 may be used for single-narrator audio content generation. Therefore, the TTS model may be trained for a target speaker 804. A speaker embedding representation 806 corresponding to the target speaker 804 may be obtained, which is used to influence the TTS model to perform speech synthesis with a voice of the target speaker 804. At 854, the added output at 852 may be cascaded with the speaker embedding representation 806, and the cascaded output may be provided to the attention module 860.
[0079] At 870, the output of the attention module 844 and the output of the attention module 860 may be cascaded to influence generation of acoustic features at a decoder 880. A vocoder 890 may generate a speech waveform 808 corresponding to the text 802 based on the acoustic features.
[0080] At least the training data obtained through the process 400 of FIG.4 may be used for model training in FIG.8. It should be appreciated that, optionally, in the actual application stage, since the TTS model has been trained to synthesize speech based on the voice of the target speaker, the input of the speaker embedding representation may be omitted. In addition, it should be appreciated that any component and processing in the implementation 800 are exemplary, and any form of change may be made to the implementation 800 depending on specific requirements and designs.
[0081] FIG.10 illustrates an exemplary process 1000 for predicting a character and selecting a TTS model according to an embodiment. The process 1000 may be performed in a scenario of multi-narrator audio content generation, and it is an exemplary implementation of at least a part of the process 300 in FIG.3. The process 1000 may be used to determine a specific character corresponding to a text, and to select a TTS model trained based on a voice of a speaker corresponding to the character.
[0082] A text 1002 is from text content 1004. At 1010, a context corresponding to the text 1002 may be constructed. At 1020, an embedding representation of the context may be generated through, e.g., a previously-trained language model.
[0083] At 1030, a plurality of candidate characters may be extracted from the text content 1004. Assuming that the text content 1004 is a text story book, all characters involved in the text story book may be extracted at 1030 to form a list of candidate characters. The candidate character extraction at 1030 may be performed through any known technique.
[0084] At 1040, context-based candidate feature extraction may be performed. For example, for the current text 1002, one or more candidate features may be extracted from the context for each candidate character. Assuming that there are N candidate characters in total, N candidate feature vectors may be obtained at 1040, wherein each candidate feature vector includes candidate features extracted for one candidate character. Various types of features may be extracted at 1040. In an implementation, the extracted features may comprise the number of words spaced between the current text and the candidate character. Since the character's name usually appears near the character's utterance, this feature facilitates determining whether the current text is spoken by a certain candidate character. In an implementation, the extracted features may comprise the number of occurrences of a candidate character in the context. This feature may reflect a relative importance of a specific candidate character in the text content. In an implementation, the extracted features may comprise a binary feature indicating whether a name of a candidate character appears in the current text. Generally, a character is unlikely to mention his/her name in an utterance he/she speaks. In an implementation, the extracted features may comprise a binary feature indicating whether a name of a candidate character appears in the closest previous text or the closest subsequent text. Since an alternating speaker mode is often used in a conversation between two characters, a character that speaks a current text is likely to appear in the closest previous text or the closest subsequent text. It should be appreciated that the extracted features may also include any other features that facilitate determining, e.g., the character corresponding to the text 1002.
[0085] At 1050, the context embedding representation generated at 1020 may be combined with all candidate feature vectors extracted at 1040 to form a candidate feature matrix corresponding to all candidate characters.
[0086] According to the process 1000, a character 1062 corresponding to the text 1002 may be determined from a plurality of candidate characters based at least on the context through a learning-to-rank (LTR) model 1060. For example, the LTR model 1060 may rank the multiple candidate characters based on the candidate feature matrix obtained from the context, and determine the highest-ranked candidate character as the character 1062 corresponding to the text 1002. The LTR model 1060 may be constructed using various technologies, e.g., a ranking support vector machine (SVM), RankNet, ordinal classification, etc. It should be appreciated that herein, the LTR model 1060 may be regarded as a prediction model for predicting characters based on a context, or more broadly, the combination of the LTR model 1060 and the steps 1010, 1020, 1030, 1040, and 1050 may be regarded as a prediction model for predicting characters based on a context.
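A minimal sketch of this ranking step is given below, assuming the candidate feature matrix has already been formed at 1050 and that a weight vector has been learned offline (e.g., by a ranking SVM); the linear scoring is an illustrative assumption, not the disclosed LTR model itself.

```python
# Minimal sketch of the LTR step 1060: score each candidate's row of the
# candidate feature matrix (context embedding concatenated with candidate
# features) and pick the top-ranked candidate. The weights of a real ranking
# SVM / RankNet would come from training; here they are placeholders.
import numpy as np

def rank_candidates(candidate_matrix: np.ndarray, candidates: list[str],
                    weights: np.ndarray) -> str:
    scores = candidate_matrix @ weights    # one score per candidate character
    order = np.argsort(-scores)            # descending by score
    return candidates[order[0]]            # highest-ranked candidate character
```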
[0087] According to the process 1000, optionally, a character personality 1072 of the character corresponding to the text 1002 may be predicted based at least on the context through a personality prediction model 1070. For example, the personality prediction model 1070 may predict the character personality 1072 based on the candidate feature matrix obtained from the context. The personality prediction model 1070 may be constructed based on a process similar to the process 500 of FIG.5, except that it is trained for a character personality classification task with text and character personality training data pairs.
[0088] According to the process 1000, at 1080, a TTS model 1090 to be used may be selected from a previously-prepared candidate TTS model library 1082. The candidate TTS model library 1082 may comprise a plurality of candidate TTS models previously trained for different candidate speakers. Each candidate speaker may have attributes in at least one of character category, character personality, character, etc. At 1080, a candidate speaker corresponding to the text 1002 may be determined based on at least one of the character 1062, the character personality 1072, and the character category 1006, and a TTS model 1090 corresponding to the determined candidate speaker may be selected accordingly. The character category 1006 may be determined through, e.g., the process 500 of FIG.5.
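A possible sketch of the selection at 1080 is shown below, assuming a simple attribute-matching score over the candidate TTS model library; the relative weighting of character, personality, and category is an assumption made only for illustration.

```python
# Sketch of step 1080: pick the candidate speaker whose attributes best match
# the predicted character, personality and category, and return that speaker's
# previously-trained TTS model. The attribute weighting is an assumption.
from dataclasses import dataclass

@dataclass
class CandidateSpeaker:
    name: str
    character: str | None = None
    personality: str | None = None
    category: str | None = None
    tts_model: object = None     # handle to a previously-trained TTS model

def select_tts_model(library: list[CandidateSpeaker],
                     character=None, personality=None, category=None):
    def match(s: CandidateSpeaker) -> int:
        return (3 * (character is not None and s.character == character)
                + 2 * (personality is not None and s.personality == personality)
                + 1 * (category is not None and s.category == category))
    best = max(library, key=match)
    return best.tts_model
```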
[0089] It should be appreciated that any step and processing in the process 1000 are exemplary, and any form of change may be made to the process 1000 depending on specific requirements and designs.
[0090] FIG.11 illustrates an exemplary implementation 1100 of speech synthesis employing a linguistic feature-based TTS model according to an embodiment. The implementation 1100 may be applied to a scenario of multi-narrator audio content generation, and it may be regarded as an exemplary specific implementation of the process 300 in FIG.3. Assume that it is desired to generate a speech waveform for a text 1102 in FIG.11.
[0091] A character personality 1112 corresponding to the text 1102 may be predicted through a prediction model 1110. The prediction model 1110 may correspond to, e.g., the personality prediction model 1070 in FIG.10. A character 1122 corresponding to the text 1102 may be predicted through a prediction model 1120. The prediction model 1120 may correspond to, e.g., the prediction model for predicting a character described above in conjunction with FIG.10. A character category 1132 and a style 1134 corresponding to the text 1102 may be predicted through a prediction model 1130. The prediction model 1130 may be constructed based on, e.g., the process 500 of FIG.5. At 1136, a latent representation corresponding to the style 1134 may be generated. At 1140, front-end analysis may be performed on the text 1102 to extract a phone feature 1142 and a prosody feature 1144. The phone feature 1142 and the prosody feature 1144 may be encoded with an encoder 1150. At 1152, the output of the encoder 1150 and the latent representation obtained at 1136 may be added.
[0092] The linguistic feature-based TTS model in FIG.11 may be used for multi-narrator audio content generation. A candidate speaker 1104 corresponding to the character 1122 may be determined based on at least one of the character 1122, the character personality 1112, and the character category 1132 in a manner similar to the process 1000 of FIG.10. The TTS model may be trained for the candidate speaker 1104. A speaker embedding representation 1106 corresponding to the candidate speaker 1104 may be obtained through, e.g., a speaker embedding LUT, and the speaker embedding representation 1106 may be used to influence the TTS model to perform speech synthesis with a voice of the candidate speaker 1104. At 1154, the added output at 1152 may be cascaded with the speaker embedding representation 1106, and the cascaded output may be provided to an attention module 1160.
[0093] A decoder 1170 may generate acoustic features under the influence of the attention module 1160. A vocoder 1180 may generate a speech waveform 1108 corresponding to the text 1102 based on the acoustic features.
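For illustration only, the following PyTorch sketch mirrors the dataflow of FIG.11 at a very high level: encode the linguistic features, add the style latent, cascade the speaker embedding, and decode to acoustic features. The module internals, dimensions, and the omission of the attention module 1160 and vocoder 1180 are simplifying assumptions, not the disclosed implementation.

```python
# Highly simplified sketch of the FIG.11 wiring. All module internals and
# dimensions are placeholder assumptions; only the add-then-cascade dataflow
# follows the description above.
import torch
import torch.nn as nn

class LinguisticFeatureTTS(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, spk_dim=64, mel_dim=80):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)   # encoder 1150
        self.style_proj = nn.Linear(hidden, hidden)                  # style latent 1136
        self.decoder = nn.GRU(hidden + spk_dim, hidden, batch_first=True)
        self.mel_out = nn.Linear(hidden, mel_dim)

    def forward(self, phone_prosody_feats, style_latent, speaker_embedding):
        enc, _ = self.encoder(phone_prosody_feats)                   # [B, T, hidden]
        enc = enc + self.style_proj(style_latent).unsqueeze(1)       # add at 1152
        spk = speaker_embedding.unsqueeze(1).expand(-1, enc.size(1), -1)
        dec_in = torch.cat([enc, spk], dim=-1)                       # cascade at 1154
        dec, _ = self.decoder(dec_in)                                # attention omitted
        return self.mel_out(dec)                                     # acoustic features for a vocoder
```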
[0094] The implementation 1100 in FIG.11 is intended to illustrate an exemplary architecture for speech synthesis employing a linguistic feature-based TTS model. Through constructing corresponding TTS models for different candidate speakers, a plurality of candidate TTS models may be obtained. In the actual application stage, a candidate speaker corresponding to a character may be determined based on at least one of the character, a character personality, and a character category, and a TTS model trained for the candidate speaker may be further selected to generate a speech waveform. In addition, it should be appreciated that any component and processing in the implementation 1100 are exemplary, and any form of change may be made to the implementation 1100 depending on specific requirements and designs.
[0095] FIG.12 illustrates an exemplary implementation 1200 of speech synthesis employing a context-based TTS model according to an embodiment. The implementation 1200 may be applied to a scenario of multi-narrator audio content generation, and it may be regarded as an exemplary specific implementation of the process 300 in FIG.3. The implementation 1200 is similar to the implementation 1100 in FIG.11, except that the TTS model uses context encoding. Assume that it is desired to generate a speech waveform for a text 1202 in FIG.12.
[0096] A character personality 1212 corresponding to the text 1202 may be predicted through a prediction model 1210. The prediction model 1210 may be similar to the prediction model 1110 in FIG.11. A character 1222 corresponding to the text 1202 may be predicted through a prediction model 1220. The prediction model 1220 may be similar to the prediction model 1120 in FIG.11. A character category 1232 and a style 1234 corresponding to the text 1202 may be predicted through a prediction model 1230. The prediction model 1230 may be similar to the prediction model 1130 in FIG.11. At 1236, a latent representation corresponding to the style 1234 may be generated.
[0097] A phone feature 1242 may be extracted through performing front-end analysis (not shown) on the text 1202. The phone feature 1242 may be encoded with a phone encoder 1240. The phone encoder 1240 may be similar to the phone encoder in FIG.8. Context information 1252 may be extracted from the text 1202, and the context information 1252 may be encoded with a context encoder 1250. The context encoder 1250 may be similar to the context encoder 840 in FIG.8.
[0098] At 1262, the output of the phone encoder 1240, the output of the context encoder 1250, and the latent representation obtained at 1236 may be added. In addition, the output of the context encoder 1250 may also be provided to an attention module 1254.
[0099] The context-based TTS model in FIG.12 may be used for multi-narrator audio content generation. A candidate speaker 1204 corresponding to a character 1222 may be determined based on at least one of the character 1222, a character personality 1212, and a character category 1232 in a manner similar to the process 1000 of FIG.10. The TTS model may be trained for the candidate speaker 1204. A speaker embedding representation 1206 corresponding to the candidate speaker 1204 may be used to influence the TTS model to perform speech synthesis with a voice of the candidate speaker 1204. At 1264, the added output at 1262 may be cascaded with the speaker embedding representation 1206, and the cascaded output may be provided to an attention module 1270.
[00100] At 1272, the output of the attention module 1254 and the output of the attention module 1270 may be cascaded to influence generation of acoustic features at a decoder 1280. A vocoder 1290 may generate a speech waveform 1208 corresponding to the text 1202 based on the acoustic features.
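Similarly, the following sketch shows only how the FIG.12 variant differs from FIG.11, namely the added context encoder whose output is both added at 1262 and used as a second conditioning stream; the attention modules are reduced to simple broadcasting for brevity, and all module internals and dimensions are assumptions.

```python
# Sketch of the FIG.12 variant: context encoder output is added to the phone
# encoding (1262) and also provides a second conditioning stream (1254/1272).
# Attention is simplified to broadcasting; all internals are assumptions.
import torch
import torch.nn as nn

class ContextTTS(nn.Module):
    def __init__(self, phone_dim=128, ctx_dim=768, hidden=256, spk_dim=64, mel_dim=80):
        super().__init__()
        self.phone_encoder = nn.GRU(phone_dim, hidden, batch_first=True)   # 1240
        self.context_encoder = nn.Linear(ctx_dim, hidden)                   # 1250
        self.style_proj = nn.Linear(hidden, hidden)                         # 1236
        self.decoder = nn.GRU(hidden + spk_dim + hidden, hidden, batch_first=True)
        self.mel_out = nn.Linear(hidden, mel_dim)

    def forward(self, phone_feats, context_embedding, style_latent, speaker_embedding):
        enc, _ = self.phone_encoder(phone_feats)                  # [B, T, hidden]
        ctx = self.context_encoder(context_embedding)             # [B, hidden]
        enc = enc + ctx.unsqueeze(1) + self.style_proj(style_latent).unsqueeze(1)  # add at 1262
        spk = speaker_embedding.unsqueeze(1).expand(-1, enc.size(1), -1)            # cascade at 1264
        ctx_stream = ctx.unsqueeze(1).expand(-1, enc.size(1), -1)                   # second stream
        dec, _ = self.decoder(torch.cat([enc, spk, ctx_stream], dim=-1))            # cascade at 1272
        return self.mel_out(dec)                                                    # acoustic features
```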
[00101] The implementation 1200 in FIG.12 is intended to illustrate an exemplary architecture of speech synthesis employing a context-based TTS model. Through constructing corresponding TTS models for different candidate speakers, a plurality of candidate TTS models may be obtained. In the actual application stage, a candidate speaker corresponding to a character may be determined based on at least one of the character, a character personality, and a character category, and a TTS model trained for the candidate speaker may be further selected to generate a speech waveform. In addition, it should be appreciated that any component and processing in the implementation 1200 are exemplary, and any form of change may be made to the implementation 1200 depending on specific requirements and designs.
[00102] According to the embodiments of the present disclosure, audio content may also be customized. For example, a speech waveform in generated audio content may be adjusted to update the audio content.
[00103] FIG.13 illustrates an exemplary process 1300 for updating audio content according to an embodiment.
[00104] Assume that a user provides text content 1302 and desires to obtain audio content corresponding to the text content 1302. Audio content 1304 corresponding to the text content 1302 may be created through performing audio content generation at 1310. The audio content generation at 1310 may be based on any implementation of the automatic audio content generation according to the embodiments of the present disclosure described above in conjunction with FIGs. 2 to 12.
[00105] The audio content 1304 may be provided to a customization platform 1320. The customization platform 1320 may comprise a user interface for interacting with a user. Through the user interface, the audio content 1304 may be provided and presented to the user, and an adjustment indication 1306 from the user for at least a part of the audio content may be received. For example, if the user is not satisfied with a certain utterance in the audio content 1304 or desires to modify the utterance to have a desired character category, a desired style, etc., the user may input the adjustment indication 1306 through the user interface.
[00106] The adjustment indication 1306 may comprise modification or setting for various parameters involved in speech synthesis. In an implementation, the adjustment indication may comprise adjustment information about prosody information. The prosody information may comprise, e.g., at least one of break, accent, intonation and rate. For example, the user may specify a break before or after a certain word, specify an accent of a certain utterance, change an intonation of a certain word, adjust a rate of a certain utterance, etc. In an implementation, the adjustment indication may comprise adjustment information about pronunciation. For example, the user may specify the correct pronunciation of a certain polyphonic word in current audio content, etc. In an implementation, the adjustment indication may comprise adjustment information about character category. For example, the user may specify a desired character category of "old-aged man" for utterances with the timbre of "middle-aged man". In an implementation, the adjustment indication may comprise adjustment information about style. For example, the user may specify the desired emotion of "happy" for utterances with emotion of "sad". In an implementation, the adjustment indication may comprise adjustment information about acoustic parameters. For example, the user may specify a specific acoustic parameter for a certain utterance. It should be appreciated that only a few examples of the adjustment indication 1306 are listed above, and the adjustment indication 1306 may also include modification or setting for any other parameter that can influence speech synthesis.
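One possible, purely illustrative shape for such an adjustment indication is sketched below; the field names are assumptions chosen only to mirror the parameter kinds listed above.

```python
# Sketch of one possible data shape for an adjustment indication.
# Field names and example values are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class AdjustmentIndication:
    utterance_id: str                                   # which utterance in the audio content
    prosody: dict = field(default_factory=dict)         # e.g. {"break_after": "word3", "rate": 0.9}
    pronunciation: dict = field(default_factory=dict)   # e.g. {"read": "riːd"} for a polyphonic word
    character_category: str | None = None               # e.g. "old-aged man"
    style: str | None = None                            # e.g. "happy"
    acoustic_params: dict = field(default_factory=dict) # any other acoustic parameters
```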
[00107] According to the process 1300, in response to the adjustment indication 1306, the customization platform 1320 may call a TTS model 1330 to regenerate a speech waveform. Assuming that the adjustment indication 1306 is for a certain utterance or a corresponding speech waveform in the audio content 1304, a text corresponding to the speech waveform may be provided to the TTS model 1330 along with the adjustment information in the adjustment indication. The TTS model 1330 may then regenerate the speech waveform 1332 of the text conditioned on the adjustment information. Taking the adjustment indication 1306 including adjustment information about character category as an example, a character category specified in the adjustment indication 1306 may be used to replace the character category designated in, e.g., FIG.2, and the speech waveform may be further generated by the TTS model. In an implementation, since a linguistic feature-based TTS model has an explicit feature input, which may be controlled by parameters corresponding to an adjustment indication, the TTS model 1330 may employ the linguistic feature-based TTS model.
[00108] The previous speech waveform in the audio content 1304 may be replaced with the regenerated speech waveform 1332 to form an updated audio content 1308.
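A sketch of this regeneration-and-replacement step follows; the `synthesize` interface of the TTS model is hypothetical and is shown only to illustrate how the adjustment information conditions regeneration.

```python
# Sketch of the customization flow: pass the affected utterance's text plus the
# adjustment information to a linguistic feature-based TTS model, then splice
# the regenerated waveform back into the audio content.
# `tts_model.synthesize` is a hypothetical interface, not a defined API.
def apply_adjustment(audio_content, text_content, adjustment, tts_model):
    utterance_text = text_content[adjustment.utterance_id]
    new_waveform = tts_model.synthesize(
        utterance_text,
        character_category=adjustment.character_category,
        style=adjustment.style,
        prosody=adjustment.prosody,
        pronunciation=adjustment.pronunciation,
        acoustic_params=adjustment.acoustic_params,
    )
    audio_content[adjustment.utterance_id] = new_waveform   # replace previous waveform
    return audio_content                                     # updated audio content
```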
[00109] The process 1300 may be performed iteratively, thereby realizing continuous adjustment and optimization for the generated audio content. It should be appreciated that any step and processing in the process 1300 are exemplary, and any form of change may be made to the process 1300 depending on specific requirements and designs.
[00110] FIG.14 illustrates a flowchart of an exemplary method 1400 for automatic audio content generation according to an embodiment.
[00111] At 1410, a text may be obtained.
[00112] At 1420, context corresponding to the text may be constructed.
[00113] At 1430, reference factors may be determined based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text.
[00114] At 1440, a speech waveform corresponding to the text may be generated based at least on the text and the reference factors.
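Tying the above steps together, the following end-to-end sketch wires up the hypothetical helpers introduced earlier in this description; it is illustrative only, under the stated assumptions, and not a definitive implementation of the method 1400.

```python
# End-to-end sketch of steps 1410-1440 using the illustrative helpers sketched
# above (embed_context, candidate_features, rank_candidates, select_tts_model).
# The `synthesize` call remains a hypothetical interface.
import numpy as np

def generate_audio_for_text(text, prev_text, next_text,
                            candidates, ltr_weights, speaker_library):
    context = [prev_text, text, next_text]                         # 1420: construct context
    ctx_emb = embed_context(context).numpy()                       # language-model embedding
    feats = np.array([candidate_features(text, prev_text, next_text,
                                         " ".join(context), c) for c in candidates])
    matrix = np.hstack([np.tile(ctx_emb, (len(candidates), 1)), feats])
    character = rank_candidates(matrix, candidates, ltr_weights)   # 1430: reference factors
    tts = select_tts_model(speaker_library, character=character)   # speaker-specific TTS model
    return tts.synthesize(text)                                    # 1440: speech waveform
```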
[00115] In an implementation, the reference factors may further comprise a style corresponding to the text.
[00116] In an implementation, the determining reference factors may comprise: predicting the character category based at least on the context through a prediction model.
[00117] The generating a speech waveform may comprise: generating the speech waveform based at least on the text and the character category through a linguistic feature-based TTS model. The linguistic feature-based TTS model may be previously trained for a target speaker.
[00118] The generating a speech waveform may comprise: generating the speech waveform based at least on the text, the context and the character category through a context-based TTS model. The context-based TTS model may be previously trained for a target speaker.
[00119] In an implementation, the determining reference factors may comprise: extracting a plurality of candidate characters from a text content containing the text; and determining the character from the plurality of candidate characters based at least on the context through an LTR model.
[00120] In an implementation, the generating a speech waveform comprises: selecting a TTS model corresponding to the character from a plurality of previously-trained candidate TTS models, the plurality of candidate TTS models being previously trained for different target speakers respectively; and generating the speech waveform through the selected TTS model.
[00121] The determining reference factors may comprise: predicting the character category based at least on the context through a first prediction model; predicting the character based at least on the context through a second prediction model; and predicting a character personality based at least on the context through a third prediction model. The selecting a TTS model may comprise: selecting the TTS model from the plurality of candidate TTS models based on at least one of the character, the character category and the character personality.
[00122] The selected TTS model may be a linguistic feature-based TTS model, and the generating a speech waveform may comprise: generating the speech waveform based at least on the text through the linguistic feature-based TTS model.
[00123] The selected TTS model may be a context-based TTS model, and the generating a speech waveform may comprise: generating the speech waveform based at least on the text and the context through the context-based TTS model.
[00124] In an implementation, the speech waveform may be generated further based on a style corresponding to the text.
[00125] In an implementation, the method 1400 may further comprise: receiving an adjustment indication for the speech waveform; and in response to the adjustment indication, regenerating a speech waveform corresponding to the text through a linguistic feature-based TTS model.
[00126] The adjustment indication comprises at least one of: adjustment information about prosody information, the prosody information comprising at least one of break, accent, intonation and rate; adjustment information about pronunciation; adjustment information about character category; adjustment information about style; and adjustment information about acoustic parameters.
[00127] It should be appreciated that the method 1400 may further comprise any step/process for automatic audio content generation according to the embodiments of the present disclosure described above.
[00128] FIG.15 illustrates an exemplary apparatus 1500 for automatic audio content generation according to an embodiment.
[00129] The apparatus 1500 may comprise: a text obtaining module 1510, for obtaining a text; a context constructing module 1520, for constructing context corresponding to the text; a reference factor determining module 1530, for determining reference factors based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text; and a speech waveform generating module 1540, for generating a speech waveform corresponding to the text based at least on the text and the reference factors.
[00130] In an implementation, the reference factor determining module 1530 may be for: predicting the character category based at least on the context through a prediction model.
[00131] The speech waveform generating module 1540 may be for: generating the speech waveform based at least on the text and the character category through a linguistic feature-based TTS model. The linguistic feature-based TTS model may be previously trained for a target speaker.
[00132] The speech waveform generating module 1540 may be for: generating the speech waveform based at least on the text, the context and the character category through a context-based TTS model. The context-based TTS model may be previously trained for a target speaker.
[00133] In an implementation, the reference factor determining module 1530 may be for: extracting a plurality of candidate characters from a text content containing the text; and determining the character from the plurality of candidate characters based at least on the context through an LTR model.
[00134] In an implementation, the speech waveform generating module 1540 may be for: selecting a TTS model corresponding to the character from a plurality of previously-trained candidate TTS models, the plurality of candidate TTS models being previously trained for different target speakers respectively; and generating the speech waveform through the selected TTS model.
[00135] In addition, the apparatus 1500 may further comprise any other module that performs steps of the method for automatic audio content generation according to the embodiments of the present disclosure described above.
[00136] FIG.16 illustrates an exemplary apparatus 1600 for automatic audio content generation according to an embodiment.
[00137] The apparatus 1600 may comprise: at least one processor 1610; and a memory 1620 storing computer-executable instructions that, when executed, cause the at least one processor 1610 to: obtain a text; construct context corresponding to the text; determine reference factors based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text; and generate a speech waveform corresponding to the text based at least on the text and the reference factors. In addition, the processor 1610 may further perform any other step/process of the method for automatic audio content generation according to the embodiments of the present disclosure described above.
[00138] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for automatic audio content generation according to the embodiments of the present disclosure as mentioned above.
[00139] It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
[00140] It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
[00141] Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.
[00142] Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may comprise, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).
[00143] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.

Claims

1. A method for automatic audio content generation, comprising: obtaining a text; constructing context corresponding to the text; determining reference factors based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text; and generating a speech waveform corresponding to the text based at least on the text and the reference factors.
2. The method of claim 1, wherein the reference factors further comprise a style corresponding to the text.
3. The method of claim 1, wherein the determining reference factors comprises: predicting the character category based at least on the context through a prediction model.
4. The method of claim 3, wherein the generating a speech waveform comprises: generating the speech waveform based at least on the text and the character category through a linguistic feature-based text-to-speech (TTS) model, wherein the linguistic feature-based TTS model is previously trained for a target speaker.
5. The method of claim 3, wherein the generating a speech waveform comprises: generating the speech waveform based at least on the text, the context and the character category through a context-based text-to-speech (TTS) model, wherein the context-based TTS model is previously trained for a target speaker.
6. The method of claim 1, wherein the determining reference factors comprises: extracting a plurality of candidate characters from a text content containing the text; and determining the character from the plurality of candidate characters based at least on the context through a learning-to-rank (LTR) model.
7. The method of claim 1, wherein the generating a speech waveform comprises: selecting a text-to-speech (TTS) model corresponding to the character from a plurality of previously-trained candidate TTS models, the plurality of candidate TTS models being previously trained for different target speakers respectively; and generating the speech waveform through the selected TTS model.
8. The method of claim 7, wherein the determining reference factors comprises: predicting the character category based at least on the context through a first prediction model; predicting the character based at least on the context through a second prediction model; and predicting a character personality based at least on the context through a third prediction model, and wherein the selecting a TTS model comprises: selecting the TTS model from the plurality of candidate TTS models based on at least one of the character, the character category and the character personality.
9. The method of claim 7, wherein the selected TTS model is a linguistic feature- based TTS model, and the generating a speech waveform comprises: generating the speech waveform based at least on the text through the linguistic feature-based TTS model.
10. The method of claim 7, wherein the selected TTS model is a context-based TTS model, and the generating a speech waveform comprises: generating the speech waveform based at least on the text and the context through the context-based TTS model.
11. The method of any one of claims 4, 5, 9 and 10, wherein the speech waveform is generated further based on a style corresponding to the text.
12. The method of claim 1, further comprising: receiving an adjustment indication for the speech waveform; and in response to the adjustment indication, regenerating a speech waveform corresponding to the text through a linguistic feature-based text-to-speech (TTS) model.
13. The method of claim 12, wherein the adjustment indication comprises at least one of: adjustment information about prosody information, the prosody information comprising at least one of break, accent, intonation and rate; adjustment information about pronunciation; adjustment information about character category; adjustment information about style; and adjustment information about acoustic parameters.
14. An apparatus for automatic audio content generation, comprising: a text obtaining module, for obtaining a text; a context constructing module, for constructing context corresponding to the text; a reference factor determining module, for determining reference factors based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text; and a speech waveform generating module, for generating a speech waveform corresponding to the text based at least on the text and the reference factors.
15. An apparatus for automatic audio content generation, comprising: at least one processor; and a memory storing computer-executable instructions that, when executed, cause the at least one processor to: obtain a text, construct context corresponding to the text, determine reference factors based at least on the context, the reference factors comprising at least a character category and/or a character corresponding to the text, and generate a speech waveform corresponding to the text based at least on the text and the reference factors.
PCT/US2021/028297 2020-05-09 2021-04-21 Automatic audio content generation WO2021231050A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010387249.8 2020-05-09
CN202010387249.8A CN113628609A (en) 2020-05-09 2020-05-09 Automatic audio content generation

Publications (1)

Publication Number Publication Date
WO2021231050A1 true WO2021231050A1 (en) 2021-11-18

Family

ID=75870784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/028297 WO2021231050A1 (en) 2020-05-09 2021-04-21 Automatic audio content generation

Country Status (2)

Country Link
CN (1) CN113628609A (en)
WO (1) WO2021231050A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620699B (en) * 2022-12-19 2023-03-31 深圳元象信息科技有限公司 Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279347A1 (en) * 2014-03-27 2015-10-01 International Business Machines Corporation Text-to-Speech for Digital Literature
US20190043474A1 (en) * 2017-08-07 2019-02-07 Lenovo (Singapore) Pte. Ltd. Generating audio rendering from textual content based on character models
WO2020018724A1 (en) * 2018-07-19 2020-01-23 Dolby International Ab Method and system for creating object-based audio content

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8219398B2 (en) * 2005-03-28 2012-07-10 Lessac Technologies, Inc. Computerized speech synthesizer for synthesizing speech from text
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Voice broadcasting method and system for text
CN110491365A (en) * 2018-05-10 2019-11-22 微软技术许可有限责任公司 Audio is generated for plain text document
TWI685835B (en) * 2018-10-26 2020-02-21 財團法人資訊工業策進會 Audio playback device and audio playback method thereof
CN110634336A (en) * 2019-08-22 2019-12-31 北京达佳互联信息技术有限公司 Method and device for generating audio electronic book

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ELSON DAVID K ET AL: "Automatic Attribution of Quoted Speech in Literary Narrative. Natural Language Processing View project", PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, AAAI 2010, 11 July 2010 (2010-07-11), pages 1013 - 1019, XP055822475, Retrieved from the Internet <URL:https://www.researchgate.net/publication/221605789_Automatic_Attribution_of_Quoted_Speech_in_Literary_Narrative> [retrieved on 20210708] *
TIE-YAN LIU ET AL: "Introduction to special issue on learning to rank for information retrieval", INFORMATION RETRIEVAL, KLUWER ACADEMIC PUBLISHERS, BO, vol. 13, no. 3, 20 December 2009 (2009-12-20), pages 197 - 200, XP019820210, ISSN: 1573-7659 *

Also Published As

Publication number Publication date
CN113628609A (en) 2021-11-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21724483; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21724483; Country of ref document: EP; Kind code of ref document: A1)