CN113628609A - Automatic audio content generation


Info

Publication number
CN113628609A
Authority
CN
China
Prior art keywords
text
context
model
tts
role
Prior art date
Legal status
Pending
Application number
CN202010387249.8A
Other languages
Chinese (zh)
Inventor
汪曦
张少飞
肖雨佳
刘越颖
何磊
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to CN202010387249.8A
Priority to PCT/US2021/028297 (WO2021231050A1)
Publication of CN113628609A


Classifications

    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G06F40/30 Semantic analysis (handling natural language data)

Abstract

The present disclosure provides methods and apparatus for automatic audio content generation. Text may be obtained. A context corresponding to the text may be constructed. Reference factors can be determined based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text. A speech waveform corresponding to the text may be generated based at least on the text and the reference factors.

Description

Automatic audio content generation
Background
Text-to-speech (TTS) synthesis aims at generating corresponding speech waveforms based on text input. Conventional TTS models or systems may predict acoustic features based on text input and, in turn, generate speech waveforms based on the predicted acoustic features. A TTS model may be applied to convert various types of text content into audio content, for example, converting a book in text format into an audio book (audiobook), and the like.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure propose methods and apparatuses for automatic audio content generation. Text may be obtained. A context corresponding to the text may be constructed. Reference factors can be determined based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text. A speech waveform corresponding to the text may be generated based at least on the text and the reference factors.
It should be noted that one or more of the above aspects include features that are specifically pointed out in the following detailed description and claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will hereinafter be described in conjunction with the appended drawings, which are provided to illustrate, but not to limit, the disclosed aspects.
Fig. 1 shows an exemplary conventional TTS model.
Fig. 2 illustrates an exemplary process of automatic audio content generation according to an embodiment.
Fig. 3 illustrates an exemplary process of automatic audio content generation according to an embodiment.
Fig. 4 shows an exemplary process of preparing training data according to an embodiment.
Fig. 5 illustrates an exemplary process of predicting role categories and styles, according to an embodiment.
FIG. 6 illustrates an exemplary implementation of speech synthesis using a TTS model based on language features, according to an embodiment.
Fig. 7 shows an exemplary implementation of an encoder in a TTS model based on language features according to an embodiment.
FIG. 8 illustrates an exemplary implementation of speech synthesis using a context-based TTS model, according to an embodiment.
Fig. 9 shows an exemplary implementation of a context encoder in a context-based TTS model.
Fig. 10 illustrates an exemplary process of predicting roles and selecting a TTS model according to an embodiment.
FIG. 11 illustrates an exemplary implementation of speech synthesis using a TTS model based on language features, according to an embodiment.
FIG. 12 illustrates an exemplary implementation of speech synthesis using a context-based TTS model, according to embodiments.
Fig. 13 illustrates an exemplary process of updating audio content according to an embodiment.
Fig. 14 shows a flow of an exemplary method for automatic audio content generation, according to an embodiment.
Fig. 15 illustrates an exemplary apparatus for automatic audio content generation, according to an embodiment.
Fig. 16 illustrates an exemplary apparatus for automatic audio content generation, according to an embodiment.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary embodiments. It is to be understood that the discussion of these embodiments is merely intended to enable those skilled in the art to better understand and thereby practice the embodiments of the present disclosure, and does not suggest any limitation on the scope of the present disclosure.
Audio books are increasingly commonly used for entertainment and education. Traditional audio books are recorded manually. For example, a professional speaker (narrator) or a dubbing actor (voice actor) reads text content prepared in advance, and an audio book corresponding to the text content is obtained by recording the speaker's narration. Recording an audio book in this way is very time-consuming and costly, so corresponding audio books cannot be obtained in a timely manner for the large number of available text books.
TTS synthesis can improve the efficiency of audio book production and reduce costs. Most TTS models synthesize speech separately for each text sentence. Speech synthesized in this way typically has a single, flat prosody and thus sounds tedious. When such single-prosody speech is applied repeatedly throughout an audio book, the quality of the audio book is significantly reduced. In particular, if TTS synthesis uses only a single speaker's voice for the entire audio book, the monotonous phonetic presentation further reduces the appeal of the audio book.
Embodiments of the present disclosure propose performing automatic and high quality audio content generation for textual content. In this context, text content may broadly refer to any content in text form, such as books, scripts, articles, etc., while audio content may broadly refer to any content in audio form, such as audio books, dubbing of videos, news broadcasts, etc. Although the conversion of a text storybook into an audio book is exemplified in various portions of the following discussion, it should be understood that embodiments of the present disclosure can also be applied to the conversion of any other form of text content into any other form of audio content.
Embodiments of the present disclosure may build a context for one text sentence in the text content and use the context for TTS synthesis of the text sentence, rather than just considering the text sentence itself in the TTS synthesis. Generally, the context of a text sentence can provide rich expression information about the text sentence, which can be used as a reference factor for TTS synthesis, so that the synthesized speech is more expressive, more vivid, more diverse, and the like. Various reference factors, such as a role, a role category, a style, a role personality, etc., corresponding to the text sentence may be determined based on the context. In this context, a character may refer to a specific person, anthropomorphic animal, anthropomorphic object, etc. having conversation capability that appears in the text content. For example, assuming that the textual content relates to a story that occurs between two people named "Mike" and "Mary," it can be considered that "Mike" and "Mary" are two characters in the textual content. For example, assuming that the textual content relates to a story occurring between a queen, a princess, and a witch, "queen," "princess," and "witch" may be considered characters in the textual content. The character category may refer to a category attribute of the character, such as gender, age, and the like. Style may refer to an emotional type, e.g., happy, sad, etc., to which the text sentence corresponds. Character personality may refer to a personality that is modeled for a character in the text content, e.g., gentle, cheerful, evil, etc. By taking these reference factors into account in TTS synthesis, it is possible to synthesize voices having different voice characteristics, such as voices having different timbres, voice styles, and the like, respectively, for different characters, character categories, styles, and the like. Thus, expressiveness, vividness, and diversity of the synthesized speech can be enhanced, thereby significantly improving the quality of the synthesized speech.
In one aspect, embodiments of the present disclosure may be applied to a scenario in which audio content is synthesized with the voice of a single speaker, which may also be referred to as single speaker audio content generation. The individual speaker may be a pre-designated target speaker, and the voice of the target speaker may be employed to simulate or play different types of roles in the textual content. The character categories may be considered in speech synthesis so that speech corresponding to different character categories, e.g., speech corresponding to young males, speech corresponding to older females, etc., may be generated using the target speaker's voice. Alternatively, styles may also be considered in speech synthesis so that different styles of speech may be generated using the target speaker's voice, e.g., speech corresponding to the emotion type "happy," speech corresponding to the emotion type "sad," etc. By considering the character category and style in the scene of the single speaker audio content generation, the expressiveness, vividness, and the like of the synthesized voice can be enhanced.
In one aspect, embodiments of the present disclosure may be applied to a scenario in which audio content is synthesized with the voices of multiple speakers, which may also be referred to as multi-speaker audio content generation. The voices of different speakers may be used separately for different characters in the text content. These speakers may be predetermined candidate speakers having different attributes. For a particular character, the speaker's voice can be determined with reference to at least the character category, character personality, and the like. For example, assuming the character Mike is a young male with a good personality, the voice of a speaker with attributes of < young >, < male >, < bright >, etc. may be selected to generate Mike's voice. By automatically assigning different speakers' voices to different characters in speech synthesis, the diversity of the audio content and the like can be enhanced. Alternatively, styles may also be considered in speech synthesis, so that voices of different styles may be generated using the voice of the speaker corresponding to a character, thereby enhancing expressiveness, vividness, and the like of the synthesized voices.
Embodiments of the present disclosure may employ various TTS models to synthesize speech in consideration of the above-mentioned reference factors. In one aspect, a TTS model based on language (linguistic) features may be employed. In the scenario of single speaker audio content generation, a language feature-based TTS model may be trained with a corpus of the target speaker's voice, where the model may generate speech taking into account at least role categories, optional styles, and the like. In a scenario of multi-speaker audio content generation, different versions of a language feature-based TTS model may be trained with the voice corpora of different candidate speakers, and a corresponding version of the model may be selected for a particular character to generate speech for that character, or further, different styles of speech may be generated for that character by considering style. In another aspect, a context-based TTS model may be employed. In the scenario of single speaker audio content generation, a context-based TTS model may be trained with a corpus of the target speaker's voice, where the model may generate speech taking into account at least the context of the text sentences, role categories, optional styles, and the like. In a scenario of multi-speaker audio content generation, different versions of a context-based TTS model may be trained with the voice corpora of different candidate speakers, and a corresponding version of the model may be selected for a particular character to generate speech for that character, or further, different styles of speech may be generated for that character by considering style.
Embodiments of the present disclosure also provide a flexible customization mechanism for audio content. For example, a user may adjust or customize audio content through a visual customization platform. Various parameters involved in speech synthesis may be modified or set to adjust any portion of the audio content so that a particular utterance in the audio content may have a desired character category, a desired style, and the like. Since the TTS model based on language features has explicit feature input, it can be used to update audio content in response to a user's adjustment indication.
Embodiments of the present disclosure can flexibly use a language feature-based TTS model and/or a context-based TTS model to automatically generate high-quality audio content. The TTS model based on language features may generate high quality speech by considering reference factors determined based on context, and may be used to adjust or update the generated audio content. The context-based TTS model considers not only the reference factors determined based on the context but also the context features extracted from the context itself in speech synthesis, so that the speech synthesis for long texts can be more coordinated. In the audio content generated according to the embodiment of the present disclosure, the words of the character will have stronger expressive power, vividness and diversity, so that the attraction, interest and the like of the audio content can be significantly improved. Automatic audio content generation according to embodiments of the present disclosure is fast and low cost. Furthermore, since the embodiments of the present disclosure convert text contents into high-quality audio contents in a fully automatic manner, the barrier to audio content creation is further lowered, so that not only professional dubbing actors but also general users can conveniently and quickly perform their own unique audio content creation.
FIG. 1 shows an exemplary conventional TTS model 100.
TTS model 100 may be configured to receive text 102 and generate speech waveforms 108 corresponding to text 102. The text 102 may also be referred to as a textual statement, which may include one or more words, phrases, sentences, paragraphs, etc., and the terms "text" and "textual statement" may be used interchangeably herein. It should be understood that although text 102 is shown in FIG. 1 as being provided to TTS model 100, text 102 may also be first converted to a sequence of elements, such as a sequence of phonemes, a sequence of graphemes, a sequence of characters, etc., and then provided to TTS model 100 as input. Herein, the input "text" may broadly refer to words, phrases, sentences, etc. included in the text, or element sequences obtained from the text, such as phoneme sequences, grapheme sequences, character sequences, etc.
TTS model 100 may include an acoustic model 110. The acoustic model 110 may predict or generate the acoustic features 106 from the text 102. The acoustic features 106 may include various TTS acoustic features, such as a mel spectrum, Line Spectral Pairs (LSP), and so forth. The acoustic model 110 can be based on various model architectures, e.g., a sequence-to-sequence model architecture, and so forth. Fig. 1 shows an exemplary sequence-to-sequence acoustic model 110, which may include an encoder 112, an attention module 114, and a decoder 116.
The encoder 112 may transform the information contained in the text 102 into a space that is more robust and more suitable for learning alignment with acoustic features. For example, the encoder 112 may convert information in the text 102 into a sequence of states in the space, which may also be referred to as an encoder state sequence. Each state in the sequence of states corresponds to a phoneme, grapheme, character, etc. in the text 102.
The attention module 114 may implement an attention mechanism. This attention mechanism establishes a connection between the encoder 112 and the decoder 116 to facilitate alignment between the text features and the acoustic features output by the encoder 112. For example, a connection between each decoding step and an encoder state may be established, which may indicate to which encoder state each decoding step should correspond with what weight. The attention module 114 may take as input the encoder state sequence and the output of the previous step of the decoder and generate an attention vector representing the weights with which the next decoding step is aligned to each encoder state.
The decoder 116 may map the state sequence output by the encoder 112 to the acoustic features 106 under the influence of an attention mechanism in the attention module 114. At each decoding step, the decoder 116 may take as input the attention vector output by the attention module 114 and the output of the previous step of the decoder, and output the acoustic features of the frame or frames, e.g., mel-frequency spectra.
TTS model 100 may include a vocoder 120. The vocoder 120 may generate the speech waveform 108 based on the acoustic features 106 predicted by the acoustic model 110.
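To make the pipeline of Fig. 1 concrete, the following is a minimal PyTorch-style sketch of a sequence-to-sequence acoustic model (encoder, attention module, decoder) followed by a vocoder stub. It is an illustration only, not the disclosure's implementation; names such as PHONEME_VOCAB and MEL_DIM are assumed dimensions.

```python
import torch
import torch.nn as nn

PHONEME_VOCAB = 100   # hypothetical phoneme inventory size
MEL_DIM = 80          # hypothetical mel-spectrogram dimension

class AcousticModel(nn.Module):
    """Sequence-to-sequence acoustic model: encoder + attention + decoder (cf. Fig. 1)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(PHONEME_VOCAB, hidden)           # phoneme embedding
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True,
                               bidirectional=True)                 # encoder state sequence
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                          batch_first=True)        # alignment between decoder steps and encoder states
        self.decoder = nn.LSTMCell(2 * hidden + MEL_DIM, 2 * hidden)
        self.to_mel = nn.Linear(2 * hidden, MEL_DIM)                # project decoder state to acoustic features

    def forward(self, phoneme_ids, num_frames):
        memory, _ = self.encoder(self.embed(phoneme_ids))           # (B, T_text, 2H)
        b = phoneme_ids.size(0)
        h = torch.zeros(b, memory.size(-1)); c = torch.zeros_like(h)
        prev_frame = torch.zeros(b, MEL_DIM)
        mels = []
        for _ in range(num_frames):                                 # one acoustic frame per decoding step
            ctx, _ = self.attn(h.unsqueeze(1), memory, memory)      # attention vector for this step
            h, c = self.decoder(torch.cat([ctx.squeeze(1), prev_frame], dim=-1), (h, c))
            prev_frame = self.to_mel(h)
            mels.append(prev_frame)
        return torch.stack(mels, dim=1)                             # (B, num_frames, MEL_DIM)

def vocoder(mel):
    """Placeholder: a real system would use e.g. a neural vocoder to map mels to a waveform."""
    return torch.zeros(mel.size(0), mel.size(1) * 256)              # dummy waveform

model = AcousticModel()
mel = model(torch.randint(0, PHONEME_VOCAB, (1, 12)), num_frames=20)
wav = vocoder(mel)
```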
Fig. 2 illustrates an exemplary process 200 of automatic audio content generation, according to an embodiment. The process 200 may be applied to a scenario of single speaker audio content generation.
The text content 210, e.g., a text storybook, is the object processed by automatic audio content generation according to an embodiment; audio content, e.g., an audio book, is generated by performing the process 200 on each of a plurality of texts included in the text content 210. It is assumed that text 212 is currently taken from the text content 210 and that a speech waveform corresponding to the text 212 is intended to be generated by performing process 200.
At 220, a context 222 corresponding to the text 212 may be constructed. In one implementation, context 222 may include one or more texts adjacent to text 212 in text content 210. For example, context 222 may include at least one sentence before text 212 and/or at least one sentence after text 212. Thus, context 222 is actually a sequence of sentences corresponding to text 212. Optionally, context 222 may also include more text or all text in text content 210.
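As an illustration of the context construction at 220, the sketch below (not part of the disclosure) builds a window of adjacent sentences around the current text; the function name and window sizes are assumptions.

```python
from typing import List

def build_context(sentences: List[str], index: int,
                  num_before: int = 2, num_after: int = 2) -> List[str]:
    """Return a sentence sequence around sentences[index]: up to num_before
    preceding sentences, the current sentence, and up to num_after following ones."""
    start = max(0, index - num_before)
    end = min(len(sentences), index + num_after + 1)
    return sentences[start:end]

# Example: context for the third sentence of a short story
story = ["Mike opened the door.", '"Who is there?" he asked.',
         '"It is me," Mary said.', "She stepped inside.", "The room was dark."]
print(build_context(story, index=2))
```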
At 230, a reference factor 232 may be determined based at least on the context 222. Reference factors 232 may affect the characteristics of the synthesized speech in subsequent TTS speech synthesis. Reference factors 232 may include a character category corresponding to text 212 indicating attributes such as age, gender, etc. of the character corresponding to text 212. For example, if the text 212 is an utterance spoken by a young male, the character category corresponding to the text 212 may be determined to be < young >, < male >, or the like. In general, different role categories may correspond to different speech characteristics. Optionally, reference factors 232 may also include a style corresponding to text 212 indicating, for example, what emotion type text 212 was spoken with. For example, if text 212 is an utterance spoken by a character with an angry emotion, the style corresponding to the text 212 may be determined to be < angry >. In general, different styles may correspond to different speech characteristics. The character categories and styles may affect the characteristics of the synthesized speech, either individually or in combination. In one implementation, the role categories, styles, etc. can be predicted based on the context 222 through a pre-trained predictive model at 230.
In accordance with process 200, a TTS model 240 pre-trained for a target speaker may be employed to generate speech waveforms. The target speaker may be a speaker automatically determined in advance or a speaker designated by the user. TTS model 240 may synthesize speech using the target speaker's voice. In one implementation, TTS model 240 may be a language feature-based TTS model, where, unlike conventional language feature-based TTS models, language feature-based TTS model 240 may synthesize speech in consideration of at least a reference factor. The language-feature-based TTS model 240 may generate a speech waveform 250 corresponding to the text 212 based on at least the text 212 and the role category if the reference factors 232 include the role category, or may generate a speech waveform 250 corresponding to the text 212 based on at least the text 212, the role category, and the style if the reference factors 232 include both the role category and the style. In one implementation, TTS model 240 may be a context-based TTS model, where context-based TTS model 240 may synthesize speech considering at least reference factors, unlike conventional context-based TTS models. Context-based TTS model 240 may generate speech waveform 250 corresponding to text 212 based on at least text 212, context 222, and role category where reference factors 232 include a role category, or may generate speech waveform 250 corresponding to text 212 based on at least text 212, context 222, role category, and style where reference factors 232 include both a role category and style.
In a similar manner, a plurality of speech waveforms corresponding to a plurality of texts included in the text content 210 may be generated by the process 200. All of these speech waveforms may together form audio content corresponding to the textual content 210. The audio content may include different character categories and/or different styles of speech synthesized using the target speaker's voice.
Fig. 3 illustrates an exemplary process 300 of automatic audio content generation according to an embodiment. The process 300 may be applied to a scenario of multi-speaker audio content generation.
Assume that the current text 312 is taken from the text content 310 and that the speech waveform corresponding to the text 312 is intended to be generated by performing the process 300.
At 320, a context 322 corresponding to the text 312 can be constructed. In one implementation, context 322 may include one or more texts adjacent to text 312 in text content 310. Optionally, context 322 may also include more text or all text in textual content 310.
At 330, reference factors 332 can be determined based at least on the context 322, which are used to influence the characteristics of the synthesized speech in the subsequent TTS speech synthesis. Reference factors 332 may include a role category corresponding to text 312. Reference factors 332 may also include a character personality corresponding to text 312, which indicates the personality of the character to which text 312 corresponds. For example, if the text 312 is an utterance spoken by an evil witch, the character personality corresponding to the text 312 may be determined to be < evil >. In general, different personalities may correspond to different voice characteristics. Reference factors 332 may also include a character corresponding to text 312, indicating which character in textual content 310 spoke text 312. In general, different characters may employ different voices. Optionally, reference factors 332 may also include a style corresponding to text 312. In one implementation, different reference factors may be predicted based on context 322 by different pre-trained predictive models at 330. These predictive models may include, for example, a predictive model for predicting character categories and styles, a predictive model for predicting character personality, a predictive model for predicting characters, and the like.
In accordance with process 300, a TTS model to be used may be selected at 340 from a library 350 of pre-prepared candidate TTS models. The candidate TTS model library 350 may include a plurality of candidate TTS models pre-trained for different candidate speakers. Each candidate speaker may have attributes in terms of at least one of a role category, a role personality, a role, and the like. For example, attributes of candidate speaker 1 may include < old >, < female >, < evil > and < witch >, attributes of candidate speaker 2 may include < middle age >, < male > and < cheerful >, and so on. Candidate speakers corresponding to text 312 may be determined using at least one of the role category, role personality, and role in reference factors 332, and the TTS model corresponding to the determined candidate speaker may be selected accordingly.
Assume that TTS model 360 is selected from candidate TTS model library 350 at 340 for generating speech waveforms for text 312. TTS model 360 may synthesize speech using the voice of the speaker corresponding to the model. In one implementation, TTS model 360 may be a language feature-based TTS model that may generate speech waveform 370 corresponding to text 312 based at least on text 312. Where reference factors 332 include style, speech waveform 370 may be generated by language feature-based TTS model 360 further based on style. In one implementation, TTS model 360 may be a context-based TTS model that may generate speech waveform 370 corresponding to text 312 based at least on text 312 and context 322. Where reference factors 332 include style, speech waveform 370 may be generated by context-based TTS model 360 further based on style.
In a similar manner, a plurality of speech waveforms corresponding to a plurality of texts included in the text content 310 may be generated by the process 300. All of these speech waveforms may together form audio content corresponding to textual content 310. The audio content may include speech synthesized with the voices of different speakers automatically assigned to different roles, and optionally, the speech may have different styles.
Fig. 4 illustrates an exemplary process 400 of preparing training data according to an embodiment.
Sets of matching audio content 402 and text content 404, e.g., audio books and corresponding text storybooks, can be obtained in advance.
At 410, automatic segmentation may be performed on the audio content 402. For example, the audio content 402 may be automatically divided into a plurality of audio segments, each of which may correspond to one or more speech utterances. The automatic segmentation at 410 may be performed by any known audio segmentation technique.
At 420, post-processing may be performed on the divided plurality of audio segments using the textual content 404. In one aspect, post-processing at 420 may include re-segmentation based on utterance integrity using the textual content 404. For example, the textual content 404 may be readily divided into a plurality of text sentences by any known text segmentation technique, and the audio segment corresponding to each text sentence may then be determined with reference to that text sentence. For each text sentence, one or more of the audio segments obtained at 410 may be split or combined to match the text sentence. Accordingly, the audio segments can be aligned with the text sentences in time, forming a plurality of < text sentence, audio segment > pairs. In another aspect, post-processing at 420 may include classifying the audio segments into voice-over and dialogue. For example, by classifying a text sentence as voice-over or dialogue, the audio segment corresponding to that text sentence can be classified accordingly.
At 430, a tag may be added for each < text sentence, audio segment > pair relating to dialogue. The tags may include, for example, a role category, a style, and the like. In one case, the role category, style, etc. of each < text sentence, audio segment > pair can be determined by means of automatic clustering. In another case, the role category, style, etc. of each < text sentence, audio segment > pair may be manually labeled.
Through the process 400, a set of labeled training data 406 may be ultimately obtained. Each piece of training data may have a form such as < text sentence, audio segment, character category, style >. The training data 406 may in turn be applied to train a prediction model for predicting a character category and style corresponding to text, a TTS model for generating speech waveforms, and the like.
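For illustration only, a labeled training record produced by process 400 might be represented as follows; the field names and dataclass layout are assumptions rather than a format defined by the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingExample:
    text_sentence: str            # text aligned with the audio segment
    audio_path: str               # audio segment cut from the audio book
    is_dialogue: bool             # voice-over vs. dialogue classification (420)
    role_category: Optional[str]  # e.g. "young male"; labeled at 430 for dialogue
    style: Optional[str]          # e.g. "happy", "sad"; labeled at 430 for dialogue

examples = [
    TrainingExample('"Who is there?" he asked.', "seg_0113.wav",
                    is_dialogue=True, role_category="young male", style="nervous"),
    TrainingExample("The room was dark.", "seg_0114.wav",
                    is_dialogue=False, role_category=None, style=None),
]
```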
It should be appreciated that the above process 400 is merely exemplary, and that any other form of training data may be prepared in a similar manner depending on the particular application scenario and design. For example, the tags added at 430 can also include roles, role personalities, and the like.
Fig. 5 illustrates an exemplary process 500 of predicting role categories and styles, according to an embodiment. Process 500 may be performed by a predictive model for predicting role categories and styles. The predictive model may automatically assign role categories and styles to the text.
For text 502, a context corresponding to text 502 can be constructed at 510. The processing at 510 may be similar to the processing at 220 in fig. 2.
The constructed context may be provided to a pre-trained language model 520. Language model 520 is used to model and represent textual information, and may be trained to generate a latent-space representation (e.g., an embedded expression) for the input text. Language model 520 may be based on any suitable technique, such as Bidirectional Encoder Representations from Transformers (BERT), and the like.
The embedded expressions output by the language model 520 may be provided to a mapping (projection) layer 530 and a softmax layer 540, in that order. Mapping layer 530 may convert the embedded expression into a mapped expression, and softmax layer 540 may calculate probabilities of different character categories and probabilities of different styles based on the mapped expression, thereby ultimately determining character category 504 and style 506 corresponding to text 502.
The predictive model used to perform process 500 may be trained using training data obtained through process 400 of fig. 4. For example, the predictive model may be trained using training data in the form of < text, role category, style >. In the stage of applying the trained predictive model, the predictive model may predict a character category and style corresponding to the text based on the input text.
Although the predictive models described above in connection with process 500 may jointly predict character categories and styles, it should be understood that separate predictive models may be employed to predict character categories and styles, respectively, by a similar process.
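For illustration, the prediction model of Fig. 5 could be approximated by a BERT encoder followed by a projection (mapping) layer and two softmax heads, as sketched below with the Hugging Face transformers library; the checkpoint name, label sets, and head sizes are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

ROLE_CATEGORIES = ["young male", "young female", "old male", "old female", "child"]  # assumed label set
STYLES = ["neutral", "happy", "sad", "angry"]                                        # assumed label set

class RoleCategoryStylePredictor(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased"):
        super().__init__()
        self.lm = AutoModel.from_pretrained(checkpoint)        # language model 520
        hidden = self.lm.config.hidden_size
        self.projection = nn.Linear(hidden, hidden)            # mapping layer 530
        self.role_head = nn.Linear(hidden, len(ROLE_CATEGORIES))
        self.style_head = nn.Linear(hidden, len(STYLES))

    def forward(self, **encoded):
        cls = self.lm(**encoded).last_hidden_state[:, 0]       # embedded expression of the context
        mapped = torch.tanh(self.projection(cls))
        # softmax layer 540: probabilities over role categories and over styles
        return (self.role_head(mapped).softmax(-1),
                self.style_head(mapped).softmax(-1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = RoleCategoryStylePredictor()
context = 'Mike opened the door. "Who is there?" he asked. "It is me," Mary said.'
role_probs, style_probs = model(**tokenizer(context, return_tensors="pt"))
print(ROLE_CATEGORIES[role_probs.argmax(-1).item()], STYLES[style_probs.argmax(-1).item()])
```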
FIG. 6 illustrates an exemplary implementation 600 for speech synthesis using a TTS model based on language features, according to an embodiment. This implementation 600 may be applied to the scenario of the generation of the single speaker audio content, which may be considered an exemplary specific implementation of the process 200 of FIG. 2. Assume that in fig. 6 it is desired to generate speech waveforms for text 602.
Styles 612 and role categories 614 corresponding to text 602 can be predicted by predictive model 610. The predictive model 610 may make predictions based on, for example, the process 500 of fig. 5. Role category embedded expressions 616 corresponding to role categories 614 may be obtained, for example, by a role category embedding look-up table (LUT). At 618, the styles 612 can be concatenated with the role category embedded expressions 616 to obtain a concatenated expression. At 620, a corresponding implicit expression can be generated based on the concatenated expression. The implicit expression generation at 620 may be performed in various ways, such as a Gaussian mixture variational auto-encoder (GMVAE), a vector-quantized VAE (VQ-VAE), a VAE, Global Style Tokens (GST), and so on. Taking GMVAE as an example, it can learn the distribution of latent variables; in the model application phase, the implicit expression can be sampled from the posterior probability, or the prior mean can be used directly as the implicit expression.
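As a simplified stand-in for the implicit expression generation at 620, the sketch below concatenates a style vector with a role category embedding and passes the result through a plain VAE-style latent module rather than a full GMVAE; all dimensions and names are assumptions. At inference, one can either sample from the predicted distribution or take the mean directly, mirroring the options mentioned above.

```python
import torch
import torch.nn as nn

NUM_ROLE_CATEGORIES, STYLE_DIM, LATENT_DIM = 5, 8, 16   # assumed sizes

class ImplicitExpression(nn.Module):
    """Concatenate style + role-category embedding and map to a latent ('implicit') expression."""
    def __init__(self):
        super().__init__()
        self.role_lut = nn.Embedding(NUM_ROLE_CATEGORIES, 32)   # role category embedding LUT (616)
        self.to_mu = nn.Linear(32 + STYLE_DIM, LATENT_DIM)
        self.to_logvar = nn.Linear(32 + STYLE_DIM, LATENT_DIM)

    def forward(self, role_category_id, style_vec, sample=True):
        concat = torch.cat([self.role_lut(role_category_id), style_vec], dim=-1)  # concatenation at 618
        mu, logvar = self.to_mu(concat), self.to_logvar(concat)
        if sample:                                              # reparameterized sampling
            return mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return mu                                               # use the mean directly

module = ImplicitExpression()
z = module(torch.tensor([2]), torch.randn(1, STYLE_DIM), sample=False)
print(z.shape)  # torch.Size([1, 16])
```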
At 630, front end analysis may be performed on the text 602 to extract phoneme features 632 and prosody (prosody) features 634. The front-end analysis at 630 may be performed using any known TTS front-end analysis technique. The phoneme feature 632 may refer to a sequence of phonemes extracted from the text 602. Prosodic features 634 may refer to prosodic information corresponding to text 602, such as pauses (break), accents (accent), rates, and so forth.
The phoneme features 632 and prosodic features 634 may be encoded using an encoder 640. The encoder 640 may be based on any architecture. As an example, one example of an encoder 640 is given in fig. 7. Fig. 7 shows an exemplary implementation of the encoder 710 in a language feature-based TTS model according to an embodiment. The encoder 710 may correspond to the encoder 640 in fig. 6. The encoder 710 may encode the phoneme features 702 and the prosodic features 704, where the phoneme features 702 and the prosodic features 704 may correspond to the phoneme features 632 and the prosodic features 634 in fig. 6, respectively. The phoneme features 702 and prosodic features 704 may be processed sequentially through a 1-D convolution filter 712, a max-pooling layer 714, and a 1-D convolution mapping 716 for feature extraction. At 718, the output of the 1-D convolution mapping 716 may be superimposed (added) with the phoneme features 702 and prosodic features 704. The superimposed output at 718 may then be processed through a highway network layer 722 and a Bidirectional Long Short-Term Memory (BLSTM) layer 724 to obtain the encoder output. It should be understood that the architecture and all components in fig. 7 are exemplary, and that the encoder 710 may have any other implementation depending on the particular needs and design.
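A rough sketch of the Fig. 7 encoder structure under assumed dimensions is given below: 1-D convolution, max pooling (stride 1 so the sequence length can be restored for the residual superposition), a 1-D convolutional mapping, superposition with the input features, a highway layer, and a BLSTM.

```python
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN = 96, 128   # assumed feature and hidden sizes

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)
        self.t = nn.Linear(dim, dim)

    def forward(self, x):
        gate = torch.sigmoid(self.t(x))
        return gate * torch.relu(self.h(x)) + (1 - gate) * x

class LinguisticEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(FEAT_DIM, FEAT_DIM, kernel_size=5, padding=2)      # 1-D convolution filter (712)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)             # max-pooling layer (714)
        self.conv_map = nn.Conv1d(FEAT_DIM, FEAT_DIM, kernel_size=3, padding=1)  # 1-D convolution mapping (716)
        self.highway = Highway(FEAT_DIM)                                         # highway network layer (722)
        self.blstm = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True, bidirectional=True)  # BLSTM layer (724)

    def forward(self, phoneme_and_prosody):                  # (B, T, FEAT_DIM): phoneme + prosody features
        x = phoneme_and_prosody.transpose(1, 2)              # to (B, FEAT_DIM, T) for Conv1d
        x = self.conv_map(self.pool(torch.relu(self.conv(x))))
        x = x.transpose(1, 2)[:, :phoneme_and_prosody.size(1)]   # back to (B, T, FEAT_DIM)
        x = x + phoneme_and_prosody                           # superposition with input features (718)
        out, _ = self.blstm(self.highway(x))
        return out                                            # encoder output, (B, T, 2 * HIDDEN)

enc = LinguisticEncoder()
print(enc(torch.randn(1, 20, FEAT_DIM)).shape)   # torch.Size([1, 20, 256])
```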
According to process 600, at 642, the output of encoder 640 and the implicit expression obtained at 620 can be superimposed.
The language feature based TTS model in FIG. 6 may be used for single speaker audio content generation. Thus, the TTS model may be trained for the target speaker 604. The speaker-embedded representation 606 corresponding to the target speaker 604 may be obtained, for example, by a speaker-embedded LUT, and the speaker-embedded representation 606 may be used to influence the TTS model for speech synthesis with the voice of the target speaker 604. At 644, the superimposed output at 642 may be concatenated with the speaker-embedded representation 606, and the concatenated output may be provided to attention module 650.
The decoder 660 may generate acoustic features, e.g., mel-frequency spectral features, etc., under the influence of the attention module 650. The vocoder 670 may generate the speech waveform 608 corresponding to the text 602 based on the acoustic features.
Implementation 600 in FIG. 6 is intended to illustrate an exemplary architecture for speech synthesis using a TTS model based on linguistic features. Model training may be performed using at least training data obtained by, for example, process 400 of fig. 4. For example, through process 400, text and speech waveform pairs, and corresponding indicia of character category, style, etc., may be obtained. It should be appreciated that alternatively, in the actual application phase, the input of speaker-embedded expressions may be omitted, as the TTS model has been trained to synthesize speech based on the voice of the target speaker. Further, it should be appreciated that any of the components and processes in implementation 600 are exemplary and that implementation 600 may be modified in any manner depending on the particular needs and design.
FIG. 8 illustrates an exemplary implementation 800 for speech synthesis using a context-based TTS model according to an embodiment. This implementation 800 may be applied to the scenario of the generation of the single speaker audio content, which may be considered an exemplary specific implementation of the process 200 of FIG. 2. Implementation 800 is similar to implementation 600 in FIG. 6, except that the TTS model employs context coding. Assume that in fig. 8 it is desired to generate a speech waveform for text 802.
Styles 812 and character categories 814 corresponding to text 802 can be predicted by predictive models 810. The predictive model 810 may be similar to the predictive model 610 of fig. 6. A role category embedded expression 816 corresponding to the role category 814 can be obtained, and at 818, the genre 812 can be concatenated with the role category embedded expression 816 to obtain a concatenated expression. At 820, a corresponding implicit expression can be generated based on the concatenated expressions. The implicit expression generation at 820 may be similar to the implicit expression generation at 620 in fig. 6.
Phoneme features 832 may be extracted by performing a front end analysis (not shown) on the text 802. Encoding of the phoneme features 832 may be performed using the phoneme encoder 830. The phoneme coder 830 may be similar to the coder 640 of fig. 6, except that only phoneme features are taken as input.
Context information 842 may be extracted from text 802 and context information 842 encoded using context encoder 840. Context information 842 may correspond to, for example, context 222 in fig. 2, or various information further extracted from context 222 that is applicable to context encoder 840. The context encoder 840 may be any known context encoder that may be used in a TTS model. As an example, one example of a context encoder 840 is given in fig. 9. Fig. 9 shows an exemplary implementation of a context encoder 900 in a context-based TTS model. The context encoder 900 may correspond to the context encoder 840 in fig. 8. The context encoder 900 may perform encoding on context information 902, which context information 902 may correspond to context information 842 in fig. 8. The context encoder 900 may include a word encoder 910 for performing encoding on a current text, such as the text 802, to obtain a current semantic feature. The word encoder 910 may include, for example, an embedding layer to generate a word embedding sequence for a sequence of words in a current text, an upsampling layer to upsample the word embedding sequence to align with a sequence of phonemes for the current text, an encoding layer to encode the upsampled word embedding sequence into current semantic features through, for example, a convolutional layer, a BLSTM layer, or the like. Historical text, future text, paragraph text, etc. may be extracted from the context information 902. The historical text may include one or more sentences located before the current text, the future text may include one or more sentences located after the current text, and the paragraph text may include all sentences in the paragraph in which the current text is located. The context encoder may include a history and future encoder 920 for performing encoding on the history text, the future text, and the paragraph text to obtain history semantic features, future semantic features, and paragraph semantic features, respectively. The historical and future encoder 920 may include, for example, an embedding layer to generate word embedding sequences for word sequences in the input text, an upsampling layer to upsample the word embedding sequences to align with phoneme sequences of the current text, a dense layer to produce compressed expressions for the upsampled word embedding sequences, an encoding layer to encode the compressed word embedding sequences into semantic features corresponding to the input text, and so on. It should be understood that although a single historical and future encoder is shown in fig. 9, separate encoders may be provided for the historical text, the future text, and the paragraph text, respectively, to generate the respective semantic features independently of each other. Current semantic features, historical semantic features, future semantic features, paragraph semantic features, etc. may be overlaid at 930 to output contextual features 904.
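A condensed sketch of the Fig. 9 context encoder is given below: each text stream (current, history, future, paragraph) is embedded, upsampled to the phoneme-sequence length, encoded, and the resulting semantic features are superimposed into the context features. The vocabulary size, dimensions, nearest-neighbor upsampling, and the reuse of one history-and-future encoder for three streams are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 10_000, 128   # assumed word vocabulary and feature size

class StreamEncoder(nn.Module):
    """Embed a word sequence, upsample it to the phoneme-sequence length, and encode it."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.blstm = nn.LSTM(DIM, DIM // 2, batch_first=True, bidirectional=True)

    def forward(self, word_ids, phoneme_len):
        emb = self.embed(word_ids)                                   # (B, T_words, DIM)
        up = F.interpolate(emb.transpose(1, 2), size=phoneme_len,
                           mode="nearest").transpose(1, 2)           # align with the phoneme sequence
        out, _ = self.blstm(up)
        return out                                                   # (B, phoneme_len, DIM)

class ContextEncoder(nn.Module):
    """Superimpose current, historical, future, and paragraph semantic features (cf. 930)."""
    def __init__(self):
        super().__init__()
        self.word_encoder = StreamEncoder()                 # current text (910)
        self.history_future_encoder = StreamEncoder()       # history / future / paragraph text (920)

    def forward(self, current, history, future, paragraph, phoneme_len):
        return (self.word_encoder(current, phoneme_len)
                + self.history_future_encoder(history, phoneme_len)
                + self.history_future_encoder(future, phoneme_len)
                + self.history_future_encoder(paragraph, phoneme_len))   # context features (904)

ctx_enc = ContextEncoder()
ids = lambda n: torch.randint(0, VOCAB, (1, n))
print(ctx_enc(ids(8), ids(20), ids(15), ids(60), phoneme_len=42).shape)  # torch.Size([1, 42, 128])
```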
According to process 800, at 852, the output of the phoneme encoder 830, the output of the context encoder 840, and the implicit expression obtained at 820 can be superimposed. In addition, the output of the context encoder 840 may also be provided to the attention module 844.
The context-based TTS model in FIG. 8 may be used for single speaker audio content generation. Thus, the TTS model may be trained for the target speaker 804. A speaker-embedded representation 806 corresponding to the target speaker 804 may be obtained for influencing the TTS model for speech synthesis with the voice of the target speaker 804. At 854, the superimposed output at 852 can be concatenated with speaker embedded representation 806, and the concatenated output can be provided to attention module 860.
At 870, the output of attention module 844 may be concatenated with the output of attention module 860 to affect the generation of acoustic features at decoder 880. Vocoder 890 may generate speech waveform 808 corresponding to text 802 based on the acoustic features.
The model training may be performed in fig. 8 using at least training data obtained by, for example, process 400 of fig. 4. It should be appreciated that alternatively, in the actual application phase, the input of speaker-embedded expressions may be omitted, as the TTS model has been trained to synthesize speech based on the voice of the target speaker. Further, it should be appreciated that any of the components and processes in implementation 800 are exemplary and that implementation 800 may be modified in any manner depending on the particular needs and design.
Fig. 10 illustrates an exemplary process 1000 of predicting roles and selecting a TTS model, according to an embodiment. Process 1000 may be performed in the context of multi-speaker audio content generation, which is an exemplary implementation of at least a portion of process 300 in FIG. 3. Process 1000 may be used to determine a particular character corresponding to text and select a TTS model trained based on the voice of the speaker corresponding to the character.
The text 1002 is from text content 1004. At 1010, a context corresponding to the text 1002 can be constructed. At 1020, an embedded representation of the context may be generated by, for example, a pre-trained language model.
At 1030, a plurality of candidate characters can be extracted from the textual content 1004. Assuming that the textual content 1004 is a textual storybook, all characters involved in the textual storybook can be extracted at 1030 to form a list of candidate characters. Candidate role extraction at 1030 can be performed by any known technique.
At 1040, context-based candidate feature extraction may be performed. For example, for current text 1002, one or more candidate features may be extracted from the context for each candidate character. Assuming a total of N candidate roles, N candidate feature vectors may be obtained at 1040, where each candidate feature vector includes candidate features extracted for one candidate role. Various types of features may be extracted at 1040. In one implementation, the extracted features may include the number of words between the current text and the candidate character's name. Since the name of a character typically occurs near the utterance of that character, this feature helps determine whether the current text was spoken by a certain candidate character. In one implementation, the extracted features may include the number of times the candidate character occurs in the context. This feature may reflect the relative importance of a particular candidate character in the textual content. In one implementation, the extracted features may include a binary feature indicating whether the name of the candidate character appears in the current text. In general, a character is unlikely to mention its own name in a spoken utterance. In one implementation, the extracted features may include binary features indicating whether the name of the candidate character appears in the closest preceding text or in the closest following text. Since a speaker-alternating pattern is often employed in conversations between two characters, it is highly likely that the name of the character speaking the current text will appear in the closest preceding text or the closest following text. It should be understood that the extracted features may also include any other features that are useful in determining, for example, a character corresponding to text 1002.
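These candidate features can be computed with simple string operations. The sketch below is a hypothetical extraction of the four feature types for one candidate character; it assumes whitespace tokenization and that the context is given as an ordered list of sentences together with the index of the current text.

```python
from typing import List

def candidate_features(sentences: List[str], current_idx: int, name: str) -> List[float]:
    """Feature vector for one candidate character with respect to the current text."""
    current = sentences[current_idx]
    joined = " ".join(sentences)

    # 1) word distance between the current text and the nearest mention of the candidate's name
    words = joined.split()
    cur_start = len(" ".join(sentences[:current_idx]).split())
    mentions = [i for i, w in enumerate(words) if name in w]
    distance = min((abs(i - cur_start) for i in mentions), default=len(words))

    # 2) number of times the candidate occurs in the context
    occurrences = joined.count(name)

    # 3) whether the name appears in the current text (a character rarely says its own name)
    in_current = float(name in current)

    # 4) whether the name appears in the closest preceding or following text (speaker alternation)
    neighbors = sentences[max(0, current_idx - 1):current_idx] + sentences[current_idx + 1:current_idx + 2]
    in_neighbor = float(any(name in s for s in neighbors))

    return [float(distance), float(occurrences), in_current, in_neighbor]

story = ["Mike opened the door.", '"Who is there?"', '"It is me," Mary said.']
print(candidate_features(story, current_idx=1, name="Mike"))
print(candidate_features(story, current_idx=1, name="Mary"))
```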
At 1050, the context-embedded expression generated at 1020 may be combined with all of the candidate feature vectors extracted at 1040 to form a candidate feature matrix corresponding to all of the candidate roles.
According to process 1000, a role 1062 corresponding to text 1002 may be determined from a plurality of candidate roles based at least on context through a learning-to-rank (LTR) model 1060. For example, the LTR model 1060 may rank the plurality of candidate characters based on a candidate feature matrix obtained from the context, and determine the highest ranked candidate character as the character 1062 corresponding to the text 1002. The LTR model 1060 can be constructed using various techniques, such as ranking Support Vector Machines (ranking SVM), RankNet, ordinal classification, and the like. It should be understood that the LTR model 1060 may be considered herein as a prediction model for predicting a character based on context, or more broadly, the combination of the LTR model 1060 and steps 1010, 1020, 1030, 1040, and 1050 may be considered as a prediction model for predicting a character based on context.
According to process 1000, a character personality 1072 of the character corresponding to text 1002 may optionally be predicted based at least on context by personality prediction model 1070. For example, the personality prediction model 1070 may predict the character personality 1072 based on a candidate feature matrix obtained from the context. The personality prediction model 1070 may be constructed based on a process similar to process 500 of fig. 5, except that it is trained for the character personality classification task using training data pairs of text and character personality.
In accordance with process 1000, a TTS model 1090 to be used may be selected at 1080 from a library of pre-prepared candidate TTS models 1082. The candidate TTS model library 1082 may include a plurality of candidate TTS models pre-trained for different candidate speakers. Each candidate speaker may have attributes in terms of at least one of a role category, a role personality, a role, and the like. At 1080, a candidate speaker corresponding to text 1002 may be determined with at least one of role 1062, role personality 1072, and role category 1006, and a TTS model 1090 corresponding to the determined candidate speaker may be selected accordingly. Role categories 1006 can be determined by, for example, process 500 of fig. 5.
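As a toy illustration of the selection at 1080, each candidate TTS model can be tagged with its speaker's attributes, and the model whose attributes best overlap the predicted role, role personality, and role category is chosen. The attribute sets and the overlap-count scoring rule below are assumptions.

```python
from typing import Dict, Set

# Hypothetical candidate TTS model library: model id -> speaker attributes
CANDIDATE_TTS_MODELS: Dict[str, Set[str]] = {
    "tts_speaker_1": {"old", "female", "evil", "witch"},
    "tts_speaker_2": {"middle-aged", "male", "cheerful"},
    "tts_speaker_3": {"young", "male", "cheerful"},
}

def select_tts_model(role: str, role_personality: str, role_category: Set[str]) -> str:
    """Pick the candidate model whose speaker attributes best match the predicted reference factors."""
    wanted = {role, role_personality} | role_category
    return max(CANDIDATE_TTS_MODELS,
               key=lambda model: len(CANDIDATE_TTS_MODELS[model] & wanted))

print(select_tts_model(role="witch", role_personality="evil", role_category={"old", "female"}))
# -> tts_speaker_1
```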
It should be understood that any of the steps and processes in process 1000 are exemplary and that process 1000 may be modified in any manner depending on the particular needs and design.
FIG. 11 illustrates an exemplary implementation 1100 for speech synthesis using a TTS model based on language features, according to an embodiment. This implementation 1100 may be applied to a scenario of multi-speaker audio content generation, which may be considered an exemplary specific implementation of the process 300 of FIG. 3. Assume that in fig. 11 it is desired to generate speech waveforms for text 1102.
A character personality 1112 corresponding to text 1102 may be predicted by prediction model 1110. The prediction model 1110 may correspond to, for example, the personality prediction model 1070 in fig. 10. Character 1122 corresponding to text 1102 may be predicted by predictive model 1120. Predictive model 1120 may correspond to a predictive model for predicting a character, such as described above in connection with fig. 10. The character category 1132 and style 1134 corresponding to the text 1102 may be predicted by the prediction model 1130. The predictive model 1130 may be constructed based on, for example, the process 500 of FIG. 5. At 1136, an implicit expression corresponding to the style 1134 may be generated. At 1140, front end analysis may be performed on the text 1102 to extract phoneme features 1142 and prosodic features 1144. The phoneme features 1142 and prosody features 1144 may be encoded using the encoder 1150. At 1152, the output of the encoder 1150 and the implicit expression obtained at 1136 can be superimposed.
The language feature based TTS model in FIG. 11 may be used for multi-speaker audio content generation. Candidate speaker 1104 corresponding to character 1122 may be determined based on at least one of character 1122, character personality 1112, and character category 1132 in a manner similar to process 1000 of FIG. 10. The TTS model may be trained for candidate speakers 1104. The speaker-embedded expressions 1106 corresponding to the candidate speakers 1104 may be obtained, for example, by speaker-embedded LUTs, and the speaker-embedded expressions 1106 may be used to influence the TTS model for speech synthesis with the voice of the candidate speakers 1104. At 1154, the superimposed output at 1152 may be concatenated with the speaker-embedded expression 1106, and the concatenated output may be provided to an attention module 1160.
The decoder 1170 may generate acoustic features under the influence of the attention module 1160. The vocoder 1180 may generate the speech waveform 1108 corresponding to the text 1102 based on the acoustic features.
The implementation 1100 in FIG. 11 is intended to illustrate an exemplary architecture for speech synthesis using a TTS model based on linguistic features. A plurality of candidate TTS models can be obtained by constructing corresponding TTS models for different candidate speakers. In a practical application stage, a candidate speaker corresponding to a character may be determined based on at least one of the character, character personality and character category, and a TTS model trained for the candidate speaker may be selected for generating a speech waveform. Further, it should be appreciated that any of the components and processes in implementation 1100 are exemplary and that implementation 1100 may be altered in any manner depending on the particular needs and design.
FIG. 12 illustrates an exemplary implementation 1200 of speech synthesis using a context-based TTS model according to an embodiment. This implementation 1200 may be applied to a scenario of multi-speaker audio content generation, which may be considered an exemplary specific implementation of the process 300 of fig. 3. Implementation 1200 is similar to implementation 1100 in FIG. 11, except that the TTS model employs context coding. Assume that in fig. 12 it is desired to generate a speech waveform for text 1202.
A character personality 1212 corresponding to text 1202 may be predicted by prediction model 1210. The prediction model 1210 may be similar to the prediction model 1110 in fig. 11. The character 1222 corresponding to the text 1202 may be predicted by a prediction model 1220. Prediction model 1220 may be similar to prediction model 1120 in fig. 11. Role categories 1232 and styles 1234 corresponding to text 1202 may be predicted by predictive model 1230. The prediction model 1230 may be similar to the prediction model 1130 in FIG. 11. At 1236, an implicit expression corresponding to the style 1234 can be generated.
Phoneme features 1242 may be extracted by performing a front end analysis (not shown) on the text 1202. Encoding may be performed on the phoneme features 1242 using the phoneme encoder 1240. The phoneme encoder 1240 may be similar to the phoneme encoder in fig. 8. Context information 1252 may be extracted from text 1202 and context information 1252 may be encoded using context encoder 1250. The context encoder 1250 may be similar to the context encoder 840 in fig. 8.
At 1262, the output of the phoneme encoder 1240, the output of the context encoder 1250, and the implicit expression obtained at 1236 may be superimposed. The output of the context encoder 1250 may also be provided to an attention module 1254.
The context-based TTS model in FIG. 12 may be used for multi-speaker audio content generation. Candidate speakers 1204 corresponding to role 1222 may be determined based on at least one of role 1222, role personality 1212, and role category 1232 in a manner similar to process 1000 of fig. 10. The TTS model may be trained for the candidate speakers 1204. Speaker-embedded expressions 1206 corresponding to the speaker candidates 1204 may be used to influence the TTS model for speech synthesis with the voices of the speaker candidates 1204. At 1264, the superimposed output at 1262 may be concatenated with the speaker-embedded representation 1206, and the concatenated output may be provided to an attention module 1270.
At 1272, the output of the attention module 1254 may be concatenated with the output of the attention module 1270 to influence the generation of acoustic features at a decoder 1280. A vocoder 1290 may generate a speech waveform 1208 corresponding to the text 1202 based on the acoustic features.
The implementation 1200 in FIG. 12 is intended to illustrate an exemplary architecture for speech synthesis employing a context-based TTS model. A plurality of candidate TTS models may be obtained by constructing a corresponding TTS model for each different candidate speaker. In a practical application stage, a candidate speaker corresponding to a role may be determined based on at least one of the role, the role personality, and the role category, and the TTS model trained for that candidate speaker may be selected for generating the speech waveform. Further, it should be appreciated that any of the components and processes in implementation 1200 are exemplary, and implementation 1200 may be modified in any manner depending on the particular needs and design.
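To make the data flow of implementation 1200 more concrete, the following Python sketch (using PyTorch) wires up the superimposition at 1262 and the concatenation at 1264 in a minimal form. It is an illustrative assumption rather than the actual TTS model of this disclosure: the attention modules 1254 and 1270 are replaced by simple mean pooling, alignment and the vocoder 1290 are omitted, and all module types and dimensions are arbitrary.

# Minimal sketch of the context-based, multi-speaker data flow of FIG. 12.
# Layer choices, sizes and the pooling shortcut are assumptions, not the model of the disclosure.
import torch
import torch.nn as nn

class ContextTTSSketch(nn.Module):
    def __init__(self, n_phonemes=80, n_ctx_tokens=10000, n_styles=8,
                 n_speakers=16, d_model=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.phoneme_enc = nn.GRU(d_model, d_model, batch_first=True)   # phoneme encoder 1240
        self.context_emb = nn.Embedding(n_ctx_tokens, d_model)
        self.context_enc = nn.GRU(d_model, d_model, batch_first=True)   # context encoder 1250
        self.style_emb = nn.Embedding(n_styles, d_model)                # implicit style expression (1236)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)            # speaker embedding 1206
        self.decoder = nn.GRU(2 * d_model, d_model, batch_first=True)   # stands in for decoder 1280
        self.to_mel = nn.Linear(d_model, n_mels)                        # acoustic features for a vocoder

    def forward(self, phonemes, context_tokens, style_id, speaker_id):
        ph, _ = self.phoneme_enc(self.phoneme_emb(phonemes))            # (B, T, d)
        ctx, _ = self.context_enc(self.context_emb(context_tokens))     # (B, Tc, d)
        ctx_summary = ctx.mean(dim=1, keepdim=True)                     # crude stand-in for attention 1254
        style = self.style_emb(style_id).unsqueeze(1)                   # (B, 1, d)
        superimposed = ph + ctx_summary + style                         # superimposition at 1262
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand_as(ph)   # (B, T, d)
        combined = torch.cat([superimposed, spk], dim=-1)               # concatenation at 1264
        dec, _ = self.decoder(combined)                                 # decoding (attention 1270 omitted)
        return self.to_mel(dec)                                         # would be fed to a vocoder

mel = ContextTTSSketch()(torch.randint(0, 80, (1, 12)),
                         torch.randint(0, 10000, (1, 30)),
                         torch.tensor([2]), torch.tensor([5]))
print(mel.shape)  # torch.Size([1, 12, 80])

In a full implementation, the decoder would attend over both encoder outputs via the attention modules 1254 and 1270, and the resulting acoustic features would be passed to a vocoder such as 1290 to produce the speech waveform 1208.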
According to embodiments of the present disclosure, audio content may also be customized. For example, a speech waveform in the generated audio content may be adjusted to update the audio content.
FIG. 13 illustrates an exemplary process 1300 of updating audio content according to an embodiment.
Assume that a user provides textual content 1302 and wants to obtain audio content corresponding to the textual content 1302. Audio content 1304 corresponding to the textual content 1302 may be created by performing audio content generation at 1310. The audio content generation at 1310 may be based on any implementation of automatic audio content generation according to the embodiments of the present disclosure described above in connection with FIGS. 2-12.
The audio content 1304 may be provided to a customization platform 1320. The customization platform 1320 may include a user interface for interacting with the user. Through the user interface, the audio content 1304 may be provided and presented to the user, and an adjustment indication 1306 for at least a portion of the audio content may be received. For example, if the user is not satisfied with a certain utterance in the audio content 1304, or wants to modify the utterance to have a desired role category, a desired style, etc., the user may enter an adjustment indication 1306 through the user interface.
The adjustment indication 1306 may include modifications or settings to various parameters involved in speech synthesis. In one implementation, the adjustment indication may include adjustment information regarding prosodic information. The prosodic information may include, for example, at least one of pauses, accents, pitch, and rate. For example, the user may specify a pause before or after a certain word, specify an accent for a certain utterance, change the pitch of a certain word, adjust the rate of a certain utterance, and so on. In one implementation, the adjustment indication may include adjustment information regarding pronunciation. For example, the user may specify the correct pronunciation that a certain polyphonic character, i.e., a character with multiple possible pronunciations, should have in the current audio content. In one implementation, the adjustment indication may include adjustment information regarding the role category. For example, the user may specify a desired role category of "elderly man" for an utterance having a "middle-aged man" timbre. In one implementation, the adjustment indication may include adjustment information regarding the style. For example, the user may specify a desired "happy" emotion for an utterance having a "sad" emotion. In one implementation, the adjustment indication may include adjustment information regarding acoustic parameters. For example, the user may specify particular acoustic parameters for a certain utterance. It should be understood that the above lists only a few examples of the adjustment indication 1306, and the adjustment indication 1306 may also include modifications or settings to any other parameter that can affect speech synthesis.
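For illustration only, an adjustment indication 1306 can be pictured as a small data structure. The field names and value types in the Python sketch below are assumptions made for this example and are not a format prescribed by the disclosure.

# Illustrative container for an adjustment indication 1306; all field names are assumed.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class AdjustmentIndication:
    utterance_id: str                                       # which speech waveform in the audio content to adjust
    prosody: Dict[str, str] = field(default_factory=dict)   # e.g. {"pause_after": "word_3", "rate": "slow"}
    pronunciation: Optional[str] = None                     # e.g. the intended reading of a polyphonic character
    role_category: Optional[str] = None                     # e.g. "elderly man" instead of "middle-aged man"
    style: Optional[str] = None                             # e.g. "happy" instead of "sad"
    acoustic_params: Dict[str, float] = field(default_factory=dict)  # e.g. {"pitch_shift": 1.5}

adj = AdjustmentIndication(utterance_id="utt-0042", role_category="elderly man", style="happy")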
In accordance with the process 1300, in response to the adjustment indication 1306, the customization platform 1320 may invoke a TTS model 1330 to regenerate the speech waveform. Assuming that the adjustment indication 1306 is directed to a certain utterance, or corresponding speech waveform, in the audio content 1304, the text corresponding to that speech waveform may be provided to the TTS model 1330 along with the adjustment information in the adjustment indication. The TTS model 1330 may in turn regenerate a speech waveform 1332 for the text, conditioned on the adjustment information. Taking as an example the case where the adjustment indication 1306 includes adjustment information regarding the role category, the role category specified in the adjustment indication 1306 may be used in place of, e.g., the role category determined in FIG. 2, and the speech waveform may then be regenerated by the TTS model. In one implementation, the TTS model 1330 may be a language feature-based TTS model, since the language feature-based TTS model has explicit feature inputs that can be controlled by parameters corresponding to the adjustment indication.
The previous speech waveform in audio content 1304 may be replaced with the regenerated speech waveform 1332 to form updated audio content 1308.
Process 1300 may be performed iteratively, thereby enabling continuous adjustment and optimization of the generated audio content. It should be appreciated that any of the steps and processes in process 1300 are exemplary and that process 1300 may be varied in any manner depending on the particular needs and design.
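The flow of process 1300 can also be sketched as a small update loop. The dictionary-based representation of the audio content and the synthesize callable below are hypothetical stand-ins for the customization platform 1320 and the TTS model 1330, and the sketch reuses the AdjustmentIndication structure assumed above.

# Sketch of the update loop in process 1300; the data layout and helper names are assumptions.
def update_audio_content(audio_content, adjustment, synthesize):
    """audio_content: {utterance_id: {"text": ..., "params": {...}, "waveform": ...}};
    synthesize: callable(text, **params) -> waveform, standing in for TTS model 1330."""
    utt = audio_content[adjustment.utterance_id]
    params = dict(utt["params"])                         # start from the previously used parameters
    for key in ("role_category", "style", "pronunciation"):
        value = getattr(adjustment, key)
        if value is not None:                            # a user override replaces the previous value
            params[key] = value
    params.update(adjustment.prosody)                    # pauses, accents, pitch, rate
    params.update(adjustment.acoustic_params)
    utt["waveform"] = synthesize(utt["text"], **params)  # regenerated speech waveform 1332
    utt["params"] = params
    return audio_content                                 # updated audio content 1308

content = {"utt-0042": {"text": '"Not now," she whispered.',
                        "params": {"role_category": "middle-aged man", "style": "sad"},
                        "waveform": b""}}
content = update_audio_content(content,
                               AdjustmentIndication("utt-0042", role_category="elderly man", style="happy"),
                               synthesize=lambda text, **params: b"\x00")

Because the function returns the whole audio content, it can be called repeatedly as new adjustment indications 1306 arrive, matching the iterative nature of process 1300 noted above.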
FIG. 14 illustrates a flow of an exemplary method 1400 for automatic audio content generation, according to an embodiment.
At 1410, text may be obtained.
At 1420, a context corresponding to the text can be constructed.
At 1430, reference factors can be determined based at least on the context, the reference factors including at least a role category and/or role corresponding to the text.
At 1440, a speech waveform corresponding to the text can be generated based at least on the text and the reference factors.
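Read as a pipeline, steps 1410 to 1440 can be sketched in a few lines of Python. Every component passed into the function below is a hypothetical stand-in used only to make the data flow explicit; none of the names are defined by the disclosure.

# End-to-end sketch of method 1400 with placeholder components.
def generate_audio(text, build_context, predict_reference_factors, tts):
    context = build_context(text)                        # 1420: construct a context for the text
    factors = predict_reference_factors(text, context)   # 1430: role category and/or role (and possibly style)
    return tts(text, context=context, **factors)         # 1440: generate the speech waveform

waveform = generate_audio(
    '"Not now," she whispered.',                         # 1410: the obtained text
    build_context=lambda t: {"previous": "...", "next": "..."},
    predict_reference_factors=lambda t, c: {"role": "Alice", "role_category": "young woman"},
    tts=lambda t, context, **factors: b"\x00\x00",       # placeholder waveform bytes
)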
In one implementation, the reference factor may further include a style corresponding to the text.
In one implementation, the determining the reference factor may include: predicting, by a prediction model, the role category based at least on the context.
The generating of the speech waveform may include: generating the speech waveform based on at least the text and the role category through a TTS model based on language features. The TTS model based on language features may be pre-trained for a target speaker.
The generating of the speech waveform may include: generating, by a context-based TTS model, the speech waveform based on at least the text, the context, and the role category. The context-based TTS model may be pre-trained for a target speaker.
In one implementation, the determining the reference factor may include: extracting a plurality of candidate roles from text content including the text; and determining, by an LTR model, the role from the plurality of candidate roles based at least on the context.
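As a toy illustration of the LTR-based determination, the sketch below scores each candidate role against the context and keeps the best-scoring one. The hand-written scoring function is only a stand-in; an actual LTR model would be trained on labeled (context, role) pairs.

# Illustrative learning-to-rank style selection of the role for an utterance.
def rank_candidate_roles(candidate_roles, context, score):
    """Return the candidate roles sorted by relevance to the context, best first."""
    return sorted(candidate_roles, key=lambda role: score(role, context), reverse=True)

def toy_score(role, context):
    # Toy features: how often the role is mentioned, and whether its last mention follows the quotation.
    return context.count(role) + (2.0 if context.rfind(role) > context.rfind('"') else 0.0)

context = 'Alice turned to Bob. "Not now," she whispered. Alice closed the door.'
best_role = rank_candidate_roles(["Alice", "Bob"], context, toy_score)[0]
print(best_role)  # Alice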
In one implementation, the generating the speech waveform may include: selecting a TTS model corresponding to the role from a plurality of candidate TTS models trained in advance, wherein the plurality of candidate TTS models are respectively trained in advance for different speakers; and generating the speech waveform through the selected TTS model.
The determining the reference factor may include: predicting, by a first prediction model, the role category based at least on the context; predicting, by a second prediction model, the role based at least on the context; and predicting, by a third prediction model, a role personality based at least on the context. The selecting a TTS model may include: selecting the TTS model from the plurality of candidate TTS models based on at least one of the role, the role category, and the role personality.
The selected TTS model may be a language feature-based TTS model, and the generating the speech waveform may include: generating, by the language feature-based TTS model, the speech waveform based at least on the text.
The selected TTS model may be a context-based TTS model, and the generating the speech waveform may include: generating, by the context-based TTS model, the speech waveform based on at least the text and the context.
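The selection among the pre-trained candidate TTS models can be pictured as a lookup keyed by the predicted attributes. The speaker registry format and the overlap-based matching rule in the sketch below are assumptions made for illustration, not part of the disclosed method.

# Illustrative selection of a pre-trained candidate TTS model for a role.
def select_tts_model(role, role_category, role_personality, speaker_registry):
    """speaker_registry: {speaker_name: {"attributes": set_of_tags, "model": tts_callable}}."""
    wanted = {role, role_category, role_personality}
    best_speaker = max(speaker_registry,
                       key=lambda name: len(wanted & speaker_registry[name]["attributes"]))
    return speaker_registry[best_speaker]["model"]

registry = {
    "speaker_a": {"attributes": {"young woman", "lively"}, "model": lambda text, **kw: b"..."},
    "speaker_b": {"attributes": {"elderly man", "calm"}, "model": lambda text, **kw: b"..."},
}
model = select_tts_model("Alice", "young woman", "lively", registry)
waveform = model('"Not now," she whispered.')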
In one implementation, the speech waveform may be generated further based on a style corresponding to the text.
In one implementation, the method 1400 may further include: receiving an adjustment indication for the speech waveform; and in response to the adjustment indication, regenerating a speech waveform corresponding to the text through a TTS model based on language features.
The adjustment indication may comprise at least one of: adjustment information regarding prosodic information, the prosodic information including at least one of pauses, accents, pitch, and rate; adjustment information regarding pronunciation; adjustment information regarding the role category; adjustment information regarding the style; and adjustment information regarding acoustic parameters.
It should be understood that method 1400 may also include any of the steps/processes for automatic audio content generation according to embodiments of the present disclosure described above.
FIG. 15 illustrates an exemplary apparatus 1500 for automatic audio content generation, according to an embodiment.
The apparatus 1500 may include: a text obtaining module 1510 for obtaining a text; a context construction module 1520 for constructing a context corresponding to the text; a reference factor determination module 1530 for determining reference factors based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text; and a speech waveform generation module 1540 for generating a speech waveform corresponding to the text based on at least the text and the reference factors.
In one implementation, the reference factor determination module 1530 may be configured to: predict, by a prediction model, the role category based at least on the context.
The speech waveform generation module 1540 may be configured to: generate the speech waveform based on at least the text and the role category through a TTS model based on language features. The TTS model based on language features may be pre-trained for a target speaker.
The speech waveform generation module 1540 may be configured to: generate, by a context-based TTS model, the speech waveform based on at least the text, the context, and the role category. The context-based TTS model may be pre-trained for a target speaker.
In one implementation, the reference factor determination module 1530 may be configured to: extract a plurality of candidate roles from text content including the text; and determine, by an LTR model, the role from the plurality of candidate roles based at least on the context.
In one implementation, the speech waveform generation module 1540 may be configured to: select a TTS model corresponding to the role from a plurality of candidate TTS models trained in advance, wherein the plurality of candidate TTS models are trained in advance for different speakers, respectively; and generate the speech waveform through the selected TTS model.
Furthermore, the apparatus 1500 may also include any other modules that perform the steps of the method for automatic audio content generation according to embodiments of the present disclosure described above.
FIG. 16 illustrates an exemplary apparatus 1600 for automatic audio content generation, according to an embodiment.
The apparatus 1600 may include: at least one processor 1610; and a memory 1620 storing computer-executable instructions that, when executed, cause the at least one processor 1610 to: obtain a text; construct a context corresponding to the text; determine reference factors based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text; and generate a speech waveform corresponding to the text based on at least the text and the reference factors. Further, the processor 1610 may also perform any other steps/processes of the method for automatic audio content generation according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in non-transitory computer readable media. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any of the operations of the method for automatic audio content generation according to embodiments of the present disclosure described above.
It should be understood that all operations in the methods described above are exemplary only, and the present disclosure is not limited to any operations in the methods or the order of the operations, but rather should encompass all other equivalent variations under the same or similar concepts.
It should also be understood that all of the modules in the above described apparatus may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided functionally into sub-modules or combined together.
The processor has been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and the overall design constraints imposed on the system. By way of example, the processor, any portion of the processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, Digital Signal Processor (DSP), Field Programmable Gate Array (FPGA), Programmable Logic Device (PLD), state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software should be viewed broadly as representing instructions, instruction sets, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer readable medium. The computer readable medium may include, for example, memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, a Random Access Memory (RAM), a Read Only Memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown as being separate from the processor in aspects presented in this disclosure, the memory may be located internal to the processor (e.g., a cache or a register).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described herein that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.

Claims (20)

1. A method for automatic audio content generation, comprising:
obtaining a text;
constructing a context corresponding to the text;
determining reference factors based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text; and
generating a speech waveform corresponding to the text based on at least the text and the reference factors.
2. The method of claim 1, wherein,
the reference factors further include a style corresponding to the text.
3. The method of claim 1, wherein the determining a reference factor comprises:
predicting, by a prediction model, the role category based at least on the context.
4. The method of claim 3, wherein the generating a speech waveform comprises:
generating the speech waveform based on at least the text and the role category through a text-to-speech (TTS) model based on language features,
wherein the TTS model based on language features is pre-trained for a target speaker.
5. The method of claim 3, wherein the generating a speech waveform comprises:
generating, by a context-based text-to-speech (TTS) model, the speech waveform based on at least the text, the context, and the role category,
wherein the context-based TTS model is pre-trained for a target speaker.
6. The method of claim 1, wherein the determining a reference factor comprises:
extracting a plurality of candidate roles from text content including the text; and
determining the role from the plurality of candidate roles based at least on the context through a learning-to-rank (LTR) model.
7. The method of claim 1, wherein the generating a speech waveform comprises:
selecting a text-to-speech (TTS) model corresponding to the role from a plurality of candidate TTS models trained in advance, the plurality of candidate TTS models being pre-trained for different speakers, respectively; and
generating the speech waveform through the selected TTS model.
8. The method of claim 7, wherein the determining a reference factor comprises:
predicting, by a first prediction model, the role category based at least on the context;
predicting, by a second prediction model, the role based at least on the context; and
predicting, by a third prediction model, a role personality based at least on the context, and
wherein the selecting a TTS model comprises: selecting the TTS model from the plurality of candidate TTS models based on at least one of the role, the role category, and the role personality.
9. The method of claim 7, wherein the selected TTS model is a language feature-based TTS model, and the generating a speech waveform comprises:
generating, by the language feature based TTS model, the speech waveform based at least on the text.
10. The method of claim 7, wherein the selected TTS model is a context-based TTS model, and the generating a speech waveform comprises:
generating, by the context-based TTS model, the speech waveform based on at least the text and the context.
11. The method of any one of claims 4, 5, 9, and 10, wherein
the speech waveform is further generated based on a style corresponding to the text.
12. The method of claim 1, further comprising:
receiving an adjustment indication for the speech waveform; and
in response to the adjustment indication, regenerating a speech waveform corresponding to the text through a text-to-speech (TTS) model based on language features.
13. The method of claim 12, wherein the adjustment indication comprises at least one of:
adjustment information regarding prosodic information, the prosodic information including at least one of pauses, accents, pitch, and rate;
adjustment information regarding pronunciation;
adjustment information regarding the role category;
adjustment information regarding the style; and
adjustment information regarding acoustic parameters.
14. An apparatus for automatic audio content generation, comprising:
a text obtaining module for obtaining a text;
a context construction module for constructing a context corresponding to the text;
a reference factor determination module for determining reference factors based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text; and
a speech waveform generation module for generating a speech waveform corresponding to the text based at least on the text and the reference factors.
15. The apparatus of claim 14, wherein the reference factor determination module is configured to:
predict, by a prediction model, the role category based at least on the context.
16. The apparatus of claim 15, wherein the speech waveform generation module is configured to:
generate the speech waveform based on at least the text and the role category through a text-to-speech (TTS) model based on language features,
wherein the TTS model based on language features is pre-trained for a target speaker.
17. The apparatus of claim 15, wherein the speech waveform generation module is configured to:
generate, by a context-based text-to-speech (TTS) model, the speech waveform based on at least the text, the context, and the role category,
wherein the context-based TTS model is pre-trained for a target speaker.
18. The apparatus of claim 14, wherein the reference factor determination module is configured to:
extract a plurality of candidate roles from text content including the text; and
determine the role from the plurality of candidate roles based at least on the context through a learning-to-rank (LTR) model.
19. The apparatus of claim 14, wherein the speech waveform generation module is configured to:
select a text-to-speech (TTS) model corresponding to the role from a plurality of candidate TTS models trained in advance, the plurality of candidate TTS models being pre-trained for different speakers, respectively; and
generate the speech waveform through the selected TTS model.
20. An apparatus for automatic audio content generation, comprising:
at least one processor; and
a memory storing computer-executable instructions that, when executed, cause the at least one processor to:
obtain a text,
construct a context corresponding to the text,
determine reference factors based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text, and
generate a speech waveform corresponding to the text based on at least the text and the reference factors.
CN202010387249.8A 2020-05-09 2020-05-09 Automatic audio content generation Pending CN113628609A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010387249.8A CN113628609A (en) 2020-05-09 2020-05-09 Automatic audio content generation
PCT/US2021/028297 WO2021231050A1 (en) 2020-05-09 2021-04-21 Automatic audio content generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010387249.8A CN113628609A (en) 2020-05-09 2020-05-09 Automatic audio content generation

Publications (1)

Publication Number Publication Date
CN113628609A true CN113628609A (en) 2021-11-09

Family

ID=75870784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010387249.8A Pending CN113628609A (en) 2020-05-09 2020-05-09 Automatic audio content generation

Country Status (2)

Country Link
CN (1) CN113628609A (en)
WO (1) WO2021231050A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620699A (en) * 2022-12-19 2023-01-17 深圳元象信息科技有限公司 Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101156196A (en) * 2005-03-28 2008-04-02 莱塞克技术公司 Hybrid speech synthesizer, method and use
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Voice broadcasting method and system for text
CN110491365A (en) * 2018-05-10 2019-11-22 微软技术许可有限责任公司 Audio is generated for plain text document
CN110634336A (en) * 2019-08-22 2019-12-31 北京达佳互联信息技术有限公司 Method and device for generating audio electronic book
CN111105776A (en) * 2018-10-26 2020-05-05 财团法人资讯工业策进会 Audio playing device and playing method thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9183831B2 (en) * 2014-03-27 2015-11-10 International Business Machines Corporation Text-to-speech for digital literature
US10607595B2 (en) * 2017-08-07 2020-03-31 Lenovo (Singapore) Pte. Ltd. Generating audio rendering from textual content based on character models
EP3824461B1 (en) * 2018-07-19 2022-08-31 Dolby International AB Method and system for creating object-based audio content

Also Published As

Publication number Publication date
WO2021231050A1 (en) 2021-11-18

Similar Documents

Publication Publication Date Title
JP7142333B2 (en) Multilingual Text-to-Speech Synthesis Method
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
AU2019395322B2 (en) Reconciliation between simulated data and speech recognition output using sequence-to-sequence mapping
US20230043916A1 (en) Text-to-speech processing using input voice characteristic data
CN108899009B (en) Chinese speech synthesis system based on phoneme
US11443733B2 (en) Contextual text-to-speech processing
JP2022107032A (en) Text-to-speech synthesis method using machine learning, device and computer-readable storage medium
US9368104B2 (en) System and method for synthesizing human speech using multiple speakers and context
CN111954903A (en) Multi-speaker neural text-to-speech synthesis
US11763797B2 (en) Text-to-speech (TTS) processing
KR20230043084A (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
KR102062524B1 (en) Voice recognition and translation method and, apparatus and server therefor
CN111681641B (en) Phrase-based end-to-end text-to-speech (TTS) synthesis
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN115101046A (en) Method and device for synthesizing voice of specific speaker
CN113628609A (en) Automatic audio content generation
Nitisaroj et al. The Lessac Technologies system for Blizzard Challenge 2010
CN117992169A (en) Plane design display method based on AIGC technology
CN115346512A (en) Multi-emotion voice synthesis method based on digital people
CN114267326A (en) Training method and device of voice synthesis system and voice synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination