CN113628609A - Automatic audio content generation - Google Patents
- Publication number
- CN113628609A (application number CN202010387249.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- context
- model
- tts
- role
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The present disclosure provides methods and apparatus for automatic audio content generation. Text may be obtained. A context corresponding to the text may be constructed. Reference factors can be determined based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text. A speech waveform corresponding to the text may be generated based at least on the text and the reference factors.
Description
Background
Text-to-speech (TTS) synthesis aims at generating corresponding speech waveforms based on text input. Conventional TTS models or systems may predict acoustic features based on text input and, in turn, generate speech waveforms based on the predicted acoustic features. The TTS model may be applied to convert various types of text contents into audio contents, for example, converting a book in a text format into an audio book (audiobook), and the like.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure propose methods and apparatuses for automatic audio content generation. Text may be obtained. A context corresponding to the text may be constructed. Reference factors can be determined based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text. A speech waveform corresponding to the text may be generated based at least on the text and the reference factors.
It should be noted that one or more of the above aspects include features that are specifically pointed out in the following detailed description and claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will hereinafter be described in conjunction with the appended drawings, which are provided to illustrate, but not to limit, the disclosed aspects.
Fig. 1 shows an exemplary conventional TTS model.
Fig. 2 illustrates an exemplary process of automatic audio content generation according to an embodiment.
Fig. 3 illustrates an exemplary process of automatic audio content generation according to an embodiment.
Fig. 4 shows an exemplary process of preparing training data according to an embodiment.
Fig. 5 illustrates an exemplary process of predicting role categories and styles, according to an embodiment.
Fig. 6 illustrates an exemplary implementation of speech synthesis using a TTS model based on language features, according to an embodiment.
Fig. 7 shows an exemplary implementation of an encoder in a TTS model based on language features according to an embodiment.
Fig. 8 illustrates an exemplary implementation of speech synthesis using a context-based TTS model, according to an embodiment.
Fig. 9 shows an exemplary implementation of a context encoder in a context-based TTS model.
Fig. 10 illustrates an exemplary process of predicting roles and selecting a TTS model according to an embodiment.
Fig. 11 illustrates an exemplary implementation of speech synthesis using a TTS model based on language features, according to an embodiment.
Fig. 12 illustrates an exemplary implementation of speech synthesis using a context-based TTS model, according to an embodiment.
Fig. 13 illustrates an exemplary process of updating audio content according to an embodiment.
Fig. 14 shows a flow of an exemplary method for automatic audio content generation, according to an embodiment.
Fig. 15 illustrates an exemplary apparatus for automatic audio content generation, according to an embodiment.
Fig. 16 illustrates an exemplary apparatus for automatic audio content generation, according to an embodiment.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary embodiments. It should be understood that the discussion of these embodiments is merely intended to enable those skilled in the art to better understand and thereby practice the embodiments of the present disclosure, and is not intended to suggest any limitation as to the scope of the present disclosure.
Audio books are increasingly used for entertainment and education. Traditional audio books are recorded manually. For example, a professional narrator or a voice actor reads text content prepared in advance, and the audio book corresponding to the text content is obtained by recording the narration. Recording an audio book in this way is very time-consuming and costly, making it impossible to produce corresponding audio books for a large number of text books in a timely manner.
TTS synthesis can improve the efficiency of audio book production and reduce costs. Most TTS models synthesize speech separately for each text sentence. Speech synthesized in this way typically has a single, flat prosody and thus sounds tedious. When such single-prosody speech is applied throughout an entire audio book, the quality of the audio book is significantly reduced. In particular, if only a single speaker's voice is used in TTS synthesis for the entire audio book, the monotonous vocal presentation further reduces the appeal of the audio book.
Embodiments of the present disclosure propose performing automatic and high quality audio content generation for textual content. In this context, text content may broadly refer to any content in text form, such as books, scripts, articles, etc., while audio content may broadly refer to any content in audio form, such as audio books, dubbing of videos, news broadcasts, etc. Although the conversion of a text storybook into an audio book is exemplified in various portions of the following discussion, it should be understood that embodiments of the present disclosure can also be applied to the conversion of any other form of text content into any other form of audio content.
Embodiments of the present disclosure may build a context for a text sentence in the text content and use that context in TTS synthesis of the sentence, rather than considering only the text sentence itself. Generally, the context of a text sentence provides rich expression information about the sentence, which can serve as reference factors for TTS synthesis, making the synthesized speech more expressive, vivid, and diverse. Various reference factors corresponding to the text sentence, such as a character, a character category, a style, and a character personality, may be determined based on the context. In this context, a character may refer to a specific person, anthropomorphic animal, anthropomorphic object, etc. having conversation capability that appears in the text content. For example, assuming that the textual content relates to a story between two people named "Mike" and "Mary," then "Mike" and "Mary" are two characters in the textual content. Similarly, assuming that the textual content relates to a story among a queen, a princess, and a witch, then "queen," "princess," and "witch" may be considered characters in the textual content. The character category may refer to a category attribute of the character, such as gender, age, and the like. Style may refer to the emotion type, e.g., happy, sad, etc., with which the text sentence is spoken. Character personality may refer to the personality modeled for a character in the text content, e.g., gentle, cheerful, nefarious, etc. By taking these reference factors into account in TTS synthesis, voices having different voice characteristics, e.g., different timbres and voice styles, can be synthesized for different characters, character categories, styles, and the like. Thus, the expressiveness, vividness, and diversity of the synthesized speech can be enhanced, thereby significantly improving the quality of the synthesized speech.
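For illustration, the reference factors discussed above can be bundled into a small data structure. The Python sketch below is hypothetical — the disclosure does not prescribe any particular representation, and all names and example values are invented:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReferenceFactors:
    """Hypothetical container for the reference factors of one text sentence."""
    character: Optional[str] = None                          # e.g. "Mike", "witch"
    character_category: List[str] = field(default_factory=list)  # e.g. ["young", "male"]
    style: Optional[str] = None                              # emotion type, e.g. "happy"
    personality: Optional[str] = None                        # e.g. "gentle", "nefarious"

# Example: factors a prediction model might infer from context for one utterance
factors = ReferenceFactors(character="witch",
                           character_category=["old", "female"],
                           style="angry",
                           personality="nefarious")
print(factors.character_category)  # → ['old', 'female']
```

A real system would populate such a structure from the prediction models described later, then pass it to the TTS model alongside the text.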
In one aspect, embodiments of the present disclosure may be applied to a scenario in which audio content is synthesized with the voice of a single speaker, which may also be referred to as single speaker audio content generation. The individual speaker may be a pre-designated target speaker, and the voice of the target speaker may be employed to simulate or play different types of roles in the textual content. The character categories may be considered in speech synthesis so that speech corresponding to different character categories, e.g., speech corresponding to young males, speech corresponding to older females, etc., may be generated using the target speaker's voice. Alternatively, styles may also be considered in speech synthesis so that different styles of speech may be generated using the target speaker's voice, e.g., speech corresponding to the emotion type "happy," speech corresponding to the emotion type "sad," etc. By considering the character category and style in the scene of the single speaker audio content generation, the expressiveness, vividness, and the like of the synthesized voice can be enhanced.
In one aspect, embodiments of the present disclosure may be applied to a scenario in which audio content is synthesized with the voices of multiple speakers, which may also be referred to as multi-speaker audio content generation. The voices of different speakers may be used separately for different characters in the text content. These speakers may be predetermined speaker candidates having different attributes. For a particular character, the speaker's voice can be determined with reference to at least the character category, character personality, and the like. For example, assuming the character Mike is a young male with a bright personality, the voice of a speaker with attributes of <young>, <male>, <bright>, etc. may be selected to generate Mike's voice. By automatically assigning different speakers' voices to different characters in speech synthesis, the diversity of the audio content can be enhanced. Alternatively, styles may also be considered in speech synthesis, so that voices of different styles may be generated using the voice of the speaker corresponding to a character, thereby enhancing the expressiveness and vividness of the synthesized speech.
Embodiments of the present disclosure may employ various TTS models to synthesize speech in consideration of the above-mentioned reference factors. In one aspect, a TTS model based on language (linguistic) features may be employed. In the scenario of single speaker audio content generation, a language feature-based TTS model may be trained with a corpus of the target speaker's voice, where the model may generate speech taking into account at least character categories, optional styles, and the like. In a scenario of multi-speaker audio content generation, different versions of a language feature-based TTS model may be trained with the voice corpora of different candidate speakers, and a corresponding version of the model may be selected for a particular character to generate speech for that character, or further, different styles of speech may be generated for that character by considering style. In another aspect, a context-based TTS model may be employed. In the scenario of single speaker audio content generation, a context-based TTS model may be trained with a corpus of the target speaker's voice, where the model may generate speech taking into account at least the context of the text sentences, character categories, optional styles, and the like. In a scenario of multi-speaker audio content generation, different versions of a context-based TTS model may be trained with the voice corpora of different candidate speakers, and a corresponding version of the model may be selected for a particular character to generate speech for that character, or further, different styles of speech may be generated for that character by considering style.
Embodiments of the present disclosure also provide a flexible customization mechanism for audio content. For example, a user may adjust or customize audio content through a visual customization platform. Various parameters involved in speech synthesis may be modified or set to adjust any portion of the audio content so that a particular utterance in the audio content may have a desired character category, a desired style, and the like. Since the TTS model based on language features has explicit feature input, it can be used to update audio content in response to a user's adjustment indication.
Embodiments of the present disclosure can flexibly use a language feature-based TTS model and/or a context-based TTS model to automatically generate high-quality audio content. The TTS model based on language features may generate high quality speech by considering reference factors determined based on context, and may be used to adjust or update the generated audio content. The context-based TTS model considers not only the reference factors determined based on the context but also the context features extracted from the context itself in speech synthesis, so that the speech synthesis for long texts can be more coordinated. In the audio content generated according to the embodiment of the present disclosure, the words of the character will have stronger expressive power, vividness and diversity, so that the attraction, interest and the like of the audio content can be significantly improved. Automatic audio content generation according to embodiments of the present disclosure is fast and low cost. Furthermore, since the embodiments of the present disclosure convert text contents into high-quality audio contents in a fully automatic manner, the barrier to audio content creation is further lowered, so that not only professional dubbing actors but also general users can conveniently and quickly perform their own unique audio content creation.
Fig. 1 shows an exemplary conventional TTS model 100.
The encoder 112 may transform the information contained in the text 102 into a space that is more robust and more suitable for learning alignment with acoustic features. For example, the encoder 112 may convert information in the text 102 into a sequence of states in the space, which may also be referred to as an encoder state sequence. Each state in the sequence of states corresponds to a phoneme, grapheme, character, etc. in the text 102.
The attention module 114 may implement an attention mechanism. The attention mechanism establishes a connection between the encoder 112 and the decoder 116 to facilitate alignment between the text features output by the encoder 112 and the acoustic features. For example, a connection may be established between each decoding step and the encoder states, indicating which encoder state each decoding step should correspond to, and with what weight. The attention module 114 may take the encoder state sequence and the output of the previous decoding step as input, and generate an attention vector representing the weights with which the next decoding step is aligned to each encoder state.
The decoder 116 may map the state sequence output by the encoder 112 to the acoustic features 106 under the influence of the attention mechanism in the attention module 114. At each decoding step, the decoder 116 may take the attention vector output by the attention module 114 and the output of the previous decoding step as input, and output the acoustic features of one or more frames, e.g., mel-spectrograms.
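The attention computation described above can be sketched in a few lines of Python. This toy dot-product version is only illustrative — the actual model 100 may use a different score function, and the vectors here are made up:

```python
import math

def attention_weights(query, encoder_states):
    """Toy dot-product attention: score each encoder state against the
    decoder query, then normalize with softmax so the weights sum to 1."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in encoder_states]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three encoder states (e.g., one per phoneme) and one decoder query vector
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = attention_weights([0.9, 0.1], states)
print(weights)  # the first state receives the largest weight
```

The decoder would consume such a weight vector at each step to form a weighted sum of encoder states before predicting the next acoustic frame.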
Fig. 2 illustrates an exemplary process 200 of automatic audio content generation, according to an embodiment. The process 200 may be applied to a scenario of single speaker audio content generation.
The text content 210, e.g., a text storybook, is the processing object of automatic audio content generation according to an embodiment; the aim is to generate audio content, e.g., an audio book, by performing the process 200 on each of a plurality of texts included in the text content 210. Assume that text 212 is currently taken from the text content 210 and that a speech waveform corresponding to the text 212 is to be generated by performing the process 200.
At 220, a context 222 corresponding to the text 212 may be constructed. In one implementation, context 222 may include one or more texts adjacent to text 212 in text content 210. For example, context 222 may include at least one sentence before text 212 and/or at least one sentence after text 212. Thus, context 222 is actually a sequence of sentences corresponding to text 212. Optionally, context 222 may also include more text or all text in text content 210.
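As a concrete illustration of the context construction at 220, the following Python sketch builds a context window from adjacent sentences. The window size and function names are illustrative assumptions, not specified by the disclosure:

```python
def build_context(sentences, i, before=1, after=1):
    """Sketch of context construction: the context of sentence i is a window
    of adjacent sentences (at least one before and/or after, per the text)."""
    lo = max(0, i - before)
    hi = min(len(sentences), i + after + 1)
    return sentences[lo:hi]

book = ["Mike opened the door.",
        '"Who is there?" he asked.',
        "Silence answered."]
context = build_context(book, 1)
print(context)  # the sentence itself plus one neighbor on each side
```

Widening `before`/`after` (or returning the whole list) corresponds to the option of including more text, or all text, of the content in the context.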
At 230, a reference factor 232 may be determined based at least on the context 222. Reference factors 232 may affect the characteristics of the synthesized speech in subsequent TTS speech synthesis. Reference factors 232 may include a character category corresponding to text 212 indicating attributes such as age, gender, etc. of the character corresponding to text 212. For example, if the text 212 is an utterance spoken by a young male, the character category corresponding to the text 212 may be determined to be < young >, < male >, or the like. In general, different role categories may correspond to different speech characteristics. Optionally, reference factors 232 may also include a style corresponding to text 212 indicating, for example, what emotion type text 212 was spoken with. For example, if text 212 is an utterance spoken by a character with an angry emotion, the style corresponding to the text 212 may be determined to be < angry >. In general, different styles may correspond to different speech characteristics. The character categories and styles may affect the characteristics of the synthesized speech, either individually or in combination. In one implementation, the role categories, styles, etc. can be predicted based on the context 222 through a pre-trained predictive model at 230.
In accordance with process 200, a TTS model 240 pre-trained for a target speaker may be employed to generate speech waveforms. The target speaker may be a speaker automatically determined in advance or a speaker designated by the user. TTS model 240 may synthesize speech using the target speaker's voice. In one implementation, TTS model 240 may be a language feature-based TTS model, where, unlike conventional language feature-based TTS models, language feature-based TTS model 240 may synthesize speech in consideration of at least a reference factor. The language-feature-based TTS model 240 may generate a speech waveform 250 corresponding to the text 212 based on at least the text 212 and the role category if the reference factors 232 include the role category, or may generate a speech waveform 250 corresponding to the text 212 based on at least the text 212, the role category, and the style if the reference factors 232 include both the role category and the style. In one implementation, TTS model 240 may be a context-based TTS model, where context-based TTS model 240 may synthesize speech considering at least reference factors, unlike conventional context-based TTS models. Context-based TTS model 240 may generate speech waveform 250 corresponding to text 212 based on at least text 212, context 222, and role category where reference factors 232 include a role category, or may generate speech waveform 250 corresponding to text 212 based on at least text 212, context 222, role category, and style where reference factors 232 include both a role category and style.
In a similar manner, a plurality of speech waveforms corresponding to a plurality of texts included in the text content 210 may be generated by the process 200. All of these speech waveforms may together form audio content corresponding to the textual content 210. The audio content may include different character categories and/or different styles of speech synthesized using the target speaker's voice.
Fig. 3 illustrates an exemplary process 300 of automatic audio content generation according to an embodiment. The process 300 may be applied to a scenario of multi-speaker audio content generation.
Assume that the current text 312 is taken from the text content 310 and that the speech waveform corresponding to the text 312 is intended to be generated by performing the process 300.
At 320, a context 322 corresponding to the text 312 can be constructed. In one implementation, context 322 may include one or more texts adjacent to text 312 in text content 310. Optionally, context 322 may also include more text or all text in textual content 310.
At 330, reference factors 332 can be determined based at least on the context 322, which are used to influence the characteristics of the synthesized speech in the subsequent TTS speech synthesis. Reference factors 332 may include a character category corresponding to text 312. Reference factors 332 may also include a character personality corresponding to text 312, which indicates the personality of the character to which text 312 corresponds. For example, if the text 312 is an utterance spoken by an evil witch, the character personality corresponding to the text 312 may be determined to be <evil>. In general, different personalities may correspond to different voice characteristics. Reference factors 332 may also include a character corresponding to text 312, indicating which character in the textual content 310 spoke text 312. In general, different characters may employ different voices. Optionally, reference factors 332 may also include a style corresponding to text 312. In one implementation, different reference factors may be predicted based on context 322 by different pre-trained predictive models at 330. These predictive models may include, for example, a predictive model for predicting character categories and styles, a predictive model for predicting character personalities, a predictive model for predicting characters, and the like.
In accordance with process 300, a TTS model to be used may be selected at 340 from a library 350 of pre-prepared candidate TTS models. The candidate TTS model library 350 may include a plurality of candidate TTS models pre-trained for different candidate speakers. Each candidate speaker may have attributes in terms of at least one of a character category, a character personality, a character, and the like. For example, the attributes of candidate speaker 1 may include <old>, <female>, <nefarious>, and <witch>, the attributes of candidate speaker 2 may include <middle-aged>, <male>, and <kairan>, and so on. A candidate speaker corresponding to text 312 may be determined using at least one of the character category, character personality, and character in reference factors 332, and the TTS model corresponding to the determined candidate speaker may be selected accordingly.
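The selection at 340 can be pictured as an attribute-matching search over the candidate speaker library. The sketch below is only a heuristic illustration — the disclosure does not specify a matching rule, and all names and attribute sets are invented:

```python
def select_speaker(reference_attrs, speaker_library):
    """Pick the candidate speaker whose attribute set overlaps most with the
    reference factors (a simple overlap-count heuristic, assumed here)."""
    def score(attrs):
        return len(set(attrs) & set(reference_attrs))
    return max(speaker_library, key=lambda name: score(speaker_library[name]))

library = {
    "speaker_1": ["old", "female", "nefarious", "witch"],
    "speaker_2": ["middle-aged", "male"],
}
# Reference factors determined from context for an utterance by the witch
chosen = select_speaker(["old", "female", "witch"], library)
print(chosen)  # → speaker_1, which matches on three attributes
```

The TTS model version trained on the chosen speaker's corpus would then be used to synthesize the utterance.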
Assume that TTS model 360 is selected from candidate TTS model library 350 at 340 for generating the speech waveform for text 312. TTS model 360 may synthesize speech using the voice of the speaker corresponding to the model. In one implementation, TTS model 360 may be a language feature-based TTS model that may generate speech waveform 370 corresponding to text 312 based at least on text 312. Where reference factors 332 include style, speech waveform 370 may be generated by the language feature-based TTS model 360 further based on style. In one implementation, TTS model 360 may be a context-based TTS model that may generate speech waveform 370 corresponding to text 312 based at least on text 312 and context 322. Where reference factors 332 include style, speech waveform 370 may be generated by the context-based TTS model 360 further based on style.
In a similar manner, a plurality of speech waveforms corresponding to a plurality of texts included in the text content 310 may be generated by the process 300. All of these speech waveforms may together form audio content corresponding to textual content 310. The audio content may include speech synthesized with the voices of different speakers automatically assigned to different roles, and optionally, the speech may have different styles.
Fig. 4 illustrates an exemplary process 400 of preparing training data according to an embodiment.
Sets of matching audio content 402 and text content 404, e.g., audio books and corresponding text storybooks, can be obtained in advance.
At 410, automatic segmentation may be performed on the audio content 402. For example, the audio content 402 may be automatically divided into a plurality of audio segments, each of which may correspond to one or more speech utterances. The automatic segmentation at 410 may be performed by any known audio segmentation technique.
At 420, post-processing may be performed on the divided plurality of audio segments using the textual content 404. In one aspect, post-processing at 420 may include re-segmentation for utterance integrity using the textual content 404. For example, the textual content 404 may be readily divided into a plurality of text sentences by any known text segmentation technique, and the audio segment corresponding to each text sentence is then determined with reference to that sentence. For each text sentence, one or more of the audio segments obtained at 410 may be split or combined to match the sentence. Accordingly, the audio segments can be aligned with the text sentences in time, forming a plurality of <text sentence, audio segment> pairs. In another aspect, post-processing at 420 may include classifying the audio segments into voice-over (narration) and dialogue. For example, by performing a classification process for recognizing voice-over and dialogue on a text sentence, the audio segment corresponding to that sentence can be classified as voice-over or dialogue.
At 430, labels may be added for the <text sentence, audio segment> pairs relating to dialogue. The labels may include, for example, a character category, a style, and the like. In one case, the character category, style, etc. of each <text sentence, audio segment> pair can be determined by means of automatic clustering. In another case, the character category, style, etc. of each <text sentence, audio segment> pair may be manually labeled.
Through the process 400, a set of labeled training data 406 may ultimately be obtained. Each piece of training data may take a form such as <text sentence, audio segment, role category, style>. The training data 406 may in turn be applied to train a prediction model for predicting the role category and style corresponding to text, a TTS model for generating speech waveforms, and the like.
It should be appreciated that the above process 400 is merely exemplary, and that any other form of training data may be prepared in a similar manner depending on the particular application scenario and design. For example, the labels added at 430 may also include roles, role personalities, and the like.
Fig. 5 illustrates an exemplary process 500 of predicting role categories and styles, according to an embodiment. Process 500 may be performed by a predictive model for predicting role categories and styles. The predictive model may automatically assign role categories and styles to the text.
For text 502, a context corresponding to text 502 can be constructed at 510. The processing at 510 may be similar to the processing at 220 in fig. 2.
The constructed context may be provided to a pre-trained language model 520. The language model 520 models and represents textual information, and may be trained to generate a latent-space representation (e.g., an embedded expression) for input text. The language model 520 may be based on any suitable technique, such as Bidirectional Encoder Representations from Transformers (BERT), and the like.
The embedded expression output by the language model 520 may be provided, in that order, to a projection layer 530 and a softmax layer 540. The projection layer 530 may convert the embedded expression into a projected expression, and the softmax layer 540 may calculate probabilities of different role categories and probabilities of different styles based on the projected expression, thereby ultimately determining the role category 504 and style 506 corresponding to the text 502.
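The projection-plus-softmax head can be sketched in plain Python. The weight matrices and the two output heads (role category and style) are hypothetical stand-ins for trained parameters:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(embedding, w_proj, w_role, w_style):
    """Projection layer followed by two softmax heads. Each weight
    argument is a list of output columns; all weights here are
    illustrative, not trained parameters."""
    # Projection layer: convert the embedded expression into a projected expression.
    projected = [sum(e * w for e, w in zip(embedding, col)) for col in w_proj]
    # Softmax layer: probabilities over role categories and over styles.
    role_probs = softmax([sum(p * w for p, w in zip(projected, col)) for col in w_role])
    style_probs = softmax([sum(p * w for p, w in zip(projected, col)) for col in w_style])
    return role_probs, style_probs
```

The predicted role category and style are then simply the argmax of `role_probs` and `style_probs`, respectively.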
The predictive model used to perform process 500 may be trained using training data obtained through the process 400 of FIG. 4. For example, the predictive model may be trained using training data in the form of <text, role category, style>. When the trained predictive model is applied, it may predict the role category and style corresponding to the input text.
Although the predictive model described above in connection with process 500 jointly predicts role categories and styles, it should be understood that separate predictive models may be employed to predict role categories and styles, respectively, through a similar process.
FIG. 6 illustrates an exemplary implementation 600 for speech synthesis using a TTS model based on language features, according to an embodiment. This implementation 600 may be applied to single-speaker audio content generation and may be considered an exemplary specific implementation of the process 200 of FIG. 2. Assume that in FIG. 6 it is desired to generate a speech waveform for text 602.
At 630, front-end analysis may be performed on the text 602 to extract phoneme features 632 and prosody (prosody) features 634. The front-end analysis at 630 may be performed using any known TTS front-end analysis technique. The phoneme features 632 may refer to a sequence of phonemes extracted from the text 602. The prosodic features 634 may refer to prosodic information corresponding to the text 602, such as pauses (break), accents (accent), speech rate, and so forth.
The phoneme features 632 and prosodic features 634 may be encoded using an encoder 640. The encoder 640 may be based on any architecture. As an example, one implementation of the encoder 640 is given in FIG. 7. FIG. 7 shows an exemplary implementation of an encoder 710 in a language feature based TTS model according to an embodiment. The encoder 710 may correspond to the encoder 640 in FIG. 6. The encoder 710 may encode phoneme features 702 and prosodic features 704, where the phoneme features 702 and prosodic features 704 may correspond to the phoneme features 632 and prosodic features 634 in FIG. 6, respectively. Features may be extracted from the phoneme features 702 and prosodic features 704 by processing them through a 1-D convolution filter 712, a max-pooling layer 714, and a 1-D convolution projection 716 in sequence. At 718, the output of the 1-D convolution projection 716 may be added (superimposed) with the phoneme features 702 and prosodic features 704. The superimposed output at 718 may then be processed through a highway network layer 722 and a Bidirectional Long Short-Term Memory (BLSTM) layer 724 to obtain the encoder output. It should be understood that the architecture and all components in FIG. 7 are exemplary, and that the encoder 710 may have any other implementation depending on particular needs and design.
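Shape-wise, the 1-D convolution and max-pooling stages of FIG. 7 behave as in the following toy sketch over scalar sequences (real encoders operate on feature vectors with learned kernels and padding):

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation) over a scalar sequence."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def maxpool1d(seq, size):
    """Non-overlapping max pooling along the sequence."""
    return [max(seq[i:i + size]) for i in range(0, len(seq), size)]
```

For instance, `conv1d([1, 2, 3, 4], [1, 1])` yields `[3, 5, 7]`, which `maxpool1d` then downsamples; the BLSTM and highway layers that follow in FIG. 7 operate on such pooled feature sequences.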
According to process 600, at 642, the output of encoder 640 and the implicit expression obtained at 620 can be superimposed.
The language feature based TTS model in FIG. 6 may be used for single-speaker audio content generation. Thus, the TTS model may be trained for a target speaker 604. A speaker embedding 606 corresponding to the target speaker 604 may be obtained, for example, through a speaker embedding lookup table (LUT), and the speaker embedding 606 may be used to condition the TTS model to synthesize speech in the voice of the target speaker 604. At 644, the superimposed output at 642 may be concatenated with the speaker embedding 606, and the concatenated output may be provided to the attention module 650.
The decoder 660 may generate acoustic features, e.g., mel-frequency spectral features, etc., under the influence of the attention module 650. The vocoder 670 may generate the speech waveform 608 corresponding to the text 602 based on the acoustic features.
FIG. 8 illustrates an exemplary implementation 800 for speech synthesis using a context-based TTS model according to an embodiment. This implementation 800 may be applied to single-speaker audio content generation and may be considered an exemplary specific implementation of the process 200 of FIG. 2. Implementation 800 is similar to implementation 600 in FIG. 6, except that the TTS model employs context encoding. Assume that in FIG. 8 it is desired to generate a speech waveform for text 802.
Phoneme features 832 may be extracted by performing front-end analysis (not shown) on the text 802. The phoneme features 832 may be encoded using a phoneme encoder 830. The phoneme encoder 830 may be similar to the encoder 640 of FIG. 6, except that it takes only phoneme features as input.
According to process 800, at 852, the output of the phoneme encoder 830, the output of the context encoder 840, and the implicit expression obtained at 820 can be superimposed. In addition, the output of the context encoder 840 may also be provided to the attention module 844.
The context-based TTS model in FIG. 8 may be used for single-speaker audio content generation. Thus, the TTS model may be trained for a target speaker 804. A speaker embedding 806 corresponding to the target speaker 804 may be obtained to condition the TTS model to synthesize speech in the voice of the target speaker 804. At 854, the superimposed output at 852 may be concatenated with the speaker embedding 806, and the concatenated output may be provided to the attention module 860.
At 870, the output of attention module 844 may be concatenated with the output of attention module 860 to affect the generation of acoustic features at decoder 880. Vocoder 890 may generate speech waveform 808 corresponding to text 802 based on the acoustic features.
Model training in FIG. 8 may be performed using at least training data obtained through, for example, the process 400 of FIG. 4. It should be appreciated that, in the actual application stage, the input of the speaker embedding may optionally be omitted, since the TTS model has already been trained to synthesize speech in the voice of the target speaker. Further, it should be appreciated that all components and processes in implementation 800 are exemplary, and that implementation 800 may be modified in any manner depending on particular needs and design.
Fig. 10 illustrates an exemplary process 1000 of predicting roles and selecting a TTS model, according to an embodiment. Process 1000 may be performed in the context of multi-speaker audio content generation, which is an exemplary implementation of at least a portion of process 300 in FIG. 3. Process 1000 may be used to determine a particular character corresponding to text and select a TTS model trained based on the voice of the speaker corresponding to the character.
The text 1002 is from text content 1004. At 1010, a context corresponding to the text 1002 can be constructed. At 1020, an embedded representation of the context may be generated by, for example, a pre-trained language model.
At 1030, a plurality of candidate roles may be extracted from the text content 1004. Assuming that the text content 1004 is a text storybook, all of the roles involved in the storybook may be extracted at 1030 to form a candidate role list. The candidate role extraction at 1030 may be performed by any known technique.
At 1040, context-based candidate feature extraction may be performed. For example, for the current text 1002, one or more candidate features may be extracted from the context for each candidate role. Assuming a total of N candidate roles, N candidate feature vectors may be obtained at 1040, where each candidate feature vector includes the candidate features extracted for one candidate role. Various types of features may be extracted at 1040. In one implementation, the extracted features may include the number of words between the current text and the nearest mention of the candidate role. Since the name of a role typically occurs near that role's utterances, this feature helps determine whether the current text was spoken by a given candidate role. In one implementation, the extracted features may include the number of times the candidate role occurs in the context. This feature may reflect the relative importance of the candidate role in the text content. In one implementation, the extracted features may include a binary feature indicating whether the name of the candidate role appears in the current text, since a role is unlikely to mention its own name in a spoken utterance. In one implementation, the extracted features may include binary features indicating whether the name of the candidate role appears in the closest preceding text or in the closest subsequent text. Since conversations between two roles often follow a speaker-alternating pattern, the role speaking the current text is highly likely to appear in the closest preceding or following text. It should be understood that the extracted features may also include any other features useful for determining, for example, the role corresponding to the text 1002.
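The candidate features described above can be sketched as follows; the whitespace tokenization, the feature names, and the use of the first word of the current text as its context position are all illustrative simplifications of what a real extractor would compute:

```python
def candidate_features(current_text, context, prev_text, next_text, candidate):
    """Extract context-based features for a single candidate role."""
    tokens = context.split()
    mentions = [i for i, tok in enumerate(tokens) if tok == candidate]
    words = current_text.split()
    first_word = words[0] if words else ""
    text_pos = tokens.index(first_word) if first_word in tokens else 0
    return {
        # Words between the current text and the candidate's nearest mention.
        "distance": min((abs(p - text_pos) for p in mentions), default=len(tokens)),
        # How often the candidate occurs in the context (relative importance).
        "mention_count": len(mentions),
        # A role rarely says its own name, so a mention inside the current
        # text argues against this candidate being the speaker.
        "in_current_text": candidate in words,
        # Speaker alternation: the true speaker often appears in the closest
        # preceding or subsequent text.
        "in_adjacent_text": candidate in prev_text.split() or candidate in next_text.split(),
    }
```

Running this for all N candidate roles produces the N candidate feature vectors that step 1050 combines with the context-embedded expression.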
At 1050, the context-embedded expression generated at 1020 may be combined with all of the candidate feature vectors extracted at 1040 to form a candidate feature matrix corresponding to all of the candidate roles.
According to process 1000, a role 1062 corresponding to the text 1002 may be determined from the plurality of candidate roles, based at least on the context, through a learning-to-rank (LTR) model 1060. For example, the LTR model 1060 may rank the plurality of candidate roles based on the candidate feature matrix obtained from the context, and determine the highest-ranked candidate role as the role 1062 corresponding to the text 1002. The LTR model 1060 may be constructed using various techniques, such as ranking Support Vector Machines (SVMs), RankNet, ordinal classification, and the like. It should be understood that the LTR model 1060 may itself be considered a prediction model for predicting a role based on context, or, more broadly, the combination of the LTR model 1060 and steps 1010, 1020, 1030, 1040, and 1050 may be considered such a prediction model.
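The ranking step can be illustrated with a simple linear scoring function standing in for a trained LTR model such as a ranking SVM; the weights here are hypothetical:

```python
def rank_candidates(candidate_feature_vectors, weights):
    """Score each candidate role with a linear model and return the
    candidates ranked best-first. A stand-in for a trained LTR model;
    the weights would normally be learned from labeled data."""
    scored = sorted(
        ((sum(w * f for w, f in zip(weights, feats)), name)
         for name, feats in candidate_feature_vectors.items()),
        reverse=True)
    return [name for _, name in scored]
```

The highest-ranked candidate is then taken as the role corresponding to the current text.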
According to process 1000, a role personality 1072 of the role corresponding to the text 1002 may optionally be predicted based at least on the context by a personality prediction model 1070. For example, the personality prediction model 1070 may predict the role personality 1072 based on the candidate feature matrix obtained from the context. The personality prediction model 1070 may be constructed through a process similar to process 500 of FIG. 5, except that it is trained for the role personality classification task using pairs of text and role personality training data.
In accordance with process 1000, a TTS model 1090 to be used may be selected at 1080 from a pre-prepared candidate TTS model library 1082. The candidate TTS model library 1082 may include a plurality of candidate TTS models pre-trained for different candidate speakers. Each candidate speaker may have attributes in terms of at least one of a role category, a role personality, a role, and the like. At 1080, a candidate speaker corresponding to the text 1002 may be determined using at least one of the role 1062, the role personality 1072, and the role category 1006, and a TTS model 1090 corresponding to the determined candidate speaker may be selected accordingly. The role category 1006 may be determined through, for example, the process 500 of FIG. 5.
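The selection at 1080 can be sketched as a best-attribute-match lookup over the candidate TTS model library; the attribute keys and the lexicographic tie-breaking (role match weighted highest) are illustrative choices, not prescribed by the process:

```python
def select_tts_model(model_library, role=None, personality=None, category=None):
    """Choose the candidate speaker whose attributes best match the
    predicted role, role personality, and role category. Matches are
    compared lexicographically, so a role match dominates."""
    def match(attrs):
        return (attrs.get("role") == role,
                attrs.get("personality") == personality,
                attrs.get("category") == category)
    return max(model_library, key=lambda spk: match(model_library[spk]))
```

The returned speaker identifies the pre-trained candidate TTS model to use for synthesis.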
It should be understood that any of the steps and processes in process 1000 are exemplary and that process 1000 may be modified in any manner depending on the particular needs and design.
FIG. 11 illustrates an exemplary implementation 1100 for speech synthesis using a TTS model based on language features, according to an embodiment. This implementation 1100 may be applied to multi-speaker audio content generation and may be considered an exemplary specific implementation of the process 300 of FIG. 3. Assume that in FIG. 11 it is desired to generate a speech waveform for text 1102.
A role personality 1112 corresponding to the text 1102 may be predicted by the prediction model 1110. The prediction model 1110 may correspond to, for example, the personality prediction model 1070 in FIG. 10. A role 1122 corresponding to the text 1102 may be predicted by the prediction model 1120. The prediction model 1120 may correspond to the prediction model for predicting a role described above in connection with FIG. 10. The role category 1132 and style 1134 corresponding to the text 1102 may be predicted by the prediction model 1130. The prediction model 1130 may be constructed based on, for example, the process 500 of FIG. 5. At 1136, an implicit expression corresponding to the style 1134 may be generated. At 1140, front-end analysis may be performed on the text 1102 to extract phoneme features 1142 and prosodic features 1144. The phoneme features 1142 and prosodic features 1144 may be encoded using the encoder 1150. At 1152, the output of the encoder 1150 and the implicit expression obtained at 1136 may be superimposed.
The language feature based TTS model in FIG. 11 may be used for multi-speaker audio content generation. A candidate speaker 1104 corresponding to the role 1122 may be determined based on at least one of the role 1122, the role personality 1112, and the role category 1132, in a manner similar to the process 1000 of FIG. 10. The TTS model may be trained for the candidate speaker 1104. A speaker embedding 1106 corresponding to the candidate speaker 1104 may be obtained, for example, through a speaker embedding lookup table (LUT), and the speaker embedding 1106 may be used to condition the TTS model to synthesize speech in the voice of the candidate speaker 1104. At 1154, the superimposed output at 1152 may be concatenated with the speaker embedding 1106, and the concatenated output may be provided to the attention module 1160.
The decoder 1170 may generate acoustic features under the influence of the attention module 1160. The vocoder 1180 may generate the speech waveform 1108 corresponding to the text 1102 based on the acoustic features.
The implementation 1100 in FIG. 11 is intended to illustrate an exemplary architecture for speech synthesis using a TTS model based on linguistic features. A plurality of candidate TTS models can be obtained by constructing corresponding TTS models for different candidate speakers. In a practical application stage, a candidate speaker corresponding to a character may be determined based on at least one of the character, character personality and character category, and a TTS model trained for the candidate speaker may be selected for generating a speech waveform. Further, it should be appreciated that any of the components and processes in implementation 1100 are exemplary and that implementation 1100 may be altered in any manner depending on the particular needs and design.
FIG. 12 illustrates an exemplary implementation 1200 for speech synthesis using a context-based TTS model according to an embodiment. This implementation 1200 may be applied to multi-speaker audio content generation and may be considered an exemplary specific implementation of the process 300 of FIG. 3. Implementation 1200 is similar to implementation 1100 in FIG. 11, except that the TTS model employs context encoding. Assume that in FIG. 12 it is desired to generate a speech waveform for text 1202.
Phoneme features 1242 may be extracted by performing a front end analysis (not shown) on the text 1202. Encoding may be performed on the phoneme features 1242 using the phoneme encoder 1240. The phoneme encoder 1240 may be similar to the phoneme encoder in fig. 8. Context information 1252 may be extracted from text 1202 and context information 1252 may be encoded using context encoder 1250. The context encoder 1250 may be similar to the context encoder 840 in fig. 8.
At 1262, the output of the phoneme encoder 1240, the output of the context encoder 1250, and the implicit expression obtained at 1236 may be superimposed. The output of the context encoder 1250 may also be provided to an attention module 1254.
The context-based TTS model in FIG. 12 may be used for multi-speaker audio content generation. A candidate speaker 1204 corresponding to the role 1222 may be determined based on at least one of the role 1222, the role personality 1212, and the role category 1232, in a manner similar to the process 1000 of FIG. 10. The TTS model may be trained for the candidate speaker 1204. A speaker embedding 1206 corresponding to the candidate speaker 1204 may be used to condition the TTS model to synthesize speech in the voice of the candidate speaker 1204. At 1264, the superimposed output at 1262 may be concatenated with the speaker embedding 1206, and the concatenated output may be provided to the attention module 1270.
At 1272, the output of the attention module 1254 may be concatenated with the output of the attention module 1270 to influence the generation of acoustic features at the decoder 1280. The vocoder 1290 may generate a speech waveform 1208 corresponding to the text 1202 based on the acoustic features.
According to embodiments of the present disclosure, audio content may also be customized. For example, a speech waveform in the generated audio content may be adjusted to update the audio content.
Fig. 13 illustrates an exemplary process 1300 of updating audio content according to an embodiment.
Assume that a user provides textual content 1302 and wants to obtain audio content corresponding to the textual content 1302. Audio content 1304 corresponding to text content 1302 may be created by performing audio content generation at 1310. The audio content generation at 1310 may be based on any implementation of the automatic audio content generation according to embodiments of the present disclosure described above in connection with fig. 2-12.
The audio content 1304 may be provided to a customization platform 1320. The customization platform 1320 may include a user interface for interacting with a user. Through the user interface, audio content 1304 may be provided and presented to a user, and an indication 1306 of a user's adjustment to at least a portion of the audio content may be received. For example, if the user is not satisfied with a certain utterance in the audio content 1304 or wants to modify the utterance to a desired character category, a desired style, etc., the user may enter an adjustment indication 1306 through the user interface.
In accordance with process 1300, in response to adjustment indication 1306, customization platform 1320 may invoke TTS model 1330 to regenerate the speech waveform. Assuming that the adjustment indication 1306 is for a certain utterance or corresponding speech waveform in the audio content 1304, text corresponding to the speech waveform may be provided to the TTS model 1330 along with the adjustment information in the adjustment indication. TTS model 1330 may in turn regenerate speech waveform 1332 for the text conditioned on the adjustment information. Taking the example where the adjustment indication 1306 includes adjustment information regarding a role category, the role category specified in the adjustment indication 1306 may be utilized in place of, for example, the role category determined in fig. 2, and further a speech waveform may be generated by the TTS model. In one implementation, the TTS model 1330 may employ a language feature-based TTS model, since the language feature-based TTS model has explicit feature inputs that can be controlled by parameters corresponding to the adjustment indication.
The previous speech waveform in audio content 1304 may be replaced with the regenerated speech waveform 1332 to form updated audio content 1308.
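The customization flow of process 1300 can be sketched end-to-end as follows, with hypothetical field names and a stand-in callable for the TTS model 1330:

```python
def update_audio_content(audio_content, utterance_id, adjustment, tts_model):
    """Regenerate the speech waveform for one utterance according to an
    adjustment indication and splice it back into the audio content.

    `audio_content` maps utterance ids to dicts with 'text',
    'role_category', 'style', and 'waveform' entries, and `tts_model`
    is any callable synthesizer -- all names are illustrative.
    """
    entry = dict(audio_content[utterance_id])
    entry.update(adjustment)  # e.g. override the role category or style
    entry["waveform"] = tts_model(
        entry["text"], entry["role_category"], entry["style"])
    updated = dict(audio_content)
    updated[utterance_id] = entry  # replace the previous speech waveform
    return updated
```

Because the original dict is copied rather than mutated, the platform can keep the prior version available in case the user rejects the adjusted result.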
Fig. 14 shows a flow of an exemplary method 1400 for automatic audio content generation, according to an embodiment.
At 1410, text may be obtained.
At 1420, a context corresponding to the text can be constructed.
At 1430, reference factors can be determined based at least on the context, the reference factors including at least a role category and/or role corresponding to the text.
At 1440, a speech waveform corresponding to the text can be generated based at least on the text and the reference factor.
In one implementation, the reference factor may further include a style corresponding to the text.
In one implementation, the determining the reference factor may include: predicting, by a prediction model, the role category based at least on the context.
The generating of the voice waveform may include: generating the speech waveform based on at least the text and the role category through a TTS model based on language features. The TTS model based on the language features may be pre-trained for a target speaker.
The generating of the voice waveform may include: generating, by a context-based TTS model, the speech waveform based on at least the text, the context, and the role category. The context-based TTS model may be pre-trained for a target speaker.
In one implementation, the determining the reference factor may include: extracting a plurality of candidate characters from text content including the text; and determining, by an LTR model, the role from the plurality of candidate roles based at least on the context.
In one implementation, the generating the speech waveform may include: selecting a TTS model corresponding to the role from a plurality of candidate TTS models trained in advance, wherein the plurality of candidate TTS models are respectively trained in advance for different speakers; and generating the speech waveform through the selected TTS model.
The determining the reference factor may include: predicting, by a first prediction model, the role category based at least on the context; predicting, by a second prediction model, the role based at least on the context; and predicting, by a third prediction model, a role personality based at least on the context. The selecting a TTS model may include: selecting the TTS model from the plurality of candidate TTS models based on at least one of the role, the role category, and the role personality.
The TTS model selected may be a language feature-based TTS model, and the generating the speech waveform may include: generating, by the language feature based TTS model, the speech waveform based at least on the text.
The selected TTS model may be a context-based TTS model, and the generating the speech waveform may include: generating, by the context-based TTS model, the speech waveform based on at least the text and the context.
In one implementation, the speech waveform may be generated further based on a style corresponding to the text.
In one implementation, the method 1400 may further include: receiving an adjustment indication for the speech waveform; and in response to the adjustment indication, regenerating a speech waveform corresponding to the text through a TTS model based on language features.
The adjustment indication may comprise at least one of: adjustment information regarding prosodic information, the prosodic information including at least one of pauses, accents, pitches, and speech rate; adjustment information regarding pronunciation; adjustment information regarding the role category; adjustment information regarding the style; and adjustment information regarding acoustic parameters.
It should be understood that method 1400 may also include any of the steps/processes for automatic audio content generation according to embodiments of the present disclosure described above.
Fig. 15 illustrates an exemplary apparatus 1500 for automatic audio content generation, according to an embodiment.
The apparatus 1500 may include: a text obtaining module 1510 configured to obtain a text; a context construction module 1520 for constructing a context corresponding to the text; a reference factor determination module 1530 for determining reference factors based on at least the context, the reference factors including at least a role category and/or a role corresponding to the text; and a speech waveform generation module 1540 for generating a speech waveform corresponding to the text based on at least the text and the reference factor.
In one implementation, the reference factor determination module 1530 may be configured to: predicting, by a prediction model, the role category based at least on the context.
The speech waveform generation module 1540 may be configured to: generating the speech waveform based on at least the text and the role category through a TTS model based on language features. The TTS model based on the language features may be pre-trained for a target speaker.
The speech waveform generation module 1540 may be configured to: generating, by a context-based TTS model, the speech waveform based on at least the text, the context, and the role category. The context-based TTS model may be pre-trained for a target speaker.
In one implementation, the reference factor determination module 1530 may be configured to: extracting a plurality of candidate characters from text content including the text; and determining, by an LTR model, the role from the plurality of candidate roles based at least on the context.
In one implementation, the voice waveform generation module 1540 may be configured to: selecting a TTS model corresponding to the role from a plurality of candidate TTS models trained in advance, wherein the plurality of candidate TTS models are respectively trained in advance for different speakers; and generating the speech waveform through the selected TTS model.
Furthermore, the apparatus 1500 may also include any other modules that perform the steps of the method for automatic audio content generation according to embodiments of the present disclosure described above.
Fig. 16 illustrates an exemplary apparatus 1600 for automatic audio content generation, according to an embodiment.
Embodiments of the present disclosure may be embodied in non-transitory computer readable media. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any of the operations of the method for automatic audio content generation according to embodiments of the present disclosure described above.
It should be understood that all operations in the methods described above are exemplary only, and the present disclosure is not limited to any operations in the methods or the order of the operations, but rather should encompass all other equivalent variations under the same or similar concepts.
It should also be understood that all of the modules in the above described apparatus may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided functionally into sub-modules or combined together.
The processor has been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and the overall design constraints imposed on the system. By way of example, the processor, any portion of the processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, Digital Signal Processor (DSP), Field Programmable Gate Array (FPGA), Programmable Logic Device (PLD), state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software should be construed broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer-readable medium. The computer-readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, a Random Access Memory (RAM), a Read-Only Memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown as separate from the processor in the aspects presented in this disclosure, the memory may be located internal to the processor (e.g., a cache or a register).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described herein that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.
Claims (20)
1. A method for automatic audio content generation, comprising:
obtaining a text;
constructing a context corresponding to the text;
determining reference factors based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text; and
generating a speech waveform corresponding to the text based on at least the text and the reference factor.
2. The method of claim 1, wherein,
the reference factors further include a style corresponding to the text.
3. The method of claim 1, wherein the determining a reference factor comprises:
predicting, by a prediction model, the role category based at least on the context.
4. The method of claim 3, wherein the generating a speech waveform comprises:
generating the speech waveform based on at least the text and the role category through a text-to-speech (TTS) model based on language features,
wherein the TTS model based on language features is pre-trained for a target speaker.
5. The method of claim 3, wherein the generating a speech waveform comprises:
generating, by a context-based text-to-speech (TTS) model, the speech waveform based on at least the text, the context, and the role category,
wherein the context-based TTS model is pre-trained for a target speaker.
6. The method of claim 1, wherein the determining a reference factor comprises:
extracting a plurality of candidate roles from text content including the text; and
determining the role from the plurality of candidate roles based at least on the context through a learning-to-rank (LTR) model.
7. The method of claim 1, wherein the generating a speech waveform comprises:
selecting a text-to-speech (TTS) model corresponding to the role from a plurality of candidate TTS models trained in advance, the plurality of candidate TTS models being pre-trained for different speakers, respectively; and
generating the speech waveform through the selected TTS model.
8. The method of claim 7, wherein the determining a reference factor comprises:
predicting, by a first prediction model, the role category based at least on the context;
predicting, by a second predictive model, the role based at least on the context; and
predicting a role trait based at least on the context by a third predictive model, and
wherein the selecting a TTS model comprises: selecting the TTS model from the plurality of candidate TTS models based on at least one of the role, the role category, and the role trait.
9. The method of claim 7, wherein the selected TTS model is a language feature-based TTS model, and the generating a speech waveform comprises:
generating, by the language feature based TTS model, the speech waveform based at least on the text.
10. The method of claim 7, wherein the selected TTS model is a context-based TTS model, and the generating a speech waveform comprises:
generating, by the context-based TTS model, the speech waveform based on at least the text and the context.
11. The method of any one of claims 4, 5, 9, 10,
the speech waveform is further generated based on a style corresponding to the text.
12. The method of claim 1, further comprising:
receiving an adjustment indication for the speech waveform; and
in response to the adjustment indication, regenerating a speech waveform corresponding to the text through a text-to-speech (TTS) model based on language features.
13. The method of claim 12, wherein the adjustment indication comprises at least one of:
adjustment information on prosodic information, the prosodic information including at least one of pauses, accents, pitches, rates;
adjustment information about pronunciation;
adjustment information regarding the role category;
adjustment information about the style; and
adjustment information regarding acoustic parameters.
14. An apparatus for automatic audio content generation, comprising:
a text obtaining module to obtain a text;
a context construction module to construct a context corresponding to the text;
a reference factor determination module to determine reference factors based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text; and
a speech waveform generation module to generate a speech waveform corresponding to the text based at least on the text and the reference factor.
15. The apparatus of claim 14, wherein the reference factor determination module is to:
predicting, by a prediction model, the role category based at least on the context.
16. The apparatus of claim 15, wherein the speech waveform generation module is to:
generating the speech waveform based on at least the text and the role category through a text-to-speech (TTS) model based on language features,
wherein the TTS model based on language features is pre-trained for a target speaker.
17. The apparatus of claim 15, wherein the speech waveform generation module is to:
generating, by a context-based text-to-speech (TTS) model, the speech waveform based on at least the text, the context, and the role category,
wherein the context-based TTS model is pre-trained for a target speaker.
18. The apparatus of claim 14, wherein the reference factor determination module is to:
extracting a plurality of candidate roles from text content including the text; and
determining the role from the plurality of candidate roles based at least on the context through a learning-to-rank (LTR) model.
19. The apparatus of claim 14, wherein the speech waveform generation module is to:
selecting a text-to-speech (TTS) model corresponding to the role from a plurality of candidate TTS models trained in advance, the plurality of candidate TTS models being pre-trained for different speakers, respectively; and
generating the speech waveform through the selected TTS model.
20. An apparatus for automatic audio content generation, comprising:
at least one processor; and
a memory storing computer-executable instructions that, when executed, cause the at least one processor to:
obtain a text,
construct a context corresponding to the text,
determine reference factors based at least on the context, the reference factors including at least a role category and/or a role corresponding to the text, and
generate a speech waveform corresponding to the text based at least on the text and the reference factor.
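The pipeline of claims 1, 3, 6, and 7 (construct a context, predict a role category, rank candidate roles with an LTR model, then pick a per-speaker TTS model) can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: `QuoteCategoryModel` and `MentionCountLTR` are toy stand-ins for the trained prediction and learning-to-rank models, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReferenceFactors:
    """Reference factors of claim 1: a role category and/or a role."""
    role_category: str
    role: Optional[str] = None

class QuoteCategoryModel:
    """Toy stand-in for the prediction model of claim 3:
    treat quoted text in the context as dialogue, otherwise narration."""
    def predict(self, context):
        return "dialogue" if any('"' in s for s in context) else "narration"

class MentionCountLTR:
    """Toy stand-in for the learning-to-rank model of claim 6:
    score a candidate role by how often it is mentioned in the context."""
    def score(self, role, context):
        return sum(s.count(role) for s in context)

def build_context(text, document, window=2):
    """Claim 1: construct a context from the sentences surrounding `text`."""
    i = document.index(text)
    return document[max(0, i - window): i + window + 1]

def determine_reference_factors(text, document, category_model,
                                ltr_model, candidate_roles):
    """Claims 3 and 6: predict the role category, then rank candidate roles."""
    context = build_context(text, document)
    category = category_model.predict(context)
    role = None
    if category == "dialogue" and candidate_roles:
        role = max(candidate_roles, key=lambda r: ltr_model.score(r, context))
    return ReferenceFactors(role_category=category, role=role)

def generate_speech(text, factors, tts_models):
    """Claim 7: select the pre-trained TTS model matching the role
    (falling back to a narrator voice) and synthesize the text."""
    model = tts_models.get(factors.role, tts_models["narrator"])
    return model(text)
```

In this sketch the `tts_models` mapping plays the part of the plurality of candidate TTS models pre-trained for different speakers; a real system would hold one trained synthesis model per speaker rather than a callable per role.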
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010387249.8A CN113628609A (en) | 2020-05-09 | 2020-05-09 | Automatic audio content generation |
PCT/US2021/028297 WO2021231050A1 (en) | 2020-05-09 | 2021-04-21 | Automatic audio content generation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113628609A true CN113628609A (en) | 2021-11-09 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115620699A (en) * | 2022-12-19 | 2023-01-17 | Shenzhen Yuanxiang Information Technology Co., Ltd. | Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101156196A (en) * | 2005-03-28 | 2008-04-02 | Lessac Technologies, Inc. | Hybrid speech synthesizer, method and use |
CN106652995A (en) * | 2016-12-31 | 2017-05-10 | Shenzhen UBTECH Robotics Co., Ltd. | Voice broadcasting method and system for text |
CN110491365A (en) * | 2018-05-10 | 2019-11-22 | Microsoft Technology Licensing, LLC | Generating audio for plain text documents |
CN110634336A (en) * | 2019-08-22 | 2019-12-31 | Beijing Dajia Internet Information Technology Co., Ltd. | Method and device for generating audio electronic book |
CN111105776A (en) * | 2018-10-26 | 2020-05-05 | Institute for Information Industry | Audio playing device and playing method thereof |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9183831B2 (en) * | 2014-03-27 | 2015-11-10 | International Business Machines Corporation | Text-to-speech for digital literature |
US10607595B2 (en) * | 2017-08-07 | 2020-03-31 | Lenovo (Singapore) Pte. Ltd. | Generating audio rendering from textual content based on character models |
EP3824461B1 (en) * | 2018-07-19 | 2022-08-31 | Dolby International AB | Method and system for creating object-based audio content |
Also Published As
Publication number | Publication date |
---|---|
WO2021231050A1 (en) | 2021-11-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||