CN113327580A - Speech synthesis method, device, readable medium and electronic equipment - Google Patents


Info

Publication number
CN113327580A
Authority
CN
China
Prior art keywords
phoneme
training
sequence
emotion
text
Prior art date
Legal status
Pending
Application number
CN202110609251.XA
Other languages
Chinese (zh)
Inventor
吴鹏飞
潘俊杰
马泽君
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110609251.XA
Publication of CN113327580A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation

Abstract

The present disclosure relates to a speech synthesis method, apparatus, readable medium and electronic device in the technical field of electronic information processing. The method comprises: acquiring a text to be synthesized and a specified emotion type; extracting a phoneme sequence corresponding to the text to be synthesized; and inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain target audio output by the speech synthesis model, where the target audio corresponds to the text to be synthesized and has the specified emotion type. The audio frames corresponding to each phoneme in the target audio match the acoustic feature corresponding to that phoneme in an acoustic feature sequence; the acoustic feature sequence is determined by the speech synthesis model from the phoneme sequence, includes one acoustic feature per phoneme, and each acoustic feature indicates the prosodic features of its phoneme. The present disclosure can thus control the emotion of the target audio through the specified emotion type while controlling phoneme-level prosody in the target audio through the acoustic features.

Description

Speech synthesis method, device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of electronic information processing technologies, and in particular, to a speech synthesis method, apparatus, readable medium, and electronic device.
Background
With the continuous development of electronic information processing technology, speech has become an important carrier through which people obtain information in daily life and work. Application scenarios involving speech usually include speech synthesis, which refers to synthesizing text specified by a user into audio. During speech synthesis, a specific emotion type can be specified so that the synthesized speech carries the corresponding emotion. In real speech, however, the same emotion can be delivered with different emotional intensity and emotional expression; for example, if a speaker reads the same passage twice with the same emotion, both the intensity and the expression of that emotion will vary between readings. Synthesized speech, by contrast, is generally uniform in emotional intensity and emotional expression, so its expressiveness is too monotonous to meet users' needs.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method of speech synthesis, the method comprising:
acquiring a text to be synthesized and a specified emotion type;
extracting a phoneme sequence corresponding to the text to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes;
inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, wherein the audio frame corresponding to each phoneme in the target audio is matched with the acoustic feature corresponding to the phoneme in an acoustic feature sequence, the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence comprises the acoustic feature corresponding to each phoneme, and the acoustic feature is used for indicating the prosodic feature of the phoneme.
In a second aspect, the present disclosure provides a speech synthesis apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a text to be synthesized and a specified emotion type;
the extraction module is used for extracting a phoneme sequence corresponding to the text to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes;
a synthesis module, configured to input the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, where an audio frame corresponding to each phoneme in the target audio matches an acoustic feature corresponding to the phoneme in an acoustic feature sequence, where the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence includes an acoustic feature corresponding to each phoneme, and the acoustic feature is used to indicate a prosodic feature of the phoneme.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the first aspect of the present disclosure.
According to the above technical solution, a text to be synthesized and a specified emotion type are first acquired; a phoneme sequence corresponding to the text to be synthesized and comprising a plurality of phonemes is then extracted; and finally the phoneme sequence and the specified emotion type are used as the input of a pre-trained speech synthesis model, so as to obtain target audio output by the speech synthesis model that corresponds to the text to be synthesized, has the specified emotion type, and in which the audio frames corresponding to each phoneme match the acoustic feature corresponding to that phoneme in an acoustic feature sequence. The acoustic feature sequence is determined by the speech synthesis model from the phoneme sequence and includes one acoustic feature per phoneme. The emotion of the target audio is controlled by the specified emotion type, and the acoustic feature of each phoneme of the text can be predicted, so the phoneme-level prosody of the target audio can be controlled through the acoustic features. Control along two dimensions, the text-level emotion type and the phoneme-level acoustic features, is therefore achieved during speech synthesis, which improves the expressiveness of the target audio.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating a speech synthesis method according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating another speech synthesis method according to an exemplary embodiment;
FIG. 3 is a flow diagram illustrating another speech synthesis method according to an exemplary embodiment;
FIG. 4 is a process flow diagram illustrating a speech synthesis model according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating a speech synthesis model according to an exemplary embodiment;
FIG. 6 is a process flow diagram illustrating another speech synthesis model according to an exemplary embodiment;
FIG. 7 is a flow diagram illustrating the training of a speech synthesis model according to an exemplary embodiment;
FIG. 8 is a flow diagram illustrating another method of training a speech synthesis model according to an exemplary embodiment;
FIG. 9 is a block diagram illustrating a speech synthesis apparatus according to an exemplary embodiment;
FIG. 10 is a block diagram illustrating another speech synthesis apparatus according to an exemplary embodiment;
FIG. 11 is a block diagram illustrating another speech synthesis apparatus according to an exemplary embodiment;
FIG. 12 is a block diagram illustrating an electronic device according to an exemplary embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting; those skilled in the art will understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
FIG. 1 is a flow diagram illustrating a method of speech synthesis, as shown in FIG. 1, according to an exemplary embodiment, the method comprising:
Step 101, acquiring a text to be synthesized and a specified emotion type.
For example, the text that needs to be synthesized is first acquired. The text to be synthesized may be, for example, one or more sentences, one or more paragraphs, one or more chapters, or one or more words in a text file specified by a user. The text file may be, for example, an electronic book, or another type of file such as news, a public article, or a blog post. A specified emotion type is also acquired; the specified emotion type can be understood as the emotion that the user specifies and expects the synthesized audio (i.e., the target audio mentioned later) to carry. The specified emotion type may take the form of an emotion tag, such as 0001 for happy, 0011 for surprised, 1010 for hate, 1011 for angry, 0101 for shy, 0100 for fear, 1000 for sad, and 1001 for shivering.
Step 102, extracting a phoneme sequence corresponding to the text to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes.
For example, the text to be synthesized may be input into a pre-trained recognition model to obtain the phoneme sequence output by that model. Alternatively, the phonemes corresponding to each word in the text to be synthesized may be looked up in a pre-established dictionary, and the phonemes of all the words taken together form the phoneme sequence. A phoneme can be understood as a phonetic unit obtained by dividing the pronunciation of each word, or as the initials and finals of the pinyin corresponding to each character. The phoneme sequence contains the phonemes corresponding to every word in the text to be synthesized (one word may correspond to one or more phonemes). For example, if the text to be synthesized is the Chinese sentence meaning "the sunlight falls on the grass", the phonemes of each character can be looked up in the dictionary in turn, giving the phoneme sequence "yang guang sa zai cao di shang".
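As an illustration of the dictionary-lookup approach, the following Python sketch builds a phoneme sequence from a small hand-written pronunciation lexicon; the lexicon contents and the character-by-character tokenization are assumptions made for the example only, not the patent's actual data or interface.

```python
# Minimal sketch of the dictionary-lookup approach, assuming a tiny hand-written
# pronunciation lexicon (initials/finals of the pinyin for each character).
PRONUNCIATION_LEXICON = {
    "阳": ["y", "ang"], "光": ["g", "uang"], "洒": ["s", "a"], "在": ["z", "ai"],
    "草": ["c", "ao"], "地": ["d", "i"], "上": ["sh", "ang"],
}

def text_to_phoneme_sequence(text: str) -> list:
    """Look up the phonemes of each character and concatenate them in order."""
    phonemes = []
    for char in text:
        phonemes.extend(PRONUNCIATION_LEXICON.get(char, []))
    return phonemes

print(text_to_phoneme_sequence("阳光洒在草地上"))
# ['y', 'ang', 'g', 'uang', 's', 'a', 'z', 'ai', 'c', 'ao', 'd', 'i', 'sh', 'ang']
```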
Step 103, inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, which is output by the speech synthesis model, wherein the audio frame corresponding to each phoneme in the target audio is matched with the acoustic feature corresponding to the phoneme in the acoustic feature sequence, the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence comprises the acoustic feature corresponding to each phoneme, and the acoustic feature is used for indicating the prosodic feature of the phoneme.
For example, after the phoneme sequence is obtained, the phoneme sequence and the specified emotion type can be used as the input of the pre-trained speech synthesis model. The speech synthesis model may first determine a corresponding acoustic feature sequence from the phoneme sequence, where the acoustic feature sequence includes the acoustic feature corresponding to each phoneme. An acoustic feature can be understood as a representation of the prosodic features of a phoneme. The acoustic feature may have multiple dimensions, for example one or more of fundamental frequency (pitch), volume (energy), and speech rate (duration), and may further include noise level (a characteristic reflecting the amount of noise in the audio), pitch, timbre, loudness, and so on; the present disclosure does not limit this. In other words, the speech synthesis model can predict the acoustic feature of each phoneme from the input phoneme sequence. The speech synthesis model then outputs, according to the phoneme sequence, the acoustic feature sequence, and the specified emotion type, target audio that corresponds to the text to be synthesized and has the specified emotion type, in which the audio frames corresponding to each phoneme match the acoustic feature of that phoneme in the acoustic feature sequence. The target audio contains a plurality of audio frames; dividing it according to the phonemes in the phoneme sequence yields one or more audio frames per phoneme, and the audio frames of each phoneme satisfy the acoustic feature of that phoneme. For example, suppose a phoneme's acoustic feature has a fundamental frequency of 50 Hz, a volume of 30 dB, and a speech rate (which can be understood as the duration of the phoneme) of 60 ms. If the frame rate of the target audio is 20 ms per frame, the phoneme corresponds to 3 audio frames in the target audio, and those 3 frames all have a fundamental frequency of 50 Hz and a volume of 30 dB. The speech synthesis model is trained in advance and can be understood as a TTS (Text-To-Speech) model that generates, from the phoneme sequence and the specified emotion type, target audio that corresponds to the text to be synthesized, has the specified emotion type, and matches the acoustic feature sequence. Specifically, the speech synthesis model may be trained on the basis of a Tacotron model, a DeepVoice 3 model, a Tacotron 2 model, or the like, which the present disclosure does not specifically limit. In this way, the speech synthesis model can control the emotion of the target audio according to the specified emotion type, and can predict the acoustic feature of each phoneme from the phoneme sequence of the text to be synthesized, so as to control phoneme-level prosody in the target audio through those acoustic features.
Control over both the text-level emotion type and the phoneme-level acoustic features can therefore be achieved during speech synthesis, improving the expressiveness of the target audio and allowing it to exhibit a variety of emotional intensities and emotional expressions on the basis of the specified emotion type, so that it is closer to real speech and improves the listening experience of users.
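As a sketch of the overall inference flow (steps 101 to 103), the snippet below wires the pieces together; the model object and its call signature are stand-ins assumed for illustration, not the patent's actual API.

```python
from typing import Sequence

# Illustrative subset of the emotion tags from the description above.
EMOTION_TAGS = {"happy": "0001", "surprised": "0011", "angry": "1011"}

def synthesize(tts_model, phoneme_sequence: Sequence, emotion: str):
    emotion_tag = EMOTION_TAGS[emotion]              # step 101: specified emotion type
    # Step 103: the model internally predicts one acoustic feature per phoneme
    # (fundamental frequency, volume, duration, ...) and renders audio frames
    # whose prosody matches those features.
    return tts_model(phoneme_sequence, emotion_tag)  # target audio waveform

# waveform = synthesize(pretrained_model, ["y", "ang", "g", "uang"], "happy")
```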
In summary, according to the present disclosure, a text to be synthesized and a specified emotion type are first acquired; a phoneme sequence corresponding to the text to be synthesized and comprising a plurality of phonemes is then extracted; and finally the phoneme sequence and the specified emotion type are used as the input of a pre-trained speech synthesis model, so as to obtain target audio output by the speech synthesis model that corresponds to the text to be synthesized, has the specified emotion type, and in which the audio frames corresponding to each phoneme match the acoustic feature corresponding to that phoneme in an acoustic feature sequence. The acoustic feature sequence is determined by the speech synthesis model from the phoneme sequence and includes one acoustic feature per phoneme. The emotion of the target audio is controlled by the specified emotion type, and the acoustic feature of each phoneme of the text can be predicted, so the phoneme-level prosody of the target audio can be controlled through the acoustic features. Control along two dimensions, the text-level emotion type and the phoneme-level acoustic features, is therefore achieved during speech synthesis, which improves the expressiveness of the target audio.
Fig. 2 is a flow diagram illustrating another speech synthesis method according to an example embodiment, which may further include, as shown in fig. 2:
Step 104, acquiring a timbre code corresponding to a specified speaker.
Correspondingly, the implementation manner of step 103 is:
and coding the phoneme sequence, the designated emotion type and the timbre, and inputting the phoneme sequence, the designated emotion type and the timbre into a voice synthesis model to obtain a target audio output by the voice synthesis model, wherein the target audio has the timbre of a designated speaker, the voice synthesis model is obtained by training according to the corpus corresponding to a plurality of speakers, and the plurality of speakers comprise the designated speaker.
In one application scenario, a timbre code corresponding to a specified speaker may also be acquired; the specified speaker can be understood as the speaker whose timbre the user specifies and expects the target audio synthesized from the text to carry. Accordingly, when generating the target audio, the phoneme sequence, the specified emotion type, and the timbre code may be input into the speech synthesis model together, so that the speech synthesis model outputs target audio with the timbre of the specified speaker. The speech synthesis model is trained in advance on corpora of a plurality of speakers and can generate audio with the timbre of each of those speakers; the specified speaker may be any one of them. For example, suppose the speech synthesis model is trained on the corpora of speakers A, B, and C, whose timbre codes are 001, 010, and 100 respectively. If the specified speaker is speaker C, the corresponding timbre code is 100, and the phoneme sequence, the specified emotion type, and 100 can be input into the speech synthesis model together to obtain target audio with the timbre of speaker C.
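A minimal sketch of how such a speaker (timbre) code could be turned into the vector fed to the model, assuming it is realized as an embedding lookup; only the one-hot style codes 001/010/100 come from the example above, the embedding module and its dimension are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical timbre table: maps a speaker id to a timbre vector fed to the decoder.
speaker_ids = {"speaker_a": 0, "speaker_b": 1, "speaker_c": 2}   # codes 001 / 010 / 100 above
timbre_table = nn.Embedding(num_embeddings=3, embedding_dim=64)  # 64 is an illustrative size

speaker = torch.tensor([speaker_ids["speaker_c"]])
timbre_vector = timbre_table(speaker)   # 1 x 64 timbre vector representing speaker C
```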
Fig. 3 is a flow diagram illustrating another speech synthesis method according to an example embodiment, which may further include, as shown in fig. 3:
step 105, obtaining a specified acoustic feature sequence, where the specified acoustic feature sequence includes specified acoustic features corresponding to each phoneme.
Correspondingly, the implementation manner of step 103 is:
and inputting the phoneme sequence, the specified emotion type and the specified acoustic feature sequence into the speech synthesis model to obtain target audio output by the speech synthesis model, wherein the audio frame corresponding to each phoneme in the target audio is matched with the specified acoustic feature corresponding to the phoneme in the specified acoustic feature sequence.
In another application scenario, a specified acoustic feature sequence may also be acquired, which includes a specified acoustic feature for each phoneme. The specified acoustic feature sequence can be understood as the user's specification of the acoustic features that the target audio synthesized from the text should match. Accordingly, when generating the target audio, the phoneme sequence, the specified emotion type, and the specified acoustic feature sequence may be input into the speech synthesis model together, so that the audio frames corresponding to each phoneme in the target audio output by the speech synthesis model match the specified acoustic feature of that phoneme in the specified acoustic feature sequence. If a specified acoustic feature sequence is provided, the speech synthesis model does not need to determine an acoustic feature sequence from the phoneme sequence and can generate the target audio directly from the phoneme sequence, the specified acoustic feature sequence, and the specified emotion type. In this way, both the text-level emotion type and the phoneme-level acoustic features can be explicitly controlled during speech synthesis, further improving the expressiveness of the target audio. For example, in an audio production studio, a designer can adjust the specified acoustic feature of each phoneme one by one, thereby tuning the emotional intensity and emotional expression of the target audio, further improving its expressiveness and increasing the flexibility and operability of audio production.
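As an illustration of what a user-specified acoustic feature sequence might look like, the sketch below defines one feature record per phoneme; the field names and the keyword argument are hypothetical, not the patent's interface.

```python
from dataclasses import dataclass

@dataclass
class PhonemeAcoustics:
    pitch_hz: float      # fundamental frequency
    energy_db: float     # volume
    duration_ms: float   # speech rate (duration of the phoneme)

# One entry per phoneme; a designer can tweak each value individually.
specified_features = [
    PhonemeAcoustics(pitch_hz=50.0, energy_db=30.0, duration_ms=60.0),
    PhonemeAcoustics(pitch_hz=55.0, energy_db=28.0, duration_ms=80.0),
]

# target_audio = tts_model(phoneme_sequence, emotion_tag, acoustic_features=specified_features)
```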
FIG. 4 is a process flow diagram illustrating a speech synthesis model according to an exemplary embodiment. As shown in FIG. 4, the speech synthesis model may be used to perform the following steps:
and step A, determining a text characteristic sequence corresponding to the text to be synthesized according to the phoneme sequence, wherein the text characteristic sequence comprises text characteristics corresponding to each phoneme.
And step B, determining an acoustic feature sequence according to the text feature sequence.
And step C, determining the appointed emotional characteristics corresponding to the appointed emotional types, and expanding the appointed emotional characteristics according to the phoneme sequence to obtain an emotional characteristic sequence.
And D, generating a target audio according to the text characteristic sequence, the acoustic characteristic sequence and the emotion characteristic sequence.
For example, in the process of synthesizing the target audio with the speech synthesis model, a text feature sequence (text embedding) corresponding to the text to be synthesized may first be extracted from the phoneme sequence. The text feature sequence includes a text feature for each phoneme in the phoneme sequence, and a text feature can be understood as a text vector that represents the phoneme. For example, if the phoneme sequence includes 100 phonemes and the text vector of each phoneme is a 1 x 80-dimensional vector, the text feature sequence may be a 100 x 80-dimensional matrix.
The acoustic feature sequence may then be determined from the text feature sequence. It is understood that the speech synthesis model can predict the acoustic features corresponding to each phoneme according to the text features corresponding to the phoneme, so as to obtain the acoustic feature sequence.
The speech synthesis model can also determine a specified emotion feature from the specified emotion type; the specified emotion feature can be understood as an emotion vector that represents the specified emotion type. In one approach, the speech synthesis model may include a lookup table that maps the tag corresponding to the specified emotion type into a multidimensional emotion vector. For example, if the specified emotion type is shy, with the corresponding tag 0101, the lookup table can map 0101 to a 1 x 50-dimensional emotion vector serving as the specified emotion feature that represents shy. In another approach, the speech synthesis model may include a GST module containing a plurality of GSTs (Global Style Tokens); the GST module can represent the specified emotion type with the pre-trained GSTs, that is, the specified emotion feature can be obtained as a weighted sum of the GSTs. The GST module may be trained independently in advance on a large number of training samples, or trained jointly with the speech synthesis model; the present disclosure does not specifically limit this. Further, after the specified emotion feature corresponding to the specified emotion type is obtained, it can be extended according to the phoneme sequence to obtain an emotion feature sequence, which contains an emotion feature for each phoneme in the phoneme sequence. For example, the length of the phoneme sequence (i.e., the number of phonemes it contains) may be determined first, and the specified emotion feature may then be copied to obtain an emotion feature sequence of the same length as the phoneme sequence, in which every emotion feature equals the specified emotion feature; that is, the emotion feature of each phoneme in the emotion feature sequence is the specified emotion feature. For example, if the length of the phoneme sequence is 100 (i.e., it contains 100 phonemes), the emotion feature of each phoneme is set to the specified emotion feature, and the 100 emotion features are combined into the emotion feature sequence. Taking a 1 x 50-dimensional specified emotion feature as an example, the emotion feature sequence consists of 100 such 1 x 50-dimensional vectors and can form a 100 x 50-dimensional matrix.
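A minimal PyTorch sketch of the extension in step C: a single emotion vector is repeated so that the emotion feature sequence has one row per phoneme. The dimensions follow the 1 x 50 / 100-phoneme example above; the values are random placeholders.

```python
import torch

num_phonemes = 100
specified_emotion = torch.randn(1, 50)                         # 1 x 50 specified emotion feature
emotion_sequence = specified_emotion.expand(num_phonemes, -1)  # 100 x 50, every row identical
```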
After the text feature sequence, the acoustic feature sequence, and the emotion feature sequence are obtained, they can be combined to generate target audio that has the specified emotion type and in which the audio frames of each phoneme match the acoustic feature of that phoneme. For example, the text feature sequence, the acoustic feature sequence, and the emotion feature sequence may be concatenated to obtain a combined sequence, from which the target audio is then generated. For example, if the phoneme sequence includes 100 phonemes, the text feature sequence may be a 100 x 80-dimensional matrix, the acoustic feature sequence a 100 x 5-dimensional matrix, and the emotion feature sequence a 100 x 50-dimensional matrix, so the combined sequence is a 100 x 135-dimensional matrix from which the target audio is generated.
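A sketch of the combination in step D, using the dimensions quoted above (80 text + 5 acoustic + 50 emotion = 135 per phoneme); the tensors are random placeholders standing in for the real sequences.

```python
import torch

text_features = torch.randn(100, 80)      # text feature sequence
acoustic_features = torch.randn(100, 5)   # acoustic feature sequence
emotion_features = torch.randn(100, 50)   # emotion feature sequence

combined = torch.cat([text_features, acoustic_features, emotion_features], dim=-1)
assert combined.shape == (100, 135)       # combined sequence fed to the attention network
```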
Taking the speech synthesis model shown in FIG. 5 as an example, the speech synthesis model may be a GST-based Tacotron model, which includes a text encoder (i.e., an Encoder), an acoustic feature encoder, a GST module, and a synthesizer; the synthesizer may include an attention network, a decoder (Decoder), and a post-processing network. The GST module may determine the weights of the plurality of GSTs according to the specified emotion type and then take a weighted sum of the GSTs to obtain the specified emotion feature, which can subsequently be extended to obtain the emotion feature sequence.
The text encoder may include an embedding layer (i.e., a character embedding layer), a pre-processing network (Pre-net) sub-model, and a CBHG (Convolution Bank + Highway network + bidirectional Gated recurrent unit) sub-model. The phoneme sequence is input into the text encoder: the embedding layer first converts the phoneme sequence into word vectors; the word vectors are then fed into the Pre-net sub-model, which applies a non-linear transformation to improve the convergence and generalization of the speech synthesis model; finally, the CBHG sub-model produces, from the transformed word vectors, the text feature sequence that represents the text to be synthesized.
The text feature sequence may then be input into the acoustic feature encoder, which predicts the acoustic feature of each phoneme to obtain the acoustic feature sequence. The acoustic feature encoder may, for example, consist of 3 LSTM (Long Short-Term Memory) layers and a linear layer. It may be trained independently in advance on a large number of training samples, or trained jointly with the speech synthesis model; the present disclosure does not specifically limit this.
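A sketch of an acoustic feature encoder with the structure mentioned above (3 LSTM layers followed by a linear layer); the hidden size and the number of predicted features per phoneme are assumptions.

```python
import torch
import torch.nn as nn

class AcousticFeatureEncoder(nn.Module):
    """Predicts one acoustic feature vector (e.g. pitch, energy, duration) per phoneme."""
    def __init__(self, text_dim=80, hidden_dim=256, num_features=3):
        super().__init__()
        self.lstm = nn.LSTM(text_dim, hidden_dim, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden_dim, num_features)

    def forward(self, text_feature_seq):         # (batch, num_phonemes, text_dim)
        hidden, _ = self.lstm(text_feature_seq)
        return self.proj(hidden)                  # (batch, num_phonemes, num_features)

predicted = AcousticFeatureEncoder()(torch.randn(1, 100, 80))
```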
Finally, the acoustic feature sequence, the emotion feature sequence, and the text feature sequence can be concatenated into a combined sequence, which is then input into the attention network; the attention network adds an attention weight to each element of the combined sequence. Specifically, the attention network may be a Location Sensitive Attention network, a GMM (Gaussian Mixture Model) attention network, or a Multi-Head Attention network, which the present disclosure does not limit. The output of the attention network is then used as the input of the decoder. The decoder may include a pre-processing network sub-model (which may be the same as the pre-processing network sub-model in the encoder), an Attention-RNN, and a Decoder-RNN. The pre-processing network sub-model applies a non-linear transformation to its input; the Attention-RNN is a single layer of unidirectional zoneout-based LSTM, which takes the output of the pre-processing network sub-model as input and passes it through its LSTM units to the Decoder-RNN; the Decoder-RNN is two layers of unidirectional zoneout-based LSTM, whose LSTM units output Mel-spectrum information that may comprise one or more Mel-spectrum features. The Mel-spectrum information is finally input into the post-processing network, which may include a vocoder (e.g., a WaveNet vocoder or a Griffin-Lim vocoder) that converts the Mel-spectrum information into the target audio.
Further, in the scenario where the speech synthesis model can generate audio with the timbres of different speakers, the speech synthesis model may further include a timbre mapping table that maps the timbre code corresponding to the specified speaker into a high-dimensional timbre vector representing that speaker's timbre; the timbre vector is input into the decoder included in the synthesizer.
FIG. 6 is a process flow diagram illustrating another speech synthesis model according to an exemplary embodiment. As shown in FIG. 6, the speech synthesis model includes a plurality of pre-trained GSTs, and step C may accordingly be implemented as follows:
and step C1, determining the emotion codes of the appointed emotion types.
And step C2, using the emotion codes of the appointed emotion types as weighting coefficients of the GSTs to obtain appointed emotion characteristics.
For example, the corresponding emotion code may be determined according to the specified emotion type; the emotion code can be understood as a code identifying the specified emotion type. For the 8 emotions happy, surprised, hate, angry, shy, fear, sad, and shivering, an 8-bit binary code can be used: the emotion code for happy may be 00000001, for surprised 00000010, for hate 00000100, for angry 00001000, and so on.
Typically, a GST module includes a reference encoder, an attention module, and a plurality of GSTs. To obtain a specified emotion vector representing a specified emotion type, audio with that emotion type (or the Mel-spectrum information of the audio) must be input into the reference encoder, which outputs a high-dimensional vector representing the acoustic characteristics of the audio; the high-dimensional vector and the plurality of GSTs are then input into the attention module, which outputs the weights of the GSTs, and the weighted sum of the GSTs gives the specified emotion vector. However, the specified emotion vector obtained in this way is not very stable: given two different audio clips that both carry the happy emotion, the GST module will output different specified emotion vectors, so this way of representing a specified emotion type has low stability.
In the present disclosure, when the GST module is trained, the cross-entropy loss between the weights of the GSTs output by the attention module and the emotion code can be computed, and the parameters of the neurons in the GST module are corrected with a back-propagation algorithm so as to reduce this cross-entropy loss. Once the cross-entropy loss converges, the GSTs correspond one-to-one with the emotion types, i.e., each GST corresponds to one emotion type. Therefore, the emotion code of the specified emotion type determined in step C1 can be used directly as the weighting coefficients of the GSTs to obtain the specified emotion feature. For example, if the GST module includes 8 GSTs and the emotion code for hate is 00000100, the emotion feature obtained for hate is the 3rd GST selected by the weighting coefficients 00000100. In this way, the same emotion type always yields the same specified emotion feature from the GST module, which effectively improves the stability of the GST module.
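A sketch of using the emotion code directly as the GST weighting coefficients at inference time; the 8 tokens follow the example above, while the 50-dimensional token embeddings are an assumption.

```python
import torch

gst_tokens = torch.randn(8, 50)   # 8 pre-trained global style tokens (illustrative embeddings)
emotion_code = torch.zeros(8)
emotion_code[2] = 1.0             # one-hot emotion code such as "00000100": a single token is selected

# Weighted sum of the tokens; with a one-hot code this always picks the same token
# for a given emotion type, so the specified emotion feature is stable.
specified_emotion_feature = emotion_code @ gst_tokens   # shape: (50,)
```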
FIG. 7 is a flow diagram illustrating the training of a speech synthesis model according to an exemplary embodiment. As shown in FIG. 7, the speech synthesis model is trained in the following manner:
Step E, determining a real acoustic feature sequence according to the training audio corresponding to a training text and the training phoneme sequence corresponding to the training text, where the real acoustic feature sequence includes the real acoustic feature corresponding to each training phoneme in the training phoneme sequence.
Step F, acquiring emotion audio whose emotion type is the same as the emotion type of the training audio.
Step G, inputting the training phoneme sequence, the emotion audio, and the real acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio, where the speech synthesis model determines a training emotion feature from the emotion audio, and the training emotion feature represents the emotion type of the emotion audio.
The training process of the speech synthesis model is described below. First, a training text, the training audio corresponding to the training text, and the training phoneme sequence corresponding to the training text need to be obtained; there may be multiple training texts and, correspondingly, multiple training audios and training phoneme sequences. For example, a large amount of text may be crawled from the Internet as training texts, the audio corresponding to each training text is used as the training audio, and the training phoneme sequence is then determined, which includes the training phonemes corresponding to each word in the training text.
Further, the corresponding real acoustic feature sequence may be extracted from the training audio and the training phoneme sequence, where the real acoustic feature sequence includes the real acoustic feature corresponding to each training phoneme. For example, the real acoustic features of each audio frame in the training audio can be obtained through signal processing, labeling, and the like. The real acoustic features of an audio frame indicate the prosodic features of that frame and may include at least one of the fundamental frequency, the volume, and the speech rate of the audio frame, and may further include noise level, pitch, timbre, loudness, and so on. Then, the one or more audio frames corresponding to each training phoneme are determined from the training phoneme sequence, and the real acoustic feature of that training phoneme is determined from them, thereby yielding the real acoustic feature sequence containing the real acoustic feature of each training phoneme. For example, if a training phoneme corresponds to 3 audio frames, the mean (or the maximum or minimum) of the real acoustic features of those 3 frames may be used as the real acoustic feature of the training phoneme.
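A sketch of turning frame-level features into per-phoneme real acoustic features by averaging, as described above; the frame features and the frames-per-phoneme alignment are placeholder values (in practice the alignment would come from labeling or forced alignment).

```python
import torch

frame_features = torch.randn(9, 3)   # 9 frames x (pitch, energy, duration), placeholder values
frames_per_phoneme = [3, 2, 4]        # e.g. the first training phoneme spans 3 frames

phoneme_features, start = [], 0
for count in frames_per_phoneme:
    # Mean over the frames of this phoneme (the text notes max or min could also be used).
    phoneme_features.append(frame_features[start:start + count].mean(dim=0))
    start += count
real_acoustic_sequence = torch.stack(phoneme_features)   # (num_phonemes, 3)
```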
The emotion audio can then be acquired; its emotion type is the same as that of the training audio. That is, the emotion audio only needs to share the emotion type of the training audio: it may be the training audio itself or a different recording. For example, the emotion audio can be obtained from a preset sound library according to the emotion type of the training audio.
Finally, the training phoneme sequence, the emotion audio, and the real acoustic feature sequence are used as the input of the speech synthesis model, and the speech synthesis model is trained according to its output and the training audio. The speech synthesis model also determines a training emotion feature from the emotion audio, and the training emotion feature represents the emotion type of the emotion audio. Training the speech synthesis model may, for example, take the difference (or mean square error) between the output of the speech synthesis model and the training audio as the loss function and use a back-propagation algorithm to correct the parameters of the neurons in the model, such as their weights and biases, with the goal of reducing this loss function. The process is repeated until the loss function satisfies a preset condition, for example, until it is smaller than a preset loss threshold.
FIG. 8 is a flow diagram illustrating another method of training a speech synthesis model according to an exemplary embodiment. Taking the speech synthesis model shown in FIG. 5 as an example, as shown in FIG. 8, the model includes a text encoder, an acoustic feature encoder, a GST module, and a synthesizer. The GST module includes a reference encoder and an emotion token layer (not shown in FIG. 5) containing a plurality of GSTs. Further, a blocking structure is arranged between the text encoder and the acoustic feature encoder to prevent the acoustic feature encoder from passing gradients back to the text encoder.
The blocking structure may, for example, be stop_gradient(), which truncates the gradient of the acoustic feature encoder's loss and thus prevents the acoustic feature encoder from passing gradients back to the text encoder. That is, adjusting the acoustic feature encoder according to its own loss function (the second loss mentioned later) does not affect the text encoder, which avoids instability when training the speech synthesis model.
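In PyTorch terms, the blocking structure can be expressed with detach(), which keeps the forward pass intact but stops gradients; the linear layers below are stand-ins for the real encoders.

```python
import torch
import torch.nn as nn

text_encoder = nn.Linear(10, 80)      # stand-in for the text encoder
acoustic_encoder = nn.Linear(80, 3)   # stand-in for the acoustic feature encoder

phoneme_embeddings = torch.randn(100, 10)
text_features = text_encoder(phoneme_embeddings)

# detach() plays the role of stop_gradient(): the acoustic feature encoder's loss
# (the second loss) can no longer propagate back into the text encoder.
predicted_acoustics = acoustic_encoder(text_features.detach())
```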
Accordingly, the implementation manner of step G may include:
and G1, extracting a training text feature sequence corresponding to the training text through a text encoder, wherein the training text feature sequence comprises training text features corresponding to each training phoneme.
And G2, extracting a predicted acoustic feature sequence corresponding to the training text feature sequence through the acoustic feature encoder, wherein the predicted acoustic feature sequence comprises a predicted acoustic feature corresponding to each training phoneme.
And G3, extracting a reference vector corresponding to the emotion audio through a reference encoder, and determining training weighting coefficients corresponding to the GSTs according to the reference vector through an emotion mark layer.
And G4, generating the output of the voice synthesis model according to the training text feature sequence, the real acoustic feature sequence and the training emotional feature sequence through a synthesizer, wherein the training emotional feature sequence is obtained by extending the training emotional features according to the training phoneme sequence, and the training emotional features are determined according to the training weighting coefficients and the GSTs.
For example, when the speech synthesis model is trained, the training phoneme sequence of the training text may be input into the text encoder to obtain the training text feature sequence it outputs; the training text feature sequence includes the training text feature corresponding to each training phoneme, and a training text feature can be understood as a text vector that represents the training phoneme.
The training text feature sequence may then be input into the acoustic feature encoder to obtain the predicted acoustic feature sequence it outputs, which includes the predicted acoustic feature corresponding to each training phoneme; a predicted acoustic feature can be understood as the acoustic feature encoder's prediction of the acoustic feature of that training phoneme. The acoustic feature encoder can be trained independently in advance on a large number of training samples, or trained jointly with the speech synthesis model. For example, while the speech synthesis model is being trained, a loss function (the second loss mentioned later) may be determined from the predicted acoustic feature sequence and the real acoustic feature sequence determined in step E, and this loss function is used to train the acoustic feature encoder.
The emotion audio may then be input into the reference encoder to obtain the reference vector it outputs, which can be understood as a vector representing the acoustic characteristics of the emotion audio. The reference vector is input into the emotion token layer to obtain the training weighting coefficients of the plurality of GSTs output by the emotion token layer; the emotion token layer may include an attention module and the plurality of GSTs. After the training weighting coefficients are obtained, the GSTs may be weighted and summed according to them to obtain the training emotion feature that represents the emotion type of the emotion audio. For example, if the emotion token layer contains 3 GSTs A, B, and C and the training weighting coefficients are 0.65, 0.2, and 0.15, the training emotion feature is 0.65A + 0.2B + 0.15C. The training emotion feature is then extended according to the training phoneme sequence to obtain the training emotion feature sequence, which includes the training emotion feature corresponding to each training phoneme. The GST module can be trained independently in advance on a large number of training samples, or trained jointly with the speech synthesis model. For example, while the speech synthesis model is being trained, a loss function (the third loss mentioned later) may be determined from the training weighting coefficients and the emotion type of the emotion audio, and this loss function is used to train the GST module.
Finally, the training text feature sequence, the real acoustic feature sequence, and the training emotion feature sequence are concatenated and input into the synthesizer together to generate the output of the speech synthesis model.
In one application scenario, the loss function of the speech synthesis model is determined by a first loss, determined from the output of the speech synthesis model and the training audio; a second loss, determined from the predicted acoustic feature sequence and the real acoustic feature sequence; and a third loss, determined from the training weighting coefficients and the emotion code of the emotion type of the emotion audio.
For example, the loss function of the speech synthesis model may be determined from the first loss, the second loss, and the third loss, for example as their weighted sum or their average. The first loss can be understood as a loss function determined from the difference (or mean square error) between the output of the speech synthesis model and the training audio corresponding to the training text. The second loss can be understood as the loss function of the acoustic feature encoder, i.e., a loss function determined from the difference (which may also be the mean square error) between the predicted acoustic feature sequence output by the acoustic feature encoder and the real acoustic feature sequence. The third loss can be understood as the loss function of the GST module, i.e., a loss function determined from the training weighting coefficients output by the emotion token layer and the emotion type of the emotion audio. Specifically, an emotion classifier may be provided, and the training weighting coefficients and the emotion code (for example, a one-hot code) of the emotion type of the emotion audio are input into the emotion classifier to obtain the cross-entropy loss between the training weighting coefficients and the emotion code, which is used as the third loss. In this way, while the speech synthesis model is trained, the weights and connections of the neurons in the speech synthesis model can be adjusted as a whole, and the weights and connections of the neurons in the acoustic feature encoder and the GST module can be adjusted as well, which ensures the accuracy and effectiveness of the speech synthesis model, the acoustic feature encoder, and the GST module.
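A sketch of combining the three losses into one training objective; the mean-squared-error and cross-entropy choices follow the description above, but the plain unweighted sum and the shapes of the inputs are assumptions.

```python
import torch.nn.functional as F

def total_loss(pred_audio, train_audio,           # first loss: synthesizer output vs. training audio
               pred_acoustics, real_acoustics,     # second loss: acoustic feature encoder
               gst_weight_logits, emotion_class):  # third loss: GST weighting coefficients vs. emotion code
    first = F.mse_loss(pred_audio, train_audio)
    second = F.mse_loss(pred_acoustics, real_acoustics)
    third = F.cross_entropy(gst_weight_logits, emotion_class)  # emotion_class: index of the one-hot emotion code
    return first + second + third
```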
In summary, according to the present disclosure, a text to be synthesized and a specified emotion type are first acquired; a phoneme sequence corresponding to the text to be synthesized and comprising a plurality of phonemes is then extracted; and finally the phoneme sequence and the specified emotion type are used as the input of a pre-trained speech synthesis model, so as to obtain target audio output by the speech synthesis model that corresponds to the text to be synthesized, has the specified emotion type, and in which the audio frames corresponding to each phoneme match the acoustic feature corresponding to that phoneme in an acoustic feature sequence. The acoustic feature sequence is determined by the speech synthesis model from the phoneme sequence and includes one acoustic feature per phoneme. The emotion of the target audio is controlled by the specified emotion type, and the acoustic feature of each phoneme of the text can be predicted, so the phoneme-level prosody of the target audio can be controlled through the acoustic features. Control along two dimensions, the text-level emotion type and the phoneme-level acoustic features, is therefore achieved during speech synthesis, which improves the expressiveness of the target audio.
Fig. 9 is a block diagram illustrating a speech synthesis apparatus according to an exemplary embodiment, and as shown in fig. 9, the apparatus 200 includes:
the first obtaining module 201 is configured to obtain a text to be synthesized and a specified emotion type.
The extracting module 202 is configured to extract a phoneme sequence corresponding to the text to be synthesized, where the phoneme sequence includes a plurality of phonemes.
The synthesis module 203 is configured to input the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, where the audio frame corresponding to each phoneme in the target audio is matched with the acoustic feature corresponding to the phoneme in the acoustic feature sequence, the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence includes the acoustic feature corresponding to each phoneme, and the acoustic feature is used to indicate a prosodic feature of the phoneme.
Fig. 10 is a block diagram illustrating another speech synthesis apparatus according to an exemplary embodiment, and as shown in fig. 10, the apparatus 200 further includes:
A second obtaining module 204, configured to obtain a timbre code corresponding to a specified speaker.
Accordingly, the synthesis module 203 is configured to:
and coding the phoneme sequence, the designated emotion type and the timbre, and inputting the phoneme sequence, the designated emotion type and the timbre into a voice synthesis model to obtain a target audio output by the voice synthesis model, wherein the target audio has the timbre of a designated speaker, the voice synthesis model is obtained by training according to the corpus corresponding to a plurality of speakers, and the plurality of speakers comprise the designated speaker.
Fig. 11 is a block diagram illustrating another speech synthesis apparatus according to an exemplary embodiment, and as shown in fig. 11, the apparatus 200 further includes:
the third obtaining module 205 is configured to obtain a specified acoustic feature sequence, where the specified acoustic feature sequence includes a specified acoustic feature corresponding to each phoneme.
Accordingly, the synthesis module 203 is configured to:
and inputting the phoneme sequence, the specified emotion type and the specified acoustic feature sequence into the speech synthesis model to obtain target audio output by the speech synthesis model, wherein the audio frame corresponding to each phoneme in the target audio is matched with the specified acoustic feature corresponding to the phoneme in the specified acoustic feature sequence.
In one application scenario, the acoustic features include: at least one of fundamental frequency, volume and speech rate.
In an application scenario, the speech synthesis model may be used to perform the following steps:
and step A, determining a text characteristic sequence corresponding to the text to be synthesized according to the phoneme sequence, wherein the text characteristic sequence comprises text characteristics corresponding to each phoneme.
And step B, determining an acoustic feature sequence according to the text feature sequence.
And step C, determining the appointed emotional characteristics corresponding to the appointed emotional types, and expanding the appointed emotional characteristics according to the phoneme sequence to obtain an emotional characteristic sequence.
And D, generating a target audio according to the text characteristic sequence, the acoustic characteristic sequence and the emotion characteristic sequence.
In one implementation, step C may include:
Step C1: determining the emotion encoding of the specified emotion type.
Step C2: using the emotion encoding of the specified emotion type as weighting coefficients for the global style tokens (GSTs) to obtain the specified emotion feature.
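Step C2 amounts to a weighted sum over the token bank; in the sketch below the emotion encoding is taken to be a weight vector over the GSTs, which is an assumed layout rather than the patent's exact encoding.

```python
import torch

num_tokens, token_dim = 10, 256
gst_bank = torch.randn(num_tokens, token_dim)   # stand-in for the pre-trained GSTs

# Step C1: emotion encoding of the specified type, assumed here to be a
# weight vector over the tokens (e.g. one-hot for a single emotion).
emotion_encoding = torch.zeros(num_tokens)
emotion_encoding[2] = 1.0                        # say the specified emotion maps to token 2

# Step C2: the encoding acts as weighting coefficients over the GSTs.
specified_emotion_feature = emotion_encoding @ gst_bank   # shape: (token_dim,)
```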
In an application scenario, the speech synthesis model is obtained by training as follows:
Step E: determining a real acoustic feature sequence according to the training audio corresponding to a training text and the training phoneme sequence corresponding to the training text, where the real acoustic feature sequence includes the real acoustic feature corresponding to each training phoneme in the training phoneme sequence.
Step F: obtaining an emotion audio, where the emotion type of the emotion audio is the same as the emotion type of the training audio.
Step G: inputting the training phoneme sequence, the emotion audio and the real acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio, where the speech synthesis model determines training emotion features according to the emotion audio, and the training emotion features represent the emotion type of the emotion audio.
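In data-preparation terms, steps E and F can be sketched as assembling one training example; every helper name below (aligner, feature_fn, emotion_bank) is an assumption standing in for components the patent does not spell out.

```python
import random

def build_training_example(training_text, training_audio, emotion_bank, aligner, feature_fn):
    """training_audio: {"wav": waveform, "emotion": emotion_type}.
    emotion_bank maps an emotion type to reference audios of that emotion."""
    phonemes, durations = aligner(training_text, training_audio["wav"])       # phoneme/frame alignment
    real_acoustic_seq = feature_fn(training_audio["wav"], durations)          # step E: per-phoneme features
    emotion_audio = random.choice(emotion_bank[training_audio["emotion"]])    # step F: same emotion type
    return {"phonemes": phonemes,                                             # step G: model inputs
            "real_acoustic_seq": real_acoustic_seq,
            "emotion_audio": emotion_audio,
            "target_audio": training_audio["wav"]}
```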
In one implementation, the speech synthesis model includes: a text encoder, an acoustic feature encoder, a GST module and a synthesizer. The GST module includes a reference encoder and an emotion mark layer, and the emotion mark layer includes a plurality of GSTs. Further, a gradient-stop structure is arranged between the text encoder and the acoustic feature encoder to prevent the acoustic feature encoder from propagating gradients back to the text encoder.
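In PyTorch terms, such a gradient stop is typically a detach() between the two encoders; the sketch below illustrates the idea with toy layers and assumed sizes, not the patent's actual encoders.

```python
import torch.nn as nn

class TextThenAcoustic(nn.Module):
    """Sketch of the gradient stop between text encoder and acoustic feature encoder."""
    def __init__(self, num_phonemes=100, hidden=256):
        super().__init__()
        self.text_encoder = nn.Embedding(num_phonemes, hidden)
        self.acoustic_feature_encoder = nn.Linear(hidden, 3)   # predicts f0 / volume / rate

    def forward(self, phoneme_ids):
        text_feats = self.text_encoder(phoneme_ids)
        # detach() implements the stop: the acoustic-feature loss cannot
        # propagate gradients back into the text encoder.
        predicted_acoustics = self.acoustic_feature_encoder(text_feats.detach())
        return text_feats, predicted_acoustics
```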
Accordingly, the implementation manner of step G may include:
Step G1: extracting, through the text encoder, a training text feature sequence corresponding to the training text, where the training text feature sequence includes a training text feature corresponding to each training phoneme.
Step G2: extracting, through the acoustic feature encoder, a predicted acoustic feature sequence corresponding to the training text feature sequence, where the predicted acoustic feature sequence includes a predicted acoustic feature corresponding to each training phoneme.
Step G3: extracting, through the reference encoder, a reference vector corresponding to the emotion audio, and determining, through the emotion mark layer, the training weighting coefficients corresponding to the GSTs according to the reference vector.
Step G4: generating, through the synthesizer, the output of the speech synthesis model according to the training text feature sequence, the real acoustic feature sequence and the training emotion feature sequence, where the training emotion feature sequence is obtained by expanding the training emotion features along the training phoneme sequence, and the training emotion features are determined according to the training weighting coefficients and the GSTs.
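Step G3 is essentially attention of a reference vector over the token bank; the sketch below is one plausible GST-module layout (a GRU reference encoder plus dot-product attention) with assumed dimensions. The resulting training emotion feature would then be expanded along the training phoneme sequence as in step G4.

```python
import torch
import torch.nn as nn

class GSTModuleSketch(nn.Module):
    """Reference encoder + emotion mark layer (dot-product attention over GSTs)."""
    def __init__(self, n_mels=80, ref_dim=128, num_tokens=10, token_dim=256):
        super().__init__()
        self.reference_encoder = nn.GRU(n_mels, ref_dim, batch_first=True)
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))   # the GSTs
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, emotion_mels):                          # (B, T, n_mels)
        _, ref = self.reference_encoder(emotion_mels)
        query = self.query_proj(ref[-1])                      # reference vector -> query, (B, token_dim)
        weights = torch.softmax(query @ self.tokens.t(), dim=-1)    # training weighting coefficients
        training_emotion_feature = weights @ self.tokens             # (B, token_dim)
        return training_emotion_feature, weights
```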
In another implementation, the loss function of the speech synthesis model is determined by a first loss, a second loss and a third loss, where the first loss is determined by the output of the speech synthesis model and the training audio, the second loss is determined by the predicted acoustic feature sequence and the real acoustic feature sequence, and the third loss is determined by the training weighting coefficients and the emotion encoding of the emotion type of the emotion audio.
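A hedged reading of that loss is a simple sum of three terms; the particular choices of L1 and MSE below are assumptions, since the patent only states which pairs of quantities define each loss.

```python
import torch.nn.functional as F

def total_loss(model_output, training_audio_target,
               predicted_acoustic_seq, real_acoustic_seq,
               training_weights, emotion_encoding):
    first = F.l1_loss(model_output, training_audio_target)           # output vs. training audio
    second = F.mse_loss(predicted_acoustic_seq, real_acoustic_seq)   # predicted vs. real acoustics
    third = F.mse_loss(training_weights, emotion_encoding)           # weights vs. emotion encoding
    return first + second + third
```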
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In summary, according to the present disclosure, a text to be synthesized and a specified emotion type are first obtained, a phoneme sequence corresponding to the text to be synthesized and including a plurality of phonemes is then extracted, and finally the phoneme sequence and the specified emotion type are used as inputs of a pre-trained speech synthesis model to obtain a target audio output by the speech synthesis model, where the target audio corresponds to the text to be synthesized and has the specified emotion type, and the audio frame corresponding to each phoneme in the target audio matches the acoustic feature corresponding to that phoneme in an acoustic feature sequence. The acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence and includes the acoustic feature corresponding to each phoneme. In this way, the emotion of the target audio is controlled by the specified emotion type, while the acoustic feature predicted for each phoneme controls prosody at the phoneme level, so that speech synthesis is controlled in two dimensions, namely the emotion type at the text level and the acoustic features at the phoneme level, which improves the expressiveness of the target audio.
Referring now to fig. 12, there is shown a schematic structural diagram of an electronic device 300 (which may be the terminal device or the server serving as the execution subject in the above embodiments) suitable for implementing an embodiment of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player) or a vehicle terminal (e.g., a car navigation terminal), and a stationary terminal such as a digital TV or a desktop computer. The electronic device shown in fig. 12 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 12, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 12 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 309, or installed from the storage means 308, or installed from the ROM 302. The computer program, when executed by the processing device 301, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the terminal devices and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet) and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a text to be synthesized and an appointed emotion type; extracting a phoneme sequence corresponding to the text to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes; inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, wherein the audio frame corresponding to each phoneme in the target audio is matched with the acoustic feature corresponding to the phoneme in an acoustic feature sequence, the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence comprises the acoustic feature corresponding to each phoneme, and the acoustic feature is used for indicating the prosodic feature of the phoneme.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or any combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation on the module itself; for example, the first obtaining module may also be described as "a module that obtains the text to be synthesized and the specified emotion type".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a speech synthesis method, according to one or more embodiments of the present disclosure, including: acquiring a text to be synthesized and an appointed emotion type; extracting a phoneme sequence corresponding to the text to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes; inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, wherein the audio frame corresponding to each phoneme in the target audio is matched with the acoustic feature corresponding to the phoneme in an acoustic feature sequence, the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence comprises the acoustic feature corresponding to each phoneme, and the acoustic feature is used for indicating the prosodic feature of the phoneme.
Example 2 provides the method of example 1, further comprising, in accordance with one or more embodiments of the present disclosure: acquiring a timbre encoding corresponding to a specified speaker; the inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, which is output by the speech synthesis model, includes: inputting the phoneme sequence, the specified emotion type and the timbre encoding into the speech synthesis model to obtain the target audio output by the speech synthesis model, wherein the target audio has the timbre of the specified speaker, the speech synthesis model is obtained by training on corpora corresponding to a plurality of speakers, and the plurality of speakers include the specified speaker.
Example 3 provides the method of example 1, further comprising, in accordance with one or more embodiments of the present disclosure: acquiring a specified acoustic feature sequence, wherein the specified acoustic feature sequence comprises specified acoustic features corresponding to each phoneme; the inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, which is output by the speech synthesis model, includes: and inputting the phoneme sequence, the specified emotion type and the specified acoustic feature sequence into the speech synthesis model to obtain the target audio output by the speech synthesis model, wherein the audio frame corresponding to each phoneme in the target audio is matched with the specified acoustic feature corresponding to the phoneme in the specified acoustic feature sequence.
Example 4 provides the methods of examples 1-3, where the speech synthesis model is configured to: determine a text feature sequence corresponding to the text to be synthesized according to the phoneme sequence, wherein the text feature sequence comprises a text feature corresponding to each phoneme; determine the acoustic feature sequence according to the text feature sequence; determine a specified emotion feature corresponding to the specified emotion type, and expand the specified emotion feature along the phoneme sequence to obtain an emotion feature sequence; and generate the target audio according to the text feature sequence, the acoustic feature sequence and the emotion feature sequence.
Example 5 provides the method of example 4, where the speech synthesis model includes a plurality of pre-trained global style tokens (GST), and the determining the specified emotion feature corresponding to the specified emotion type includes: determining an emotion encoding for the specified emotion type; and using the emotion encoding of the specified emotion type as weighting coefficients for the GSTs to obtain the specified emotion feature.
Example 6 provides the method of example 1, the acoustic features including: at least one of fundamental frequency, volume and speech rate.
Example 7 provides the methods of examples 1-3, the speech synthesis model being obtained by training in the following manner: determining a real acoustic feature sequence according to a training audio corresponding to a training text and a training phoneme sequence corresponding to the training text, wherein the real acoustic feature sequence comprises: real acoustic features corresponding to each training phoneme in the training phoneme sequence; obtaining emotion audio, wherein the emotion audio has an emotion type which is the same as the emotion type of the training audio; inputting the training phoneme sequence, the emotion audio and the real acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio, wherein the speech synthesis model is used for determining training emotion features according to the emotion audio, and the training emotion features are used for representing emotion types of the emotion audio.
Example 8 provides the method of example 7, the speech synthesis model comprising: the device comprises a text encoder, an acoustic feature encoder, a GST module and a synthesizer; the GST module comprises a reference encoder and an emotion mark layer, wherein the emotion mark layer comprises a plurality of GSTs; a stopping structure is arranged between the text encoder and the acoustic feature encoder and is used for stopping the acoustic feature encoder from transmitting the gradient back to the text encoder; the inputting the training phoneme sequence, the emotion audio and the real acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio, including: extracting a training text feature sequence corresponding to the training text through the text encoder, wherein the training text feature sequence comprises a training text feature corresponding to each training phoneme; extracting a predicted acoustic feature sequence corresponding to the training text feature sequence through the acoustic feature encoder, wherein the predicted acoustic feature sequence comprises a predicted acoustic feature corresponding to each training phoneme; extracting a reference vector corresponding to the emotion audio through the reference encoder, and determining a plurality of training weighting coefficients corresponding to the GSTs according to the reference vector through the emotion mark layer; and generating the output of the voice synthesis model according to the training text feature sequence, the real acoustic feature sequence and the training emotional feature sequence through the synthesizer, wherein the training emotional feature sequence is obtained by extending the training emotional features according to the training phoneme sequence, and the training emotional features are determined according to the training weighting coefficients and the plurality of GSTs.
Example 9 provides the method of example 8, the loss function of the speech synthesis model being determined by a first loss determined by an output of the speech synthesis model and the training audio, a second loss determined by the predicted acoustic feature sequence and the true acoustic feature sequence, and a third loss determined by the training weighting coefficient and emotion encoding of the type of emotion that the emotion audio has.
Example 10 provides, in accordance with one or more embodiments of the present disclosure, a speech synthesis apparatus comprising: the first acquisition module is used for acquiring a text to be synthesized and an appointed emotion type; the extraction module is used for extracting a phoneme sequence corresponding to the text to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes; a synthesis module, configured to input the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, where an audio frame corresponding to each phoneme in the target audio matches an acoustic feature corresponding to the phoneme in an acoustic feature sequence, where the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence includes an acoustic feature corresponding to each phoneme, and the acoustic feature is used to indicate a prosodic feature of the phoneme.
Example 11 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the steps of the methods of examples 1-9, in accordance with one or more embodiments of the present disclosure.
Example 12 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to implement the steps of the methods of examples 1 to 9.
The foregoing description is only illustrative of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combinations of features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features disclosed in the present disclosure that have similar functions.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed merely as example forms of implementing the claims.

Claims (12)

1. A method of speech synthesis, the method comprising:
acquiring a text to be synthesized and an appointed emotion type;
extracting a phoneme sequence corresponding to the text to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes;
inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, wherein the audio frame corresponding to each phoneme in the target audio is matched with the acoustic feature corresponding to the phoneme in an acoustic feature sequence, the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence comprises the acoustic feature corresponding to each phoneme, and the acoustic feature is used for indicating the prosodic feature of the phoneme.
2. The method of claim 1, further comprising:
acquiring a tone color code corresponding to a specified speaker;
the inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, which is output by the speech synthesis model, includes:
inputting the phoneme sequence, the specified emotion type and the timbre encoding into the speech synthesis model to obtain the target audio output by the speech synthesis model, wherein the target audio has the timbre of the specified speaker, the speech synthesis model is obtained by training on corpora corresponding to a plurality of speakers, and the plurality of speakers comprise the specified speaker.
3. The method of claim 1, further comprising:
acquiring a specified acoustic feature sequence, wherein the specified acoustic feature sequence comprises specified acoustic features corresponding to each phoneme;
the inputting the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, which is output by the speech synthesis model, includes:
and inputting the phoneme sequence, the specified emotion type and the specified acoustic feature sequence into the speech synthesis model to obtain the target audio output by the speech synthesis model, wherein the audio frame corresponding to each phoneme in the target audio is matched with the specified acoustic feature corresponding to the phoneme in the specified acoustic feature sequence.
4. The method according to any of claims 1-3, wherein the speech synthesis model is used to:
determining a text feature sequence corresponding to the text to be synthesized according to the phoneme sequence, wherein the text feature sequence comprises a text feature corresponding to each phoneme;
determining the acoustic feature sequence according to the text feature sequence;
determining appointed emotional characteristics corresponding to the appointed emotional types, and expanding the appointed emotional characteristics according to the phoneme sequence to obtain an emotional characteristic sequence;
and generating the target audio according to the text feature sequence, the acoustic feature sequence and the emotion feature sequence.
5. The method of claim 4, wherein the speech synthesis model includes a plurality of pre-trained global style tokens (GST), and the determining the specific emotion characteristics corresponding to the specific emotion types includes:
determining an emotion encoding for the specified emotion type;
and using the emotion codes of the specified emotion types as weighting coefficients of the GSTs to obtain the specified emotion characteristics.
6. The method of claim 1, wherein the acoustic features comprise: at least one of fundamental frequency, volume and speech rate.
7. The method according to any of claims 1-3, wherein the speech synthesis model is obtained by training as follows:
determining a real acoustic feature sequence according to a training audio corresponding to a training text and a training phoneme sequence corresponding to the training text, wherein the real acoustic feature sequence comprises: real acoustic features corresponding to each training phoneme in the training phoneme sequence;
obtaining emotion audio, wherein the emotion audio has an emotion type which is the same as the emotion type of the training audio;
inputting the training phoneme sequence, the emotion audio and the real acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio, wherein the speech synthesis model is used for determining training emotion features according to the emotion audio, and the training emotion features are used for representing emotion types of the emotion audio.
8. The method of claim 7, wherein the speech synthesis model comprises: the device comprises a text encoder, an acoustic feature encoder, a GST module and a synthesizer; the GST module comprises a reference encoder and an emotion mark layer, wherein the emotion mark layer comprises a plurality of GSTs; a stopping structure is arranged between the text encoder and the acoustic feature encoder and is used for stopping the acoustic feature encoder from transmitting the gradient back to the text encoder;
the inputting the training phoneme sequence, the emotion audio and the real acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio, including:
extracting a training text feature sequence corresponding to the training text through the text encoder, wherein the training text feature sequence comprises a training text feature corresponding to each training phoneme;
extracting a predicted acoustic feature sequence corresponding to the training text feature sequence through the acoustic feature encoder, wherein the predicted acoustic feature sequence comprises a predicted acoustic feature corresponding to each training phoneme;
extracting a reference vector corresponding to the emotion audio through the reference encoder, and determining a plurality of training weighting coefficients corresponding to the GSTs according to the reference vector through the emotion mark layer;
and generating the output of the voice synthesis model according to the training text feature sequence, the real acoustic feature sequence and the training emotional feature sequence through the synthesizer, wherein the training emotional feature sequence is obtained by extending the training emotional features according to the training phoneme sequence, and the training emotional features are determined according to the training weighting coefficients and the plurality of GSTs.
9. The method of claim 8, wherein the loss function of the speech synthesis model is determined by a first loss determined by the output of the speech synthesis model and the training audio, a second loss determined by the predicted acoustic feature sequence and the real acoustic feature sequence, and a third loss determined by the training weighting factor and an emotion encoding of the emotion type possessed by the emotion audio.
10. A speech synthesis apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring a text to be synthesized and an appointed emotion type;
the extraction module is used for extracting a phoneme sequence corresponding to the text to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes;
a synthesis module, configured to input the phoneme sequence and the specified emotion type into a pre-trained speech synthesis model to obtain a target audio with the specified emotion type corresponding to the text to be synthesized, where an audio frame corresponding to each phoneme in the target audio matches an acoustic feature corresponding to the phoneme in an acoustic feature sequence, where the acoustic feature sequence is determined by the speech synthesis model according to the phoneme sequence, the acoustic feature sequence includes an acoustic feature corresponding to each phoneme, and the acoustic feature is used to indicate a prosodic feature of the phoneme.
11. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1-9.
12. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 9.
CN202110609251.XA 2021-06-01 2021-06-01 Speech synthesis method, device, readable medium and electronic equipment Pending CN113327580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110609251.XA CN113327580A (en) 2021-06-01 2021-06-01 Speech synthesis method, device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110609251.XA CN113327580A (en) 2021-06-01 2021-06-01 Speech synthesis method, device, readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113327580A true CN113327580A (en) 2021-08-31

Family

ID=77423342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110609251.XA Pending CN113327580A (en) 2021-06-01 2021-06-01 Speech synthesis method, device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113327580A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380952A1 (en) * 2019-05-31 2020-12-03 Google Llc Multilingual speech synthesis and cross-language voice cloning
KR20210012265A (en) * 2019-07-24 2021-02-03 주식회사 엘지유플러스 Providing method of voice, learning method for providing voice and apparatus thereof
CN111583904A (en) * 2020-05-13 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112349273A (en) * 2020-11-05 2021-02-09 携程计算机技术(上海)有限公司 Speech synthesis method based on speaker, model training method and related equipment
CN112802448A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Speech synthesis method and system for generating new tone
CN112786007A (en) * 2021-01-20 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG GONG ET AL.: "《Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations》", 《ICASSP 2021》, pages 5724 - 5728 *
PENGFEI WU ET AL.: "《End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supevised Training》", 《ARXIV:1906.10859V1》, pages 1 - 5 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023342A (en) * 2021-09-23 2022-02-08 北京百度网讯科技有限公司 Voice conversion method and device, storage medium and electronic equipment
CN114023342B (en) * 2021-09-23 2022-11-11 北京百度网讯科技有限公司 Voice conversion method, device, storage medium and electronic equipment
WO2023160553A1 (en) * 2022-02-25 2023-08-31 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and computer-readable medium and electronic device
WO2023221345A1 (en) * 2022-05-16 2023-11-23 网易(杭州)网络有限公司 Emotional speech synthesis method and apparatus
WO2023226260A1 (en) * 2022-05-27 2023-11-30 网易(杭州)网络有限公司 Voice generation method and apparatus, storage medium, and electronic device
CN117496944A (en) * 2024-01-03 2024-02-02 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117496944B (en) * 2024-01-03 2024-03-22 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination