CN111128118B - Speech synthesis method, related device and readable storage medium - Google Patents

Speech synthesis method, related device and readable storage medium Download PDF

Info

Publication number
CN111128118B
CN111128118B (application CN201911393613.5A)
Authority
CN
China
Prior art keywords
emotion
text
model
training
codes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911393613.5A
Other languages
Chinese (zh)
Other versions
CN111128118A (en)
Inventor
周良
王志鹍
江源
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911393613.5A priority Critical patent/CN111128118B/en
Publication of CN111128118A publication Critical patent/CN111128118A/en
Application granted granted Critical
Publication of CN111128118B publication Critical patent/CN111128118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The application discloses a speech synthesis method, related device and readable storage medium. After a text to be synthesized into speech is obtained, the emotion code corresponding to the text is determined, the speech synthesis parameters of the text are obtained using that emotion code, and speech synthesis processing is performed on those parameters to obtain the speech corresponding to the text. Because the emotion code indicates the emotion intensity with which the text is synthesized, and the user can control the emotion code according to his or her emotion intensity requirement for the synthesized speech, the speech obtained using the emotion code meets that requirement.

Description

Speech synthesis method, related device and readable storage medium
Technical Field
The present invention relates to the field of speech synthesis technology, and more particularly, to a speech synthesis method, a related device, and a readable storage medium.
Background
With the development of speech synthesis technology, synthesized speech is no longer evaluated only on naturalness and similar scores; the requirements on the emotional expressiveness of the synthesized audio are also becoming higher. However, conventional speech synthesis methods generally either convert text directly into speech or can only produce speech with a single emotion, and cannot control the emotion intensity of the synthesized speech.
Therefore, a speech synthesis method capable of controlling the emotion intensity of a synthesized speech is required.
Disclosure of Invention
In view of the foregoing, the present application provides a speech synthesis method, a related apparatus, and a readable storage medium. The specific scheme is as follows:
a method of speech synthesis, comprising:
acquiring a text to be subjected to voice synthesis;
determining emotion codes corresponding to the text, wherein the emotion codes are used for indicating emotion strength of voice synthesis;
determining speech synthesis parameters of the text based on the emotion encoding;
and performing voice synthesis processing on the voice synthesis parameters of the text to obtain voice corresponding to the text.
Optionally, the determining the emotion encoding corresponding to the text includes:
inputting the text into a text emotion code recognition model to obtain emotion codes corresponding to the text, wherein the text emotion code recognition model is obtained by pre-training an emotion recognition training text marked with emotion codes.
Optionally, determining the emotion encoding corresponding to the text includes:
acquiring a preset initial emotion code corresponding to the text;
and determining the emotion code corresponding to the text based on the initial emotion code.
Optionally, the determining, based on the initial emotion encoding, the emotion encoding corresponding to the text includes:
taking the initial emotion code as an emotion code corresponding to the text;
or, alternatively,
acquiring emotion intensity information, wherein the emotion intensity information is used for indicating emotion intensity requirements of a user on synthesized voice;
based on the emotion intensity information, adjusting the initial emotion encoding by using an interpolation method, wherein the adjusted emotion encoding is used as the emotion encoding corresponding to the text.
Optionally, the obtaining the preset initial emotion encoding corresponding to the text includes:
acquiring an emotion label corresponding to the text;
and determining an emotion code corresponding to the emotion label based on a preset corresponding relation between the emotion label and the emotion code, and taking the emotion code as an initial emotion code corresponding to the text.
Optionally, the determining, based on the emotion encoding, a speech synthesis parameter of the text includes:
acquiring a text unit sequence of the text;
and inputting the emotion codes and the text unit sequences into a fusion model to obtain voice synthesis parameters of the text output by the fusion model, wherein the fusion model is obtained by training the emotion codes of training voices and the text unit sequences of the texts corresponding to the training voices as training samples and the labeled voice synthesis parameters of the texts corresponding to the training voices as sample labels.
Optionally, the inputting the emotion code and the text unit sequence into a fusion model to obtain the speech synthesis parameters of the text output by the fusion model includes:
inputting the emotion codes and the text unit sequences into a duration model of a fusion model, and obtaining duration parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the duration model;
inputting the emotion codes and the text unit sequences into an acoustic model of a fusion model, and obtaining acoustic parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the acoustic model;
the adjusting weights of the emotion codes on the activating function of the time length model and the adjusting weights of the emotion codes on the activating function of the acoustic model are obtained by training by taking the time length parameter output by the time length model to approach the labeled time length parameter of the text corresponding to the training voice and the acoustic parameter output by the acoustic model to approach the labeled acoustic parameter of the text corresponding to the training voice as a training target.
A speech synthesis apparatus comprising:
an acquisition unit configured to acquire a text to be subjected to speech synthesis;
the emotion code determining unit is used for determining emotion codes corresponding to the text, and the emotion codes are used for indicating emotion intensity of voice synthesis;
a speech synthesis parameter determining unit configured to determine a speech synthesis parameter of the text based on the emotion encoding;
and the voice synthesis processing unit is used for carrying out voice synthesis processing on the voice synthesis parameters of the text to obtain voice corresponding to the text.
Optionally, the emotion encoding determination unit includes:
the first determining unit is used for inputting the text into a text emotion code recognition model to obtain the emotion code corresponding to the text, wherein the text emotion code recognition model is obtained by pre-training on emotion recognition training texts labeled with emotion codes.
Optionally, the emotion encoding determination unit includes:
the initial emotion code acquisition unit is used for acquiring a preset initial emotion code corresponding to the text;
and the second determining unit is used for determining the emotion encoding corresponding to the text based on the initial emotion encoding.
Optionally, the second determining unit includes:
the first determining subunit is used for taking the initial emotion code as an emotion code corresponding to the text;
or, alternatively,
the emotion intensity information acquisition unit is used for acquiring emotion intensity information, wherein the emotion intensity information is used for indicating emotion intensity requirements of a user on the voice to be synthesized;
and the second determining subunit is used for adjusting the initial emotion encoding by utilizing an interpolation method based on the emotion intensity information, and the adjusted emotion encoding is used as the emotion encoding corresponding to the text.
Optionally, the initial emotion encoding acquisition unit includes:
the emotion label acquisition unit is used for acquiring emotion labels corresponding to the text;
the initial emotion code acquisition subunit is used for determining the emotion code corresponding to the emotion label based on the preset corresponding relation between the emotion label and the emotion code, and taking the emotion code as the initial emotion code corresponding to the text.
Optionally, the speech synthesis parameter determining unit includes:
a text unit sequence obtaining unit, configured to obtain a text unit sequence of the text;
the fusion model processing unit is used for inputting the emotion codes and the text unit sequences into a fusion model to obtain voice synthesis parameters of the texts output by the fusion model, wherein the fusion model is obtained by training the emotion codes of training voices and the text unit sequences of the texts corresponding to the training voices as training samples and the labeled voice synthesis parameters of the texts corresponding to the training voices as sample labels.
Optionally, the fusion model processing unit includes:
the time length model processing unit is used for inputting the emotion codes and the text unit sequences into a time length model of a fusion model, and obtaining time length parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the time length model;
the acoustic model processing unit is used for inputting the emotion codes and the text unit sequences into an acoustic model of a fusion model, and obtaining acoustic parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the acoustic model;
the adjusting weights of the emotion codes on the activating function of the time length model and the adjusting weights of the emotion codes on the activating function of the acoustic model are obtained by training by taking the time length parameter output by the time length model to approach the labeled time length parameter of the text corresponding to the training voice and the acoustic parameter output by the acoustic model to approach the labeled acoustic parameter of the text corresponding to the training voice as a training target.
A speech synthesis apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the speech synthesis method as described above.
A readable storage medium having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the speech synthesis method as described above.
By means of the above technical solution, the application discloses a speech synthesis method, related device and readable storage medium. After a text to be synthesized into speech is obtained, the emotion code corresponding to the text is determined, the speech synthesis parameters of the text are obtained using that emotion code, and speech synthesis processing is performed on those parameters to obtain the speech corresponding to the text. Because the emotion code indicates the emotion intensity with which the text is synthesized, and the user can control the emotion code according to his or her emotion intensity requirement for the synthesized speech, the speech obtained using the emotion code meets that requirement.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of determining emotion tags corresponding to text by using emotion recognition models according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an interpolation method of 3-dimensional emotion encoding disclosed in an embodiment of the present application;
FIG. 4 is a schematic diagram of an interpolation method of 2-dimensional emotion encoding disclosed in an embodiment of the present application;
fig. 5 is a schematic diagram of a specific structure of a fusion model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a hardware structure of a speech synthesis apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Next, a speech synthesis method provided in the present application will be described by the following examples.
Referring to fig. 1, fig. 1 is a flow chart of a speech synthesis method according to an embodiment of the present application, where the method may include:
s101: and acquiring a text to be subjected to voice synthesis.
Text-to-speech (TTS) technology converts text into speech. In applications such as story machines, intelligent children's toys and human-computer interaction systems, the text to be synthesized may be text that requires emotion control during synthesis, such as story text or dialogue text.
In this application, the text to be synthesized may be obtained by user upload, or through a communication connection with another system; for example, a connection may be established with a human-computer interaction system, and the text entered by the user in that system may be obtained.
S102: and determining emotion codes corresponding to the texts, wherein the emotion codes are used for indicating emotion strength of voice synthesis.
Existing speech synthesis methods generally convert story text or dialogue text into speech mechanically; the synthesized speech carries no emotion, so the user's experience when listening to a story or interacting with a machine is poor. In addition, some speech synthesis methods can only synthesize story or dialogue text into speech with a single emotion and cannot control the emotion intensity of the synthesized speech, so the emotion of the synthesized speech may be too strong or too weak and its expressiveness is insufficient.
To solve the above problems, in this application the emotion code corresponding to the text is determined before the text is synthesized into speech. Because the emotion code indicates the emotion intensity of the synthesis, the speech generated from the text based on the emotion code carries more emotion and is therefore more human-like, which improves the user's listening experience.
In this application, the emotion code corresponding to the text is a code representing the emotion intensity required when the text is synthesized into speech. Many different emotion intensities are possible, and the code may be a vector of a preset dimension, such as a 2-dimensional, 3-dimensional or higher-dimensional vector, with different vector values corresponding to different emotion intensities.
For ease of understanding, assume the emotion code is a 3-dimensional vector in which the first dimension indicates happiness, the second indicates anger, the third indicates sadness, and the value of each dimension indicates the intensity of the corresponding emotion. Encoding different emotion intensities with this 3-dimensional vector then gives, for example: [0 0 0] is neutral, [1 0 0] is happy, [0 1 0] is angry, [0 0 1] is sad, [0 0 0.5] is weakly sad, and [0 0.5 0.5] is close to a mixture of sadness and anger.
In this application, the emotion code corresponding to the text may be determined in different ways, which are described in detail in the following embodiments and are not expanded here.
S103: and determining the voice synthesis parameters of the text based on the emotion encoding.
When text is converted into speech by speech synthesis technology, the speech synthesis parameters of the text must be determined. In this application, besides the ways of determining speech synthesis parameters in the prior art, the influence of the text's emotion code on those parameters is additionally taken into account.
As an embodiment, the speech synthesis parameters of the text may include a duration parameter and an acoustic parameter, wherein the duration parameter may include a duration characteristic of each text unit, the acoustic parameter may include an acoustic characteristic of each text unit, and the acoustic characteristic may include a spectral characteristic and a fundamental frequency characteristic. The text units may be phonemes, syllables, words, etc.
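Purely as an illustrative sketch (not part of the disclosed embodiment), the speech synthesis parameters described above can be pictured as one record per text unit; the field names below are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextUnitParams:
    """Hypothetical container for the synthesis parameters of one text unit."""
    unit: str              # the text unit itself, e.g. a phoneme, syllable or word
    duration_ms: float     # duration parameter predicted by the duration model
    spectral: List[float]  # spectral features predicted by the acoustic model
    f0: float              # fundamental-frequency feature predicted by the acoustic model

# The speech synthesis parameters of a text are then a sequence of such records,
# one per unit in the text unit sequence, which a vocoder can turn into speech.
sentence_params: List[TextUnitParams] = []
```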
S104: and performing voice synthesis processing on the voice synthesis parameters of the text to obtain voice corresponding to the text.
In this application, a vocoder may be used to perform speech synthesis processing on the speech synthesis parameters of the text to obtain the corresponding speech. Because the influence of the text's emotion code on the speech synthesis parameters is taken into account when those parameters are determined, and the emotion code characterizes the emotion intensity of the synthesis, the resulting parameters carry the expressiveness of that emotion intensity.
In the present application, several embodiments for determining emotion encoding corresponding to text are provided, which are specifically as follows:
as an implementation manner, in the present application, the text may be input into a text emotion code recognition model, so as to obtain an emotion code corresponding to the text, where the text emotion code recognition model is obtained by training a text with emotion recognition marked with emotion codes, and a specific structure of the text emotion code recognition model may be various neural network models, such as LSTM (long short-term memory network model).
In this way, the obtained emotion code can represent the user's emotion intensity, such as "a little happy", "very happy" or "mixed feelings". Specifically, during training, different vectors are used as the labels of different emotion intensities; for example, [0 0 0.5] may represent a weak sad emotion, and [0 0.5 0.5] may be close to a mixture of sadness and anger.
As another implementation manner, the emotion encoding corresponding to the text can be determined based on the following steps in the application, which are specifically as follows:
s201: and acquiring a preset initial emotion code corresponding to the text.
In the application, the emotion label corresponding to the text can be obtained first, and then the emotion code corresponding to the emotion label is determined based on the preset corresponding relation between the emotion label and the emotion code and used as the initial emotion code corresponding to the text.
In general, four emotion categories are commonly used: neutral, happy, sad and angry. Accordingly, four emotion labels may be preset in this application, and it is assumed that the preset emotion labels and emotion codes have the following correspondence:
Neutral [0 0 0]
Happy [1 0 0]
Anger [0 1 0]
Sad [0 0 1]
If the emotion label obtained for the text is "happy", the emotion code [1 0 0] corresponding to "happy" is taken as the initial emotion code of the text.
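For illustration only, the preset correspondence and lookup described above can be sketched as follows; the dictionary and function names are assumptions, and the codes follow the 3-dimensional scheme given above.

```python
# Preset correspondence between emotion labels and 3-dimensional emotion codes,
# laid out as [happy, anger, sad] as in the description above.
EMOTION_LABEL_TO_CODE = {
    "neutral": [0.0, 0.0, 0.0],
    "happy":   [1.0, 0.0, 0.0],
    "anger":   [0.0, 1.0, 0.0],
    "sad":     [0.0, 0.0, 1.0],
}

def initial_emotion_code(label: str) -> list:
    """Return the preset initial emotion code for a recognized emotion label."""
    return EMOTION_LABEL_TO_CODE[label]

# Example: a text whose emotion label is "happy" gets the initial code [1, 0, 0].
print(initial_emotion_code("happy"))
```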
To obtain the emotion label corresponding to the text, the text may be input into a text emotion recognition model, which is obtained by pre-training on emotion recognition training texts labeled with emotion label information.
It should be noted that, the specific structure of the text emotion recognition model may be various neural network models, for example, LSTM (long short-term memory network model).
For ease of understanding, refer to fig. 2, which is a schematic diagram of determining the emotion label corresponding to a text using an emotion recognition model according to an embodiment of the present application. As shown in the figure, each word in the sentence (W1, W2, ..., Wn in fig. 2) is encoded to obtain the encoding of each word (E1, E2, ..., En in fig. 2), the hidden-layer state of each word is then obtained (h1, h2, ..., hn in fig. 2), and finally the hidden-layer state hn of the last word is mapped to an emotion label.
Specifically, Word2Vec may be used to encode each word in the text, an LSTM model is used to obtain the hidden-layer state of each word, and a DNN (deep neural network) layer connected after the last hidden-layer state maps it to an emotion label.
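A minimal sketch of such a text emotion recognition model follows, assuming a PyTorch implementation in which a trainable embedding stands in for pre-trained Word2Vec vectors; the class name, layer sizes and four-way output are illustrative assumptions rather than the architecture specified by the patent.

```python
import torch
import torch.nn as nn

class TextEmotionRecognizer(nn.Module):
    """Word encoding -> LSTM hidden states -> DNN mapping the last state to an emotion label."""
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256, num_emotions: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # word encodings E1..En
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # hidden states h1..hn
        self.classifier = nn.Sequential(                               # DNN on the last hidden state hn
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_emotions),
        )

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embed(word_ids)           # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(emb)          # (batch, seq_len, hidden_dim)
        last_hidden = outputs[:, -1, :]      # hn of the last word
        return self.classifier(last_hidden)  # logits over the emotion labels

# Example usage with a batch of one 5-word sentence (word ids are placeholders).
model = TextEmotionRecognizer(vocab_size=10000)
logits = model(torch.randint(0, 10000, (1, 5)))
predicted_label_id = logits.argmax(dim=-1)
```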
S202: and determining the emotion code corresponding to the text based on the initial emotion code.
In this application, various implementations of determining the emotion code corresponding to the text based on the initial emotion code may be provided, as follows:
mode one: and taking the initial emotion code as the emotion code corresponding to the text.
In this embodiment, since the initial emotion encoding can already indicate the emotion intensity to some extent, the initial emotion encoding can be directly used as the emotion encoding corresponding to the text.
However, the number of preset emotion labels is limited, generally only the four categories neutral, happy, sad and angry, and the number of emotion codes is correspondingly limited. In reality, human emotion cannot be strictly divided into neutral, happy, sad and angry; emotions such as "a little happy", "very happy" or "mixed feelings" exist, so speech synthesized from the initial emotion code alone cannot represent the user's actual emotion intensity. To solve this problem, the following approach is proposed in this application:
mode two: acquiring emotion intensity information, wherein the emotion intensity information is used for indicating emotion intensity requirements of a user on synthesized voice; based on the emotion intensity information, adjusting the initial emotion encoding by using an interpolation method, wherein the adjusted emotion encoding is used as the emotion encoding corresponding to the text.
In this application, the emotion intensity information may be entered by the user; for example, the user may input emotion intensity information such as "a little happy", "very happy" or "mixed feelings". Alternatively, user options may be provided, with different options corresponding to different intensities, such as "weak emotion" and "strong emotion".
In this application, the initial emotion code may be adjusted by interpolation based on the emotion intensity information, and the adjusted emotion code is used as the emotion code corresponding to the text; the interpolation methods include interpolation and extrapolation. Adjusting the initial emotion code by interpolation means adjusting the range of its values, so that the adjusted emotion code can represent the emotion intensity information.
For ease of understanding, interpolation and extrapolation are described below in this application.
Assuming that emotion encoding is a 3-dimensional vector, where the first dimension represents happiness, the second dimension represents anger, and the third dimension represents sadness, the initial emotion encoding is as follows:
Neutral [0 0 0]
Happy [1 0 0]
Anger [0 1 0]
Sad [0 0 1]
The value of each dimension represents the intensity of the corresponding emotion, and a negative value represents the emotion opposite to that of the dimension.
As shown in fig. 3, fig. 3 is a schematic diagram of an interpolation method of 3-dimensional emotion encoding disclosed in an embodiment of the present application.
As can be seen from the figure, interpolation adjusts the values of the dimensions of the initial emotion code on the principle of not exceeding the values of the initial code; in this application, the adjusted value of each dimension does not exceed 1. For example, [0 0 0.5] may represent a weak sad emotion, and [0 0.5 0.5] may be close to a mixture of sadness and anger. Extrapolation adjusts the values of the dimensions on the principle of exceeding the values of the initial emotion code, i.e. in this application the adjusted value of a dimension exceeds 1; for example, [2 0 0] may represent a very happy emotion, and [0 2 2] may represent a very sad and angry emotion.
In this application, emotion codes represented by vectors of other dimensions may likewise be adjusted by interpolation or extrapolation. For example, with a 4-dimensional emotion code, [1 0 0 0], [0 1 0 0], [0 0 1 0] and [0 0 0 1] represent neutral, happy, sad and angry respectively; a code such as [0 0.5 0 0] indicates that the happiness is weak, and [0 0 0 1.5] indicates very strong anger.
Fig. 4 is a schematic diagram of interpolation of a 2-dimensional emotion code according to an embodiment of the present application. As can be seen from fig. 4, with a 2-dimensional emotion code, [0 0], [1 0], [-1 0] and [0 1] indicate neutral, happy, sad and angry respectively; [0.5 0] indicates a slightly weaker happy emotion, [-0.5 0.5] indicates a combined emotion of sadness and anger, [0 2] indicates a very angry emotion, and so on.
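Purely as an illustration of the adjustment step (not the patent's prescribed interface), the sketch below assumes the emotion intensity information can be reduced to a scalar scale factor: values below 1 interpolate toward neutral, values above 1 extrapolate beyond the preset code. The function name and the scalar interface are hypothetical.

```python
import numpy as np

def adjust_emotion_code(initial_code, intensity_scale: float):
    """Scale an initial emotion code toward (scale < 1) or beyond (scale > 1) its preset value.

    initial_code    : preset code such as [1, 0, 0] for "happy" in the 3-dimensional scheme
    intensity_scale : e.g. 0.5 for a weak emotion, 2.0 for a very strong one
    """
    return np.asarray(initial_code, dtype=float) * intensity_scale

# Interpolation: a weaker "sad" emotion.
print(adjust_emotion_code([0, 0, 1], 0.5))   # [0.  0.  0.5]
# Extrapolation: a very strong "happy" emotion.
print(adjust_emotion_code([1, 0, 0], 2.0))   # [2. 0. 0.]
# A mixed initial code can be adjusted the same way, e.g. weak sadness plus anger.
print(adjust_emotion_code([0, 1, 1], 0.5))   # [0.  0.5 0.5]
```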
In this application, a specific implementation manner of determining a speech synthesis parameter of the text based on the emotion encoding is disclosed, which may include:
s301: and acquiring a text unit sequence of the text.
In the present application, a text unit sequence of a text may be determined based on a text unit model and a prosody model, which are currently mature models, and thus will not be described in detail in the present application.
S302: and inputting the emotion codes and the text unit sequences into a fusion model to obtain voice synthesis parameters of the text output by the fusion model, wherein the fusion model is obtained by training the emotion codes of training voices and the text unit sequences of the texts corresponding to the training voices as training samples and the labeled voice synthesis parameters of the texts corresponding to the training voices as sample labels.
In this application, the emotion code of the training speech may be determined in the ways of determining an emotion code described above; refer to the description of the above embodiments. The text corresponding to the training speech can be obtained by existing speech recognition technology, and the labeled speech synthesis parameters of that text can be obtained by decoding the training speech; both are mature techniques and are not described in detail in this application.
It should be noted that, compared with the prior-art training process for a model that predicts speech synthesis parameters, training the fusion model adds the emotion code of the training speech as a training sample, so that the fusion model learns how the emotion code adjusts the speech synthesis parameters.
As an implementation manner, the embodiment of the application discloses a specific structure of a fusion model, which is specifically as follows:
referring to fig. 5, fig. 5 is a schematic diagram of a specific structure of a fusion model disclosed in an embodiment of the present application, and as can be seen from fig. 5, the fusion model may be composed of a duration model and an acoustic model, where the model structures of the duration model and the acoustic model are the same as those of the existing duration model and acoustic model, and all include an input layer, a full connection layer, a feature extraction layer and an output layer. The full connection layer and the feature extraction layer are provided with an activation function, and the feature extraction layer and the output layer are provided with an activation function.
It should be noted that, unlike existing duration and acoustic models, in this application the input of the duration model includes the emotion code in addition to the text unit sequence, and so does the input of the acoustic model. In addition, the adjustment weights of the emotion code on the activation functions of the two models are taken into account: the activation function between the fully connected layer and the feature extraction layer of the duration model gains a term formed by the emotion code and a first adjustment weight, the activation function between its feature extraction layer and output layer gains a term formed by the emotion code and a second adjustment weight, the activation function between the fully connected layer and the feature extraction layer of the acoustic model gains a term formed by the emotion code and a third adjustment weight, and the activation function between its feature extraction layer and output layer gains a term formed by the emotion code and a fourth adjustment weight.
For ease of understanding, assume the initial activation function between the fully connected layer and the feature extraction layer of both the duration model and the acoustic model is tanh(Wx + b). In this application, the activation function between the fully connected layer and the feature extraction layer of the duration model becomes tanh(Wx + b + V11·EC), and that of the acoustic model becomes tanh(Wx + b + V21·EC). Similarly, assume the initial activation function between the feature extraction layer and the output layer of both models is tanh(Wx + Wh·h + b); in this application, that of the duration model becomes tanh(Wx + Wh·h + b + V12·EC), and that of the acoustic model becomes tanh(Wx + Wh·h + b + V22·EC). Here EC denotes the emotion code, V11 is the first adjustment weight of the emotion code on the activation functions of the duration model, V12 is the second adjustment weight on the duration model, V21 is the third adjustment weight of the emotion code on the activation functions of the acoustic model, and V22 is the fourth adjustment weight on the acoustic model. W, Wh and b are parameters that the duration model and the acoustic model learn.
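The emotion-conditioned activation above can be illustrated with a short sketch (not part of the disclosed embodiment): a layer computing tanh(Wx + b + V·EC), where the emotion code EC shifts the pre-activation through a learned adjustment weight V. The use of PyTorch, the class name and the layer sizes are assumptions; the second activation form tanh(Wx + Wh·h + b + V·EC) is omitted for brevity.

```python
import torch
import torch.nn as nn

class EmotionConditionedLayer(nn.Module):
    """One fusion-model layer computing tanh(W x + b + V ec).

    W and b form the usual linear transform; V is the adjustment weight through
    which the emotion code ec shifts the activation, as described above.
    """
    def __init__(self, in_dim: int, out_dim: int, emotion_dim: int = 3):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)              # W x + b
        self.V = nn.Linear(emotion_dim, out_dim, bias=False)  # V ec (adjustment weight)

    def forward(self, x: torch.Tensor, ec: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.linear(x) + self.V(ec))

# Both the duration model and the acoustic model would stack such layers
# (fully connected -> feature extraction -> output), each layer receiving the
# same emotion code ec alongside the text-unit features.
layer = EmotionConditionedLayer(in_dim=64, out_dim=128, emotion_dim=3)
out = layer(torch.randn(1, 64), torch.tensor([[1.0, 0.0, 0.0]]))  # "happy" code
```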
During training, the emotion code of the training speech is kept unchanged. The emotion code of the training speech and the text unit sequence of the text corresponding to the training speech are used as training samples, and the labeled duration parameters and labeled acoustic parameters of that text are used as sample labels. Taking as the training target that the duration parameters output by the duration model approach the labeled duration parameters and the acoustic parameters output by the acoustic model approach the labeled acoustic parameters, the parameters of the fusion model are obtained by training, namely the first adjustment weight, the second adjustment weight, the third adjustment weight, the fourth adjustment weight, the parameters (W, Wh, b) of the duration model and the parameters (W, Wh, b) of the acoustic model.
Since the emotion code controls the emotion of the speech, and this control is similar for most speakers, the first, second, third and fourth adjustment weights can be trained using existing audio of large volume and good quality as the training speech. When the emotional voice of a specific speaker needs to be customized quickly, these four adjustment weights are kept unchanged, and only the parameters of the acoustic model and the duration model in the fusion model are obtained by training on that speaker's emotional speech corpus, so that the speech synthesis parameters output by the fusion model are the speech synthesis parameters of that speaker, which improves training efficiency.
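The quick-customization idea in the preceding paragraph can be pictured as a two-stage procedure: train everything, including the adjustment weights, on a large general corpus, then freeze those weights and fine-tune only the duration and acoustic model parameters on the specific speaker's emotional corpus. The sketch below assumes the PyTorch layer naming of the previous example and standard optimizer usage; it is illustrative only, not the patent's prescribed procedure.

```python
import torch
import torch.nn as nn

def prepare_speaker_finetuning(fusion_model: nn.Module, lr: float = 1e-4):
    """Freeze the emotion-code adjustment weights (the V projections in the sketch
    above) and build an optimizer over the remaining duration/acoustic model
    parameters, for fine-tuning on a specific speaker's emotional corpus."""
    for name, param in fusion_model.named_parameters():
        # Assumes the adjustment-weight layers are attributes named "V" as above.
        if ".V." in name or name.startswith("V."):
            param.requires_grad = False
    trainable = [p for p in fusion_model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```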
Based on the above fusion model, the present application provides a specific implementation manner of inputting the emotion code and the text unit sequence into the fusion model to obtain the speech synthesis parameters of the text output by the fusion model, where the implementation manner may include:
s401: and inputting the emotion codes and the text unit sequences into a duration model of a fusion model, and obtaining duration parameters in the voice synthesis parameters of the text based on the adjusting weights of the emotion codes to the activation functions of the duration model.
In this application, the adjustment weights of the emotion code on the activation functions of the duration model and the parameters of the duration model have already been obtained through the training described above. Therefore, after the emotion code and the text unit sequence are input into the duration model of the fusion model, they can be processed based on those adjustment weights and the parameters of the duration model to obtain the duration parameters among the speech synthesis parameters of the text.
S402: inputting the emotion codes and the text unit sequences into an acoustic model of a fusion model, and obtaining acoustic parameters in the voice synthesis parameters of the text based on the adjusting weights of the emotion codes to the activation functions of the acoustic model.
Similarly, the adjustment weights of the emotion code on the activation functions of the acoustic model and the parameters of the acoustic model have already been obtained through training, so after the emotion code and the text unit sequence are input into the acoustic model of the fusion model, they can be processed based on those adjustment weights and the parameters of the acoustic model to obtain the acoustic parameters among the speech synthesis parameters of the text.
The following describes a speech synthesis apparatus disclosed in the embodiments of the present application, and the speech synthesis apparatus described below and the speech synthesis method described above may be referred to correspondingly to each other.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application. As shown in fig. 6, the voice synthesizing apparatus may include:
an acquisition unit 11 for acquiring a text to be subjected to speech synthesis;
an emotion encoding determination unit 12, configured to determine an emotion encoding corresponding to the text, where the emotion encoding is used to indicate emotion intensity of speech synthesis;
a speech synthesis parameter determining unit 13 for determining speech synthesis parameters of the text based on the emotion encoding;
and the voice synthesis processing unit 14 is used for performing voice synthesis processing on the voice synthesis parameters of the text to obtain voice corresponding to the text.
Optionally, the emotion encoding determination unit includes:
the first determining unit is used for inputting the text into a text emotion code recognition model to obtain the emotion code corresponding to the text, wherein the text emotion code recognition model is obtained by pre-training on emotion recognition training texts labeled with emotion codes.
Optionally, the emotion encoding determination unit includes:
the initial emotion code acquisition unit is used for acquiring a preset initial emotion code corresponding to the text;
and the second determining unit is used for determining the emotion encoding corresponding to the text based on the initial emotion encoding.
Optionally, the second determining unit includes:
the first determining subunit is used for taking the initial emotion code as an emotion code corresponding to the text;
or, alternatively,
the emotion intensity information acquisition unit is used for acquiring emotion intensity information, wherein the emotion intensity information is used for indicating emotion intensity requirements of a user on the voice to be synthesized;
and the second determining subunit is used for adjusting the initial emotion encoding by utilizing an interpolation method based on the emotion intensity information, and the adjusted emotion encoding is used as the emotion encoding corresponding to the text.
Optionally, the initial emotion encoding acquisition unit includes:
the emotion label acquisition unit is used for acquiring emotion labels corresponding to the text;
the initial emotion code acquisition subunit is used for determining the emotion code corresponding to the emotion label based on the preset corresponding relation between the emotion label and the emotion code, and taking the emotion code as the initial emotion code corresponding to the text.
Optionally, the speech synthesis parameter determining unit includes:
a text unit sequence obtaining unit, configured to obtain a text unit sequence of the text;
the fusion model processing unit is used for inputting the emotion codes and the text unit sequences into a fusion model to obtain voice synthesis parameters of the texts output by the fusion model, wherein the fusion model is obtained by training the emotion codes of training voices and the text unit sequences of the texts corresponding to the training voices as training samples and the labeled voice synthesis parameters of the texts corresponding to the training voices as sample labels.
Optionally, the fusion model processing unit includes:
the time length model processing unit is used for inputting the emotion codes and the text unit sequences into a time length model of a fusion model, and obtaining time length parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the time length model;
the acoustic model processing unit is used for inputting the emotion codes and the text unit sequences into an acoustic model of a fusion model, and obtaining acoustic parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the acoustic model;
the adjusting weights of the emotion codes on the activating function of the time length model and the adjusting weights of the emotion codes on the activating function of the acoustic model are obtained by training by taking the time length parameter output by the time length model to approach the labeled time length parameter of the text corresponding to the training voice and the acoustic parameter output by the acoustic model to approach the labeled acoustic parameter of the text corresponding to the training voice as a training target.
It should be noted that the specific functional implementation of each unit has been described in detail in the method embodiments and is not repeated here.
Fig. 7 is a block diagram of a hardware structure of a speech synthesis apparatus according to an embodiment of the present application, and referring to fig. 7, the hardware structure of the speech synthesis apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the invention;
the memory 3 may comprise a high-speed RAM memory and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
acquiring a text to be subjected to voice synthesis;
determining emotion codes corresponding to the text, wherein the emotion codes are used for indicating emotion strength of voice synthesis;
determining speech synthesis parameters of the text based on the emotion encoding;
and performing voice synthesis processing on the voice synthesis parameters of the text to obtain voice corresponding to the text.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the application also provides a storage medium, which may store a program adapted to be executed by a processor, the program being configured to:
acquiring a text to be subjected to voice synthesis;
determining emotion codes corresponding to the text, wherein the emotion codes are used for indicating emotion strength of voice synthesis;
determining speech synthesis parameters of the text based on the emotion encoding;
and performing voice synthesis processing on the voice synthesis parameters of the text to obtain voice corresponding to the text.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method of speech synthesis, comprising:
acquiring a text to be subjected to voice synthesis;
acquiring a preset initial emotion code corresponding to the text;
determining emotion codes corresponding to the text based on the initial emotion codes, wherein the emotion codes are used for indicating emotion intensity of voice synthesis;
determining a sequence of text units of the text based on the text unit model and the prosody model;
inputting the emotion codes and the text unit sequences into a duration model of a fusion model, and obtaining duration parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the duration model;
inputting the emotion codes and the text unit sequences into an acoustic model of a fusion model, and obtaining acoustic parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the acoustic model; the adjusting weight of the emotion encoding on the activating function of the time length model and the adjusting weight of the emotion encoding on the activating function of the acoustic model are obtained by training by taking the time length parameter output by the time length model to approach the labeled time length parameter of the text corresponding to the training voice and the acoustic parameter output by the acoustic model to approach the labeled acoustic parameter of the text corresponding to the training voice as a training target; the fusion model is obtained by training a training speech emotion code and a text unit sequence of a training speech corresponding text as training samples and labeling speech synthesis parameters of the training speech corresponding text as sample labels;
performing voice synthesis processing on voice synthesis parameters of the text to obtain voice corresponding to the text;
the determining the emotion encoding corresponding to the text based on the initial emotion encoding comprises the following steps:
taking the initial emotion code as an emotion code corresponding to the text;
or, alternatively,
acquiring emotion intensity information, wherein the emotion intensity information is used for indicating emotion intensity requirements of a user on synthesized voice; based on the emotion intensity information, adjusting the initial emotion encoding by using an interpolation method, wherein the adjusted emotion encoding is used as the emotion encoding corresponding to the text.
2. The method of claim 1, wherein said determining the emotion encoding corresponding to the text comprises:
inputting the text into a text emotion code recognition model to obtain emotion codes corresponding to the text, wherein the text emotion code recognition model is obtained by pre-training an emotion recognition training text marked with emotion codes.
3. The method according to claim 1, wherein the obtaining the preset initial emotion encoding corresponding to the text includes:
acquiring an emotion label corresponding to the text;
and determining an emotion code corresponding to the emotion label based on a preset corresponding relation between the emotion label and the emotion code, and taking the emotion code as an initial emotion code corresponding to the text.
4. A speech synthesis apparatus, comprising:
an acquisition unit configured to acquire a text to be subjected to speech synthesis;
the emotion code determining unit is used for acquiring a preset initial emotion code corresponding to the text; determining emotion codes corresponding to the text based on the initial emotion codes, wherein the emotion codes are used for indicating emotion intensity of voice synthesis;
a speech synthesis parameter determining unit configured to determine a speech synthesis parameter of the text based on the emotion encoding;
the voice synthesis processing unit is used for carrying out voice synthesis processing on the voice synthesis parameters of the text to obtain voice corresponding to the text;
the emotion encoding determination unit is specifically configured to:
taking the initial emotion code as an emotion code corresponding to the text;
or, alternatively,
acquiring emotion intensity information, wherein the emotion intensity information is used for indicating emotion intensity requirements of a user on synthesized voice; based on the emotion intensity information, adjusting the initial emotion encoding by using an interpolation method, wherein the adjusted emotion encoding is used as the emotion encoding corresponding to the text;
the speech synthesis parameter determination unit includes:
a text unit sequence acquisition unit configured to determine a text unit sequence of the text based on a text unit model and a prosody model;
the fusion model processing unit is used for inputting the emotion codes and the text unit sequences into a fusion model to obtain voice synthesis parameters of the texts output by the fusion model, wherein the fusion model is obtained by training the emotion codes of training voices and the text unit sequences of the texts corresponding to the training voices as training samples and the labeled voice synthesis parameters of the texts corresponding to the training voices as sample labels;
the fusion model processing unit comprises:
the time length model processing unit is used for inputting the emotion codes and the text unit sequences into a time length model of a fusion model, and obtaining time length parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the time length model;
the acoustic model processing unit is used for inputting the emotion codes and the text unit sequences into an acoustic model of a fusion model, and obtaining acoustic parameters in voice synthesis parameters of the text based on the adjusting weights of the emotion codes on the activation functions of the acoustic model;
the adjusting weights of the emotion codes on the activating function of the time length model and the adjusting weights of the emotion codes on the activating function of the acoustic model are obtained by training by taking the time length parameter output by the time length model to approach the labeled time length parameter of the text corresponding to the training voice and the acoustic parameter output by the acoustic model to approach the labeled acoustic parameter of the text corresponding to the training voice as a training target.
5. A speech synthesis apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the respective steps of the speech synthesis method according to any one of claims 1 to 3.
6. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the speech synthesis method according to any one of claims 1 to 3.
CN201911393613.5A 2019-12-30 2019-12-30 Speech synthesis method, related device and readable storage medium Active CN111128118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911393613.5A CN111128118B (en) 2019-12-30 2019-12-30 Speech synthesis method, related device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911393613.5A CN111128118B (en) 2019-12-30 2019-12-30 Speech synthesis method, related device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111128118A CN111128118A (en) 2020-05-08
CN111128118B true CN111128118B (en) 2024-02-13

Family

ID=70504839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911393613.5A Active CN111128118B (en) 2019-12-30 2019-12-30 Speech synthesis method, related device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111128118B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270920A (en) * 2020-10-28 2021-01-26 北京百度网讯科技有限公司 Voice synthesis method and device, electronic equipment and readable storage medium
CN112489621B (en) * 2020-11-20 2022-07-12 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN112541078A (en) * 2020-12-10 2021-03-23 平安科技(深圳)有限公司 Intelligent news broadcasting method, device, equipment and storage medium
CN112786004A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech synthesis method, electronic device, and storage device
CN112786005B (en) * 2020-12-30 2023-12-01 科大讯飞股份有限公司 Information synthesis method, apparatus, electronic device, and computer-readable storage medium
CN113096640A (en) * 2021-03-08 2021-07-09 北京达佳互联信息技术有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN113112987A (en) * 2021-04-14 2021-07-13 北京地平线信息技术有限公司 Speech synthesis method, and training method and device of speech synthesis model
CN114927122A (en) * 2022-05-16 2022-08-19 网易(杭州)网络有限公司 Emotional voice synthesis method and synthesis device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1461463A (en) * 2001-03-09 2003-12-10 索尼公司 Voice synthesis device
CN101176146A (en) * 2005-05-18 2008-05-07 松下电器产业株式会社 Speech synthesizer
CN102385858A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Emotional voice synthesis method and system
CN107958433A (en) * 2017-12-11 2018-04-24 吉林大学 A kind of online education man-machine interaction method and system based on artificial intelligence
CN108615524A (en) * 2018-05-14 2018-10-02 平安科技(深圳)有限公司 A kind of phoneme synthesizing method, system and terminal device
CN109754779A (en) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing
CN109949791A (en) * 2019-03-22 2019-06-28 平安科技(深圳)有限公司 Emotional speech synthesizing method, device and storage medium based on HMM
CN110379409A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing

Also Published As

Publication number Publication date
CN111128118A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111128118B (en) Speech synthesis method, related device and readable storage medium
KR102401512B1 (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning
CN106773923B (en) Multi-mode emotion data interaction method and device for robot
CN111489734A (en) Model training method and device based on multiple speakers
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN106486121B (en) Voice optimization method and device applied to intelligent robot
US11763797B2 (en) Text-to-speech (TTS) processing
JP2007249212A (en) Method, computer program and processor for text speech synthesis
US11289082B1 (en) Speech processing output personalization
CN112786004A (en) Speech synthesis method, electronic device, and storage device
CN112102811B (en) Optimization method and device for synthesized voice and electronic equipment
CN112802444B (en) Speech synthesis method, device, equipment and storage medium
Nose et al. HMM-based style control for expressive speech synthesis with arbitrary speaker's voice using model adaptation
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
WO2008147649A1 (en) Method for synthesizing speech
CN115905485A (en) Common-situation conversation method and system based on common-sense self-adaptive selection
CN116917984A (en) Interactive content output
US11282495B2 (en) Speech processing using embedding data
US20230360633A1 (en) Speech processing techniques
JP2003233388A (en) Device and method for speech synthesis and program recording medium
CN114582317B (en) Speech synthesis method, training method and device of acoustic model
Matsumoto et al. Speech-like emotional sound generation using wavenet
KR0134707B1 (en) Voice synthesizer
CN117238275B (en) Speech synthesis model training method and device based on common sense reasoning and synthesis method
US11915690B1 (en) Automatic speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant