CN114283781A - Speech synthesis method and related device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114283781A
Authority
CN
China
Prior art keywords
sample
prosody
speech
voice
local
Prior art date
Legal status
Pending
Application number
CN202111650035.6A
Other languages
Chinese (zh)
Inventor
王瑾薇
胡亚军
江源
Current Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111650035.6A
Publication of CN114283781A

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a speech synthesis method, a related device, an electronic device and a storage medium. The speech synthesis method includes: acquiring a text to be synthesized, a first voice attribute and a second voice attribute, where the first voice attribute includes at least one of an emotion category and a style category, and the second voice attribute includes a speaker identifier; acquiring a global prosodic feature having the first voice attribute, and predicting a local prosodic feature based on the text to be synthesized, the first voice attribute and the second voice attribute, where the global prosodic feature contains sentence-level prosodic feature information and the local prosodic feature contains word-level prosodic feature information; and synthesizing speech based on the text to be synthesized, the global prosodic feature and the local prosodic feature. With this scheme, speech with different prosody can be synthesized freely, improving adaptability to different scenes.

Description

Speech synthesis method and related device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, a related apparatus, an electronic device, and a storage medium.
Background
Speech synthesis technology is a branch of the artificial intelligence research field. It converts text information into audible sound information, i.e., it enables a machine to speak like a human.
At present, although speech synthesis technology can generate synthesized speech that approaches natural speech, the synthesized speech often exhibits only the average prosody of the training database, so it is difficult to adapt to different scenes such as novel reading, news, customer service and anchoring. For example, in an intelligent interaction scene, existing intelligent customer service cannot respond to the speaker's emotion: even if the user is very angry, the synthesized voice still answers in an unchanged tone. Likewise, in a novel-reading scene, dialogue lines of different characters with different emotions are all synthesized in the same flat tone. In view of the above, how to freely synthesize voices with different prosody has become an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a speech synthesis method, a related device, an electronic device and a storage medium, which can freely synthesize voices with different prosody and improve adaptability to different scenes.
In order to solve the above technical problem, a first aspect of the present application provides a speech synthesis method, including: acquiring a text to be synthesized, a first voice attribute and a second voice attribute; the first voice attribute comprises at least one of emotion category and style category, and the second voice attribute comprises speaker identification; acquiring global prosodic features with first voice attributes, and predicting based on the text to be synthesized, the first voice attributes and the second voice attributes to obtain local prosodic features; the global prosodic features comprise sentence-level prosodic feature information, and the local prosodic features comprise word-level prosodic feature information; and synthesizing based on the text to be synthesized, the global prosody features and the local prosody features to obtain synthesized voice.
In order to solve the above technical problem, a second aspect of the present application provides a speech synthesis apparatus, including: an acquisition module, a global prosody module, a local prosody module and a synthesis module, wherein the acquisition module is used for acquiring a text to be synthesized, a first voice attribute and a second voice attribute; the first voice attribute comprises at least one of an emotion category and a style category, and the second voice attribute comprises a speaker identifier; the global prosody module is used for acquiring global prosodic features having the first voice attribute, and the local prosody module is used for predicting local prosodic features based on the text to be synthesized, the first voice attribute and the second voice attribute; the global prosodic features comprise sentence-level prosodic feature information, and the local prosodic features comprise word-level prosodic feature information; the synthesis module is used for synthesizing based on the text to be synthesized, the global prosodic features and the local prosodic features to obtain synthesized voice.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the speech synthesis method in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being for implementing the speech synthesis method in the first aspect.
In the above scheme, a text to be synthesized, a first voice attribute and a second voice attribute are obtained, where the first voice attribute includes at least one of an emotion category and a style category, and the second voice attribute includes a speaker identifier. Global prosodic features having the first voice attribute are acquired, and local prosodic features are predicted based on the text to be synthesized, the first voice attribute and the second voice attribute, where the global prosodic features contain sentence-level prosodic feature information and the local prosodic features contain word-level prosodic feature information. Synthesis is then performed based on the text to be synthesized, the global prosodic features and the local prosodic features to obtain synthesized voice. On one hand, because sentence-level prosodic feature information is referenced during synthesis, the synthesized voice as a whole can conform to the first voice attribute; on the other hand, because word-level prosodic features are further referenced during synthesis, subtle local prosodic variations of the synthesized voice can be controlled. Therefore, voices with different prosody can be synthesized accurately and freely, and adaptability to different scenes is improved.
Drawings
FIG. 1 is a schematic flow chart diagram of an embodiment of a speech synthesis method of the present application;
FIG. 2 is a schematic diagram of a process for training an embodiment of a global prosody extraction network;
FIG. 3 is a schematic diagram of an embodiment of prosodic conversion;
FIG. 4 is a schematic diagram of a process for training an embodiment of a local prosody extraction network;
FIG. 5 is a schematic diagram of a process for training an embodiment of a local prosody prediction network;
FIG. 6 is a process diagram of an embodiment of transfer learning;
FIG. 7 is a schematic diagram of a process for training an embodiment of a prosodic coding network and a text coding network;
FIG. 8 is a schematic diagram of a process for training an embodiment of a synthesis network;
FIG. 9 is a process diagram of one embodiment of emotional style migration;
FIG. 10 is a block diagram of an embodiment of a speech synthesis apparatus according to the present application;
FIG. 11 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 12 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a speech synthesis method according to an embodiment of the present application.
Specifically, the method may include the steps of:
step S11: and acquiring the text to be synthesized, the first voice attribute and the second voice attribute.
In an embodiment of the present disclosure, the first voice attribute includes at least one of an emotion category and a style category. For example, the first voice attribute may include only emotion categories; alternatively, the first voice attributes may include only genre categories; alternatively, to improve the applicability of speech synthesis, the first speech attribute may include an emotion category and a style category, which are not limited herein.
In one implementation scenario, the emotion category may be selected from several preset emotion categories. Taking an interaction scene as an example, emotion recognition may be performed on the user's voice to obtain a target emotion category of the user's voice; on that basis, an emotion category may be selected from the several preset emotion categories as the first voice attribute, and the synthesized voice obtained through the subsequent steps is used to respond to the user's voice. For example, if emotion recognition on the user's voice yields the target emotion category "angry", the preset emotion category "apologetic" may be selected. Other cases can be deduced by analogy and are not enumerated here. Furthermore, the several preset emotion categories may include, but are not limited to: comforting, cute, doting, playful, encouraging, apologetic, and the like. The emotional fluctuation of these preset emotion categories is relatively mild, so that in various scenes, even if the wrong emotion category is selected from the several preset emotion categories, the impact on the user experience can be reduced as much as possible. For example, if the text to be synthesized is an ordinary inquiry whose correct emotion category is "neutral" but the selected emotion category is "playful", the user experience is not harmed and the user may even obtain a better interactive experience, which greatly improves the fault tolerance of emotion category selection. Of course, the several preset emotion categories may also include, but are not limited to: sadness, happiness, anger, surprise, questioning, and the like, which are not limited herein.
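Purely as an illustrative sketch outside the disclosure, the selection rule described above can be expressed as a lookup from the recognized user emotion to a preset response emotion category; the mapping table, function name and default style value below are assumptions used only for illustration:

# Hypothetical mapping from the user's recognized emotion to a preset
# response emotion category; the entries are illustrative assumptions.
RESPONSE_EMOTION = {
    "angry": "apologetic",
    "sad": "comforting",
    "happy": "playful",
    "neutral": "neutral",
}

def select_first_speech_attribute(user_emotion: str,
                                  style: str = "customer_service") -> dict:
    """Pick the first speech attribute (emotion + style) for the reply."""
    emotion = RESPONSE_EMOTION.get(user_emotion, "neutral")  # mild fallback category
    return {"emotion": emotion, "style": style}

print(select_first_speech_attribute("angry"))
# {'emotion': 'apologetic', 'style': 'customer_service'}

Because the preset response emotions are all mild, an imperfect mapping of this kind degrades gracefully, which matches the fault-tolerance argument above.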
In one implementation scenario, the style category may be selected from several preset style categories. Specifically, the style category may be selected according to the usage scenario. Further, style categories may include, but are not limited to: novel, news, customer service, interaction, spoken language, and the like, which are not limited herein.
In this embodiment of the disclosure, the second voice attribute includes a speaker identifier, it should be noted that different speakers have different speaker identifiers, and the speaker identifier is used to distinguish timbres of different speakers, so that in a voice synthesis process, a synthesized voice with an expected timbre can be synthesized according to the speaker identifier.
In one implementation scenario, the text to be synthesized may be entered in advance. Taking a novel-reading scene as an example, texts such as narration and dialogue can be input in advance as texts to be synthesized; alternatively, taking a news scene as an example, a news text may be input in advance as the text to be synthesized. Other cases can be deduced by analogy and are not enumerated here.
In one implementation scenario, the text to be synthesized may also be obtained by prediction. Taking a customer service scene as an example, a corresponding response text may be generated based on the user's interaction text and used as the text to be synthesized; the user's interaction text may be typed by the user or obtained by recognizing the user's voice. In addition, a question-answering model based on, for example, an Encoder-Decoder structure may be used to predict the corresponding response text from the user's interaction text. Other cases can be deduced by analogy and are not enumerated here.
Step S12: and acquiring global prosodic features with first voice attributes, and predicting based on the text to be synthesized, the first voice attributes and the second voice attributes to obtain local prosodic features.
In the disclosed embodiments, the global prosodic features comprise sentence-level prosodic feature information. That is, the global prosodic features can reflect the prosodic information of an entire sentence, so the overall prosody of the synthesized speech can be controlled through the global prosodic features.
In an implementation scenario, the synthesized speech may be synthesized based on a speech synthesis model. The speech synthesis model may be trained based on sample data, where the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech. The global prosodic feature may be obtained based on sample global prosodic features of reference sample speech, and the reference sample speech is sample speech having the first speech attribute. In this manner, the global prosodic feature can be obtained directly from the sample data used to train the synthesis model, which improves the convenience of extracting the global prosodic feature.
In a specific implementation scenario, the sample speech attributes may include a sample first speech attribute and a sample second speech attribute: the sample first speech attribute may include at least one of the emotion category and the style category of the sample speech, and the sample second speech attribute may include the speaker identifier of the sample speech (i.e., the unique identifier of the speaker who uttered it). For the specific meanings of the emotion category and the style category, reference may be made to the foregoing description, which is not repeated here. It should be noted that, unlike the first speech attribute and the second speech attribute described above, the sample first speech attribute and the sample second speech attribute are labeled according to the actual condition of the sample speech. For example, if the emotion actually conveyed by a sample speech is "neutral" and the scene in which it was actually produced is "news", the sample first speech attributes "neutral" and "news" may be labeled for it; if the speaker who actually uttered the sample speech is speaker No. 1, the sample second speech attribute "001" may be labeled for it. Other cases can be deduced by analogy and are not enumerated here.
In a specific implementation scenario, the speech synthesis model may include a global prosody extraction network, where the global prosody extraction network is used to assist in obtaining the global prosodic features; details of the extraction process are not repeated herein. It should be noted that, after the global prosody extraction network is trained, the global prosodic features it extracts can be decoupled from the speaker and the speaking content. Specifically, please refer to fig. 2, which is a schematic diagram of a process for training an embodiment of the global prosody extraction network. As shown in fig. 2, a first sample global prosodic feature of the sample speech may be extracted. Illustratively, the global prosody extraction network may perform global prosody extraction on the acoustic features of the sample speech to obtain the first sample global prosodic feature. It should be noted that, in the embodiments of the present disclosure, for ease of distinction, global prosodic features extracted from different data at different stages are given different names; in a real scenario they may all be represented by vectors of a preset dimension. On this basis, a first sample synthesized speech is synthesized based on the first sample global prosodic feature and the sample text corresponding to the sample speech, prediction is performed based on the first sample global prosodic feature to obtain a predicted first speech attribute of the sample speech, and a second sample global prosodic feature of the first sample synthesized speech is extracted. As shown in fig. 2, the speech synthesis model may further include a synthesis network comprising a phoneme encoder and a decoder: the phoneme sequence of the sample text is encoded by the phoneme encoder to obtain phoneme features, the first sample global prosodic feature is scaled by a weighting factor and input to the decoder together with the phoneme features for decoding, acoustic features are synthesized, and the first sample synthesized speech can be further synthesized from these acoustic features; the specific synthesis process is not described herein again. In addition, similar to the sample first speech attribute, the predicted first speech attribute may include at least one of a predicted emotion category and a predicted style category of the sample speech; their specific meanings may refer to the foregoing description of emotion categories and style categories and are not repeated here. On this basis, at least the network parameters of the global prosody extraction network may be adjusted based on the difference between the predicted first speech attribute and the sample first speech attribute, and the difference between the first sample global prosodic feature and the second sample global prosodic feature. Specifically, during training, the classification difference between the predicted first speech attribute and the sample first speech attribute may be minimized, and the distribution difference between the first sample global prosodic feature and the second sample global prosodic feature may be minimized.
Illustratively, the classification difference between the predicted first speech attribute and the sample first speech attribute may be measured by a loss function such as cross entropy, and the distribution difference between the first sample global prosodic feature and the second sample global prosodic feature may be measured by a loss function such as mean square error. For the specific difference measurement, reference may be made to the technical details of loss functions such as cross entropy and mean square error, which are not repeated here. In addition, the global prosody extraction network may include, but is not limited to, an AE (Auto-Encoder), a VAE (Variational Auto-Encoder), a CVAE (Conditional Variational Auto-Encoder), GST (Global Style Tokens), and the like, which are not limited herein. It should be noted that, while the network parameters of the global prosody extraction network are adjusted, the network parameters of the synthesis network can also be adjusted. Of course, the network parameters of the global prosody extraction network may alternatively be adjusted based only on the difference between the predicted first speech attribute and the sample first speech attribute until the global prosody extraction network converges, and on this basis the network parameters of the synthesis network may then be adjusted based on the difference between the first sample global prosodic feature and the second sample global prosodic feature to train the synthesis network until convergence; the training manner is not limited here. In the above manner, the classification loss constraint on the speech attribute and the distribution loss constraint on the prosodic features allow the global prosody extraction network to decouple, during feature extraction, from information such as the speaker and the text, so that the extracted global prosodic features represent only feature information such as emotion and style, which improves the accuracy of the global prosodic features.
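The two constraints described above can be outlined with a minimal PyTorch-style sketch; the module structures, dimensions and the stand-in decoder below are assumptions for illustration and are not taken from the disclosure:

import torch
import torch.nn as nn
import torch.nn.functional as F

N_MELS, PROS_DIM, PHON_DIM, N_CLASSES = 80, 128, 256, 4   # illustrative sizes

class GlobalProsodyExtractor(nn.Module):
    """Sentence-level prosody vector from acoustic features (mel frames)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, PROS_DIM, batch_first=True)
    def forward(self, mels):            # mels: (B, T, N_MELS)
        _, h = self.rnn(mels)
        return h[-1]                    # (B, PROS_DIM)

class ToyDecoder(nn.Module):
    """Stand-in for the phoneme-encoder/decoder synthesis path."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PHON_DIM + PROS_DIM, N_MELS)
    def forward(self, phon, g, weight=1.0):   # phon: (B, L, PHON_DIM)
        g = (weight * g).unsqueeze(1).expand(-1, phon.size(1), -1)
        return self.proj(torch.cat([phon, g], dim=-1))   # (B, L, N_MELS)

extractor, decoder = GlobalProsodyExtractor(), ToyDecoder()
classifier = nn.Linear(PROS_DIM, N_CLASSES)       # predicts the first speech attribute
opt = torch.optim.Adam(list(extractor.parameters()) +
                       list(decoder.parameters()) +
                       list(classifier.parameters()), lr=1e-3)

mels = torch.randn(8, 200, N_MELS)                # acoustic features of the sample speech
phon = torch.randn(8, 60, PHON_DIM)               # encoded phoneme sequence of the sample text
attr = torch.randint(0, N_CLASSES, (8,))          # labelled sample first speech attribute

g1 = extractor(mels)                              # first sample global prosodic feature
synth = decoder(phon, g1)                         # first sample synthesized speech (acoustics)
g2 = extractor(synth)                             # second sample global prosodic feature

loss = F.cross_entropy(classifier(g1), attr) + F.mse_loss(g1, g2)
opt.zero_grad(); loss.backward(); opt.step()

In this toy step the cross-entropy term plays the role of the classification loss on the first speech attribute, and the mean-square term plays the role of the distribution loss between the two global prosodic features.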
In a specific implementation scenario, the global prosodic feature having the first speech attribute may be obtained after the global prosody extraction network has been trained to convergence. Specifically, the sample global prosodic features of the reference sample speeches may be fused to obtain the global prosodic feature. Exemplarily, sample speeches whose labeled sample speech attribute matches the first speech attribute may be obtained, the trained global prosody extraction network may be used to extract the corresponding sample global prosodic feature from each of them, and these sample global prosodic features may then be processed by averaging, sampling, or the like to obtain the global prosodic feature having the first speech attribute. Alternatively, the sample global prosodic feature of a first target sample speech may be used as the global prosodic feature, where the first target sample speech is selected from the reference sample speeches and the text similarity between the sample text corresponding to the first target sample speech and the text to be synthesized satisfies a first condition. For example, the text similarity between the text to be synthesized and the sample text corresponding to each reference sample speech may be calculated, and the reference sample speech with the highest text similarity may be taken as the first target sample speech; that is, the first condition may be set as having the highest text similarity. On this basis, the trained global prosody extraction network can extract the global prosodic feature having the first speech attribute from the first target sample speech. It should be noted that the text similarity may be obtained by processing the text to be synthesized and the sample text with a text matching model, which may be obtained by fine-tuning a large-scale unsupervised pre-trained network on the downstream task, or may be calculated with rules such as template matching, which is not limited herein. Alternatively, the sample global prosodic feature of a second target sample speech may be used as the global prosodic feature; similarly to the first target sample speech, the second target sample speech is selected from the reference sample speeches, and its presentation strength with respect to the first speech attribute satisfies a second condition.
For example, the trained global prosody extraction network may be used to extract the sample global prosodic feature of each reference sample speech; on this basis, classification prediction may be performed on the sample global prosodic feature of each reference sample speech to obtain a prediction probability value of each reference sample speech for the emotion category (and/or the style category) in the first speech attribute. The reference sample speech corresponding to the highest prediction probability value can then be taken as the second target sample speech; that is, the presentation strength with respect to the first speech attribute can be measured by the prediction probability value of each reference sample speech for the emotion category (and/or the style category) in the first speech attribute, and the second condition may specifically be set as having the highest presentation strength (i.e., the highest prediction probability value) for the first speech attribute. In the above manner, the global prosodic feature is obtained by fusing the sample global prosodic features of the reference sample speeches, or the sample global prosodic feature of the first target sample speech is used as the global prosodic feature, or the sample global prosodic feature of the second target sample speech is used as the global prosodic feature; the first target sample speech and the second target sample speech are both selected from the reference sample speeches, the text similarity between the sample text corresponding to the first target sample speech and the text to be synthesized satisfies the first condition, and the presentation strength of the second target sample speech with respect to the first speech attribute satisfies the second condition, thereby improving the flexibility of acquiring the global prosodic feature.
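A small sketch of the three selection strategies described above, assuming the sample global prosodic features of the reference sample speeches are already stacked in one tensor and that the similarity and probability scores are precomputed; all names are illustrative, not from the disclosure:

import torch

def pick_global_prosody(sample_feats, text_sim=None, attr_prob=None, mode="mean"):
    """sample_feats: (N, D) global prosodic features of the reference sample speeches.
    text_sim: (N,) text similarity to the text to be synthesized (first condition).
    attr_prob: (N,) predicted probability for the first speech attribute (second condition)."""
    if mode == "mean":                         # fuse by averaging
        return sample_feats.mean(dim=0)
    if mode == "best_text":                    # first target sample speech
        return sample_feats[text_sim.argmax()]
    if mode == "best_strength":                # second target sample speech
        return sample_feats[attr_prob.argmax()]
    raise ValueError(mode)

feats = torch.randn(10, 128)
print(pick_global_prosody(feats).shape)                                          # torch.Size([128])
print(pick_global_prosody(feats, text_sim=torch.rand(10), mode="best_text").shape)
print(pick_global_prosody(feats, attr_prob=torch.rand(10), mode="best_strength").shape)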
In the disclosed embodiments, the local prosodic features include word-level prosodic feature information. That is, the local prosodic features can reflect the prosodic information of individual words, so fine-grained local prosodic variation can be driven by the local prosodic features. It should be noted that, in the embodiments of the present disclosure, the local prosodic features may be obtained from the text to be synthesized, the first voice attribute and the second voice attribute either in a prosody-conversion-based manner or in a transfer-learning-based manner, as needed; the two manners, prosody conversion and transfer learning, are described below.
In one implementation scenario, please refer to fig. 3, which is a schematic process diagram of an embodiment of prosody conversion. As shown in fig. 3, voice library A is used as the source, and the emotion and style of voice library A can be transferred to the timbre of voice library B. Specifically, the speaker ID (i.e., speaker identifier) of the source and the corresponding emotion category and style category are used to perform prosody conversion and generate local prosodic features; speech synthesis is then performed in combination with the global prosodic features, and the speaker ID of voice library B is referenced in the synthesis process, so that a synthesized speech is generated that has the target timbre (i.e., the speaker timbre of voice library B) and conforms to the prosody of the source (i.e., the emotion and style of voice library A). Therefore, in the training process, only the local prosody prediction network of the target speaker needs to be trained. Of course, in order to further widen the application range of speech synthesis, a local prosody prediction network may also be trained for each speaker, so that when the target speaker changes, only the local prosody prediction network of the corresponding speaker needs to be selected, and no retraining is needed.
Specifically, as described above, the synthesized speech may be synthesized based on a speech synthesis model, and the speech synthesis model may further include a local prosody prediction network for directly predicting the local prosodic features based on the text to be synthesized, the first speech attribute, and the second speech attribute. In this manner, the local prosodic features are obtained by direct prediction with the local prosody prediction network, which improves the convenience of obtaining them.
In a specific implementation scenario, the speech synthesis model may further include a local prosody extraction network in addition to the global prosody extraction network. As described above, the speech synthesis model may be trained based on sample data, where the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and the local prosody prediction network may be trained based on the sample data after the local prosody extraction network has been trained to convergence. It should be noted that, similar to the global prosody extraction network, the local prosody extraction network and the local prosody prediction network may include, but are not limited to: AE, VAE, CVAE, GST, and the like, which are not limited herein. In this manner, the converged local prosody extraction network can assist the subsequent training of the local prosody prediction network, which is beneficial to improving the accuracy of the local prosody prediction network.
In a specific implementation scenario of training the local prosody extraction network, please refer to fig. 4, which is a schematic process diagram of an embodiment of training the local prosody extraction network. As shown in fig. 4, a first sample local prosodic feature of the sample speech may be extracted, and a first text feature of the sample text may be extracted. It should be noted that the local prosody extraction network may be used to extract the first sample local prosodic feature of the sample speech, while the first text feature of the sample text may be extracted with a pre-trained language model such as BERT (Bidirectional Encoder Representations from Transformers); the technical details of such pre-trained language models are not repeated here. On this basis, prediction can be performed based on the first sample local prosodic feature to obtain a predicted second speech attribute of the sample speech, a second sample synthesized speech can be synthesized based on the first sample local prosodic feature and the sample text corresponding to the sample speech, and a second sample local prosodic feature of the second sample synthesized speech can be extracted. As described above, the speech synthesis model may include a synthesis network, and the first sample local prosodic feature together with the sample text corresponding to the sample speech may be input into the synthesis network to obtain the second sample synthesized speech; the specific process may refer to the foregoing description of the first sample synthesized speech and is not repeated here. Meanwhile, the local prosody extraction network can be used to perform local prosody extraction on the second sample synthesized speech to obtain the second sample local prosodic feature. In addition, in the embodiments of the present disclosure, for ease of distinction, the local prosodic features extracted by the local prosody extraction network from different data at different stages are given different names; in a real scenario they can all be represented by vectors of a preset dimension. On this basis, at least the network parameters of the local prosody extraction network may be adjusted based on the difference between the predicted second speech attribute and the sample second speech attribute, the difference between the first sample local prosodic feature and the first text feature, and the difference between the first sample local prosodic feature and the second sample local prosodic feature. Specifically, the classification difference between the predicted second speech attribute and the sample second speech attribute can be maximized, the distribution difference between the first sample local prosodic feature and the second sample local prosodic feature can be minimized, and the mutual information between the first sample local prosodic feature and the first text feature can be minimized, so that the local prosodic features extracted by the local prosody extraction network are decoupled from speaker information and text information.
Further, in a real scenario, the classification difference between the predicted second speech attribute and the sample second speech attribute may be measured by a loss function such as cross entropy, and the distribution difference between the first sample local prosodic feature and the second sample local prosodic feature may be measured by a loss function such as mean square error. Further, during training, the local prosody extraction network may first be adjusted based on the difference between the predicted second speech attribute and the sample second speech attribute and the difference between the first sample local prosodic feature and the first text feature until convergence; on this basis, the network parameters of the synthesis network can then be adjusted based on the difference between the first sample local prosodic feature and the second sample local prosodic feature. Training gradually in two stages in this way reduces the training difficulty as much as possible and improves the training efficiency. In the above manner, the constraints jointly ensure that the extracted local prosodic features are decoupled from speaker information and text information and contain only prosody-related feature information such as fundamental-frequency fluctuation and pronunciation duration, which is beneficial to improving the accuracy of the local prosodic features.
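A simplified sketch of the loss composition for this stage is given below. The speaker term is made adversarial by negating a cross-entropy loss, and the mutual-information term is replaced by a crude decorrelation proxy; both choices are stated assumptions, since the disclosure names the objectives but not the estimators:

import torch
import torch.nn as nn
import torch.nn.functional as F

B, L, D = 8, 60, 128                     # batch, word-level length, feature dim (illustrative)
N_SPK = 10

local_prosody_1 = torch.randn(B, L, D, requires_grad=True)   # first sample local prosodic feature
local_prosody_2 = torch.randn(B, L, D)                        # re-extracted from the synthesized speech
text_feat = torch.randn(B, L, D)                              # first text feature (e.g. BERT output)
spk_label = torch.randint(0, N_SPK, (B,))                     # sample second speech attribute

spk_head = nn.Linear(D, N_SPK)

# 1) adversarial speaker term: push speaker information out of the prosody feature
spk_logits = spk_head(local_prosody_1.mean(dim=1))
loss_spk = -F.cross_entropy(spk_logits, spk_label)            # maximize the classification difference

# 2) distribution consistency between extracted and re-extracted local prosody
loss_dist = F.mse_loss(local_prosody_1, local_prosody_2)

# 3) crude stand-in for mutual-information minimization against the text feature
cos = F.cosine_similarity(local_prosody_1, text_feat, dim=-1)
loss_mi = cos.pow(2).mean()

loss = loss_spk + loss_dist + loss_mi
loss.backward()
print(float(loss))

In practice the speaker head and the extractor would be optimized in opposite directions (for example through a gradient-reversal layer), and a trainable mutual-information estimator could replace the decorrelation proxy; the single negated term above only illustrates the sign of the constraint.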
In a specific implementation scenario, the local prosody prediction network may be trained after the local prosody extraction network has converged. Referring to fig. 5, fig. 5 is a schematic diagram illustrating a process of training an embodiment of the local prosody prediction network. In the training process, a third sample local prosodic feature of the sample speech can be extracted with the local prosody extraction network, and the local prosody prediction network can predict a predicted local prosodic feature based on the sample text corresponding to the sample speech and the sample speech attribute labeled for the sample speech. Specifically, the local prosody prediction network may take the phoneme sequence and the text information of the sample text as input, extract semantic information from the text information through a pre-trained language model, and fuse the semantic information with the phoneme sequence through alignment coding (including but not limited to an attention mechanism, an autoregressive model, and the like) to obtain the predicted local prosodic feature. In this process, the third sample local prosodic feature extracted by the converged local prosody extraction network can be used as the target to constrain the local prosody prediction network; by minimizing the difference between the third sample local prosodic feature and the predicted local prosodic feature, the two are drawn as close as possible, which improves the accuracy of the local prosody prediction network. Specifically, the mutual information between the third sample local prosodic feature and the predicted local prosodic feature may be calculated, and the network parameters of the local prosody prediction network may be adjusted by maximizing this mutual information during training. In the above manner, the local prosody prediction network learns to predict local prosodic features directly from the text and the speech attributes.
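An illustrative sketch of the prediction network at this stage, assuming the alignment coding is a single cross-attention layer from the phoneme sequence to the pre-trained language-model features, and assuming the target constraint is realized as a simple L2 distance (the disclosure alternatively mentions maximizing mutual information); module names, shapes and identifier layout are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalProsodyPredictor(nn.Module):
    """Predict word/phoneme-level prosody from text semantics and speech attributes."""
    def __init__(self, d=128, n_attr=16):
        super().__init__()
        self.attr_emb = nn.Embedding(n_attr, d)          # emotion/style/speaker ids
        self.align = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.out = nn.Linear(d, d)

    def forward(self, phon_feat, text_feat, attr_ids):
        # phon_feat: (B, Lp, d) phoneme features; text_feat: (B, Lt, d) language-model features
        bias = self.attr_emb(attr_ids).sum(dim=1, keepdim=True)       # (B, 1, d)
        fused, _ = self.align(phon_feat + bias, text_feat, text_feat) # alignment coding
        return self.out(fused)                                        # predicted local prosodic feature

pred_net = LocalProsodyPredictor()
opt = torch.optim.Adam(pred_net.parameters(), lr=1e-3)

phon_feat = torch.randn(4, 60, 128)
text_feat = torch.randn(4, 40, 128)          # from a pre-trained language model such as BERT
attr_ids = torch.randint(0, 16, (4, 3))      # [emotion id, style id, speaker id], illustrative
target = torch.randn(4, 60, 128)             # third sample local prosodic feature (frozen extractor)

loss = F.mse_loss(pred_net(phon_feat, text_feat, attr_ids), target)
opt.zero_grad(); loss.backward(); opt.step()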
In one implementation scenario, please refer to fig. 6, which is a schematic process diagram of an embodiment of transfer learning. As shown in fig. 6, unlike the prosody conversion described above, transfer learning uses mixed data of multiple speakers with multiple styles and multiple emotions, and the generation of local prosodic features is controlled by a speaker ID (i.e., speaker identifier), an emotion ID (i.e., emotion category), and a style ID (i.e., style category). The target speaker ID (i.e., the target speaker identifier) is combined with different emotion IDs and style IDs to generate local prosodic features, so that the emotion and style of the target speaker can be controlled simply by switching these IDs, realizing speech synthesis with different emotions in different styles and facilitating integration and flexible control in an engine development system.
Specifically, the synthesized speech may be synthesized based on a speech synthesis model that includes a prosody coding network and a text coding network. The text coding network is configured to predict a speech attribute feature based on the text to be synthesized and the first speech attribute, where the speech attribute feature is independent of the speaker; the prosody coding network is configured to predict the local prosodic features based on the speech attribute feature, the text to be synthesized, and the second speech attribute. It should be noted that the prosody coding network may include, but is not limited to, a CVAE and the like, which is not limited herein. Because the speech attribute feature is independent of the speaker, speech attributes such as emotion, style, and speaker can be further decoupled through the prosody coding network, which is beneficial to strengthening the control over emotion and style and improving the distinctiveness of emotion and style.
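A schematic forward pass of this decomposition, with both networks reduced to small feed-forward modules; the shapes and class names are assumptions used only to show how the speaker-independent attribute feature is passed between the two networks:

import torch
import torch.nn as nn

D = 128

class TextCodingNet(nn.Module):
    """Text + first speech attribute -> speaker-independent speech attribute feature."""
    def __init__(self, n_emotion=8, n_style=8):
        super().__init__()
        self.emo = nn.Embedding(n_emotion, D)
        self.sty = nn.Embedding(n_style, D)
        self.mix = nn.GRU(D, D, batch_first=True)
    def forward(self, text_feat, emotion_id, style_id):
        cond = (self.emo(emotion_id) + self.sty(style_id)).unsqueeze(1)
        out, _ = self.mix(text_feat + cond)
        return out                                      # (B, L, D), no speaker information

class ProsodyCodingNet(nn.Module):
    """Attribute feature + text + second speech attribute -> local prosodic feature."""
    def __init__(self, n_speaker=32):
        super().__init__()
        self.spk = nn.Embedding(n_speaker, D)
        self.proj = nn.Linear(2 * D, D)
    def forward(self, attr_feat, text_feat, speaker_id):
        cond = self.spk(speaker_id).unsqueeze(1).expand_as(text_feat)
        return self.proj(torch.cat([attr_feat + cond, text_feat], dim=-1))

text_feat = torch.randn(2, 40, D)                          # encoded text to be synthesized
attr_feat = TextCodingNet()(text_feat, torch.tensor([3, 1]), torch.tensor([0, 2]))
local_prosody = ProsodyCodingNet()(attr_feat, text_feat, torch.tensor([5, 7]))
print(local_prosody.shape)        # torch.Size([2, 40, 128])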
In a specific implementation scenario, as described above, the speech synthesis model may further include a local prosody extraction network; for its network structure and training process, reference may be made to the foregoing description, which is not repeated here. In addition, the speech synthesis model can be trained based on sample data, where the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and the prosody coding network and the text coding network are trained based on the sample data after the local prosody extraction network has been trained to convergence. The converged local prosody extraction network thus assists the training of the prosody coding network and the text coding network, which improves their accuracy.
In one specific implementation scenario, please refer to fig. 7, which is a schematic process diagram of an embodiment of training the prosody coding network and the text coding network. As shown in fig. 7, during training of the prosody coding network and the text coding network, a fourth sample local prosodic feature of the sample speech may be extracted first; the aforementioned local prosody extraction network can be used for this extraction. On this basis, a sample speech attribute feature can be extracted from the fourth sample local prosodic feature, and a predicted speech attribute feature can be obtained by prediction based on the sample text and the sample first speech attribute. Specifically, the prosody coding network may be used to extract the sample speech attribute feature from the fourth sample local prosodic feature, and the sample text and the sample first speech attribute may be input into the text coding network to obtain the predicted speech attribute feature. On this basis, a fifth sample local prosodic feature may be obtained by prediction based on the sample text, the sample second speech attribute, and the predicted speech attribute feature; specifically, these may be input into the prosody coding network to obtain the fifth sample local prosodic feature. The network parameters of the prosody coding network and the text coding network may then be adjusted based on the difference between the sample speech attribute feature and the predicted speech attribute feature and the difference between the fourth sample local prosodic feature and the fifth sample local prosodic feature. Specifically, the distribution difference between the sample speech attribute feature and the predicted speech attribute feature may be minimized, and the distribution difference between the fourth sample local prosodic feature and the fifth sample local prosodic feature may be minimized; in a real scenario, both distribution differences may be measured by a loss function such as mean square error.
In the above manner, a fourth sample local prosodic feature of the sample speech is extracted, a sample speech attribute feature is extracted from it, a predicted speech attribute feature is obtained by prediction based on the sample text and the sample first speech attribute, and a fifth sample local prosodic feature is predicted based on the sample text, the sample second speech attribute, and the predicted speech attribute feature. On this basis, the network parameters of the prosody coding network and the text coding network are adjusted based on the difference between the sample speech attribute feature and the predicted speech attribute feature and the difference between the fourth and fifth sample local prosodic features, so that speech attributes such as emotion, style, and speaker can be further decoupled, which strengthens the control over emotion and style and improves their distinctiveness.
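The two distribution constraints of this stage, written out as a short sketch on precomputed tensors (feature extraction itself is omitted and the tensor names are illustrative):

import torch
import torch.nn.functional as F

sample_attr_feat = torch.randn(4, 40, 128)                      # from the fourth sample local prosodic feature
pred_attr_feat = torch.randn(4, 40, 128, requires_grad=True)    # from the text coding network
local_prosody_4 = torch.randn(4, 40, 128)                       # fourth sample local prosodic feature
local_prosody_5 = torch.randn(4, 40, 128, requires_grad=True)   # fifth, from the prosody coding network

loss = (F.mse_loss(pred_attr_feat, sample_attr_feat) +
        F.mse_loss(local_prosody_5, local_prosody_4))
loss.backward()      # gradients flow back into the prosody coding and text coding networks
print(float(loss))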
It should be noted that, once the global prosody extraction network, the local prosody prediction network, the prosody coding network, the text coding network, and the like have been trained, any of these networks can be selectively adopted for speech synthesis according to actual needs; the following description may be referred to specifically and is not repeated here.
Step S13: and synthesizing based on the text to be synthesized, the global prosody features and the local prosody features to obtain synthesized voice.
In one implementation scenario, as described above, the synthesized speech may be synthesized based on a speech synthesis model that includes a global prosody extraction network, a local prosody extraction network, and a synthesis network. The speech synthesis model is trained based on sample data, where the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and the synthesis network is trained based on the sample data after the global prosody extraction network and the local prosody extraction network have each been trained to convergence. Training the synthesis network after both prosody extraction networks have converged improves the speech synthesis effect.
In one specific implementation scenario, please refer to fig. 8, which is a schematic process diagram of an embodiment of training the synthesis network. As illustrated in fig. 8, a third sample global prosodic feature and a sixth sample local prosodic feature of the sample speech may be extracted. Specifically, the global prosody extraction network may be used to perform prosody extraction on the sample speech to obtain the third sample global prosodic feature, and the local prosody extraction network may be used to perform prosody extraction on the sample speech to obtain the sixth sample local prosodic feature. On this basis, a sample synthesized speech can be synthesized based on the sample text corresponding to the sample speech, the third sample global prosodic feature, and the sixth sample local prosodic feature, and at least the network parameters of the synthesis network are adjusted based on the difference between the sample speech and the sample synthesized speech. It should be noted that the sample text, the third sample global prosodic feature, and the sixth sample local prosodic feature may be input to the synthesis network to obtain the sample synthesized speech; in addition, as shown in fig. 8, the third sample global prosodic feature may be scaled by a weighting factor during synthesis. Further, during training, the network parameters of the local prosody extraction network can be adjusted at the same time. It should be noted that, because high-quality style and emotion data require performance by professionals, data recording is costly, yet synthesized speech with prosodic variation is needed for different speakers. In this scheme, since the global prosodic features and the local prosodic features are decoupled from the speaker, the prosody of one speaker can be transferred to a speaker who lacks such prosody, so that a speaker without emotion data can exhibit the varied emotional expression of an emotional speaker, which greatly reduces the demand for high-quality emotion and style data.
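A compact sketch of the synthesis-network training step, assuming the two prosody extractors are already trained and frozen and abstracting the synthesis network as a module that consumes phoneme features, a weighted global prosody vector, and local prosodic features; all names, dimensions and the matched toy lengths are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

N_MELS, D = 80, 128

class SynthesisNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.dec = nn.GRU(3 * D, D, batch_first=True)
        self.to_mel = nn.Linear(D, N_MELS)
    def forward(self, phon_feat, global_pros, local_pros, weight=1.0):
        g = (weight * global_pros).unsqueeze(1).expand(-1, phon_feat.size(1), -1)
        x = torch.cat([phon_feat, g, local_pros], dim=-1)
        out, _ = self.dec(x)
        return self.to_mel(out)                       # predicted acoustic features

synth = SynthesisNet()
opt = torch.optim.Adam(synth.parameters(), lr=1e-3)

phon_feat = torch.randn(4, 60, D)                     # encoded sample text
global_pros = torch.randn(4, D)                       # third sample global prosodic feature (frozen extractor)
local_pros = torch.randn(4, 60, D)                    # sixth sample local prosodic feature (frozen extractor)
target_mels = torch.randn(4, 60, N_MELS)              # acoustic features of the sample speech (toy: lengths aligned)

loss = F.mse_loss(synth(phon_feat, global_pros, local_pros, weight=0.8), target_mels)
opt.zero_grad(); loss.backward(); opt.step()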
In a specific implementation scenario, after the synthesis network has been trained, the text to be synthesized, the global prosodic feature, and the local prosodic feature may be input into the synthesis network to obtain the synthesized speech. Referring to fig. 8, the global prosodic feature may be scaled by the weighting factor and then input into the synthesis network together with the local prosodic feature and the text to be synthesized; the synthesized speech obtained in this way has the emotion category and style category defined by the first speech attribute and the timbre corresponding to the speaker identifier defined by the second speech attribute.
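For example, the weighting factor can be exposed as a simple intensity knob applied to the global prosodic feature before it enters the synthesis network; the interface below is hypothetical and shown only to illustrate the scaling step:

import torch

def scale_global_prosody(global_pros: torch.Tensor, weight: float) -> torch.Tensor:
    """Scale the sentence-level prosodic feature; larger weights emphasize
    the emotion/style rendering, smaller weights flatten it (illustrative behaviour)."""
    return weight * global_pros

g = torch.randn(128)
for w in (0.5, 1.0, 1.5):                # mild, nominal, exaggerated prosody
    print(w, scale_global_prosody(g, w).norm().item())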
In an implementation scenario, different from the aforementioned synthesis from a text to be synthesized, when a speech to be migrated is acquired, timbre migration may be performed on it to obtain a synthesized speech with the timbre of a target speaker while keeping the speech attributes (such as emotion and style) and the speaking content of the original speech to be migrated. Referring to fig. 9, fig. 9 is a process diagram of an embodiment of emotion and style migration. As shown in fig. 9, after the speech to be migrated is processed by the global prosody extraction network and the local prosody extraction network, its global prosodic feature and local prosodic feature are obtained; the text corresponding to the speech to be migrated, these global and local prosodic features, and the speaker identifier of the target speaker are then input into the synthesis network to obtain the synthesized speech, which retains the original prosody and content of the speech to be migrated but has the timbre of the target speaker.
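A schematic of the migration path, with all networks reduced to stand-in callables so that only the data flow of fig. 9 is shown; every module here is a placeholder assumption, not the patented implementation:

import torch

# Stand-in callables for the trained networks (placeholders only).
global_extract = lambda mels: mels.mean(dim=1)[:, :128]          # sentence-level prosody, (B, 128)
local_extract = lambda mels: torch.randn(mels.size(0), 60, 128)  # word-level prosody
synthesize = lambda text_feat, g, l, spk_id: torch.randn(text_feat.size(0), 200, 80)

speech_to_migrate = torch.randn(1, 300, 256)       # acoustic features of the source speech
text_feat = torch.randn(1, 60, 128)                # text corresponding to the source speech
target_speaker = torch.tensor([7])                 # speaker identifier of the target timbre

g = global_extract(speech_to_migrate)              # keeps sentence-level emotion/style
l = local_extract(speech_to_migrate)               # keeps word-level prosody detail
migrated = synthesize(text_feat, g, l, target_speaker)   # same prosody/content, target timbre
print(migrated.shape)                              # torch.Size([1, 200, 80])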
In the above scheme, a text to be synthesized, a first voice attribute and a second voice attribute are obtained, where the first voice attribute includes at least one of an emotion category and a style category, and the second voice attribute includes a speaker identifier. Global prosodic features having the first voice attribute are acquired, and local prosodic features are predicted based on the text to be synthesized, the first voice attribute and the second voice attribute, where the global prosodic features contain sentence-level prosodic feature information and the local prosodic features contain word-level prosodic feature information. Synthesis is then performed based on the text to be synthesized, the global prosodic features and the local prosodic features to obtain synthesized voice. On one hand, because sentence-level prosodic feature information is referenced during synthesis, the synthesized voice as a whole can conform to the first voice attribute; on the other hand, because word-level prosodic features are further referenced during synthesis, subtle local prosodic variations of the synthesized voice can be controlled. Therefore, voices with different prosody can be synthesized accurately and freely, and adaptability to different scenes is improved.
Referring to fig. 10, fig. 10 is a schematic diagram of a framework of an embodiment of the speech synthesis apparatus 100 of the present application. The speech synthesis apparatus 100 includes an acquisition module 101, a global prosody module 102, a local prosody module 103, and a synthesis module 104. The acquisition module 101 is used for acquiring a text to be synthesized, a first voice attribute, and a second voice attribute, where the first voice attribute includes at least one of an emotion category and a style category, and the second voice attribute includes a speaker identifier. The global prosody module 102 is configured to obtain a global prosodic feature having the first voice attribute, and the local prosody module 103 is configured to perform prediction based on the text to be synthesized, the first voice attribute, and the second voice attribute to obtain a local prosodic feature; the global prosodic feature includes sentence-level prosodic feature information, and the local prosodic feature includes word-level prosodic feature information. The synthesis module 104 is configured to synthesize the text to be synthesized, the global prosodic feature, and the local prosodic feature to obtain a synthesized speech.
According to the above scheme, on the one hand, sentence-level prosodic feature information is referenced in the synthesis process, so the synthesized speech as a whole can be controlled to conform to the first voice attribute; on the other hand, word-level prosodic features are further referenced in the synthesis process, so the subtle local prosodic variation of the synthesized speech can be controlled. Therefore, speech with different prosody can be synthesized accurately and freely, improving adaptability to different scenarios.
In some disclosed embodiments, the synthesized speech is synthesized by a speech synthesis model trained on sample data, where the sample data includes sample speech labeled with sample voice attributes and the sample text corresponding to the sample speech. The global prosody feature is obtained based on the sample global prosody feature of a reference sample speech, which is a sample speech having the first voice attribute.
In this way, the synthesized speech is produced by a speech synthesis model trained on sample data, where the sample data includes sample speech labeled with sample voice attributes and the sample text corresponding to the sample speech, and the global prosody feature is obtained based on the sample global prosody feature of a reference sample speech, i.e. a sample speech having the first voice attribute. In other words, the global prosody feature can be obtained directly from the sample data used for training the synthesis model, which improves the convenience of obtaining the global prosody feature.
In some disclosed embodiments, the global prosody module 102 includes a first obtaining sub-module configured to fuse the sample global prosody features of the reference sample speeches to obtain the global prosody feature; a second obtaining sub-module configured to use the sample global prosody feature of a first target sample speech as the global prosody feature; and a third obtaining sub-module configured to use the sample global prosody feature of a second target sample speech as the global prosody feature. Both the first target sample speech and the second target sample speech are selected from the reference sample speeches; the text similarity between the sample text corresponding to the first target sample speech and the text to be synthesized satisfies a first condition, and the presentation intensity of the second target sample speech with respect to the first voice attribute satisfies a second condition.
Therefore, the global prosody feature is obtained either by fusing the sample global prosody features of the reference sample speeches, or by using the sample global prosody feature of the first target sample speech, or by using the sample global prosody feature of the second target sample speech. Both target sample speeches are selected from the reference sample speeches: the text similarity between the sample text corresponding to the first target sample speech and the text to be synthesized satisfies the first condition, and the presentation intensity of the second target sample speech with respect to the first voice attribute satisfies the second condition.
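As a rough illustration of the three selection strategies, the sketch below assumes the reference samples' global prosody features, texts and presentation-intensity labels are already available. The fusion-by-averaging, the text-similarity function and the intensity scores are assumptions, since the patent does not fix how the first and second conditions are evaluated.

```python
import torch

def select_global_prosody(ref_feats, ref_texts, target_text, ref_intensity,
                          strategy, text_sim_fn=None):
    """ref_feats: list of per-utterance global prosody tensors (same shape)."""
    if strategy == "fuse":
        # Fuse the sample global prosody features (averaging is one option).
        return torch.stack(ref_feats).mean(dim=0)
    if strategy == "text_similarity":
        # First condition: choose the reference whose sample text is most
        # similar to the text to be synthesized (text_sim_fn is hypothetical).
        scores = [text_sim_fn(t, target_text) for t in ref_texts]
        return ref_feats[max(range(len(scores)), key=scores.__getitem__)]
    if strategy == "intensity":
        # Second condition: choose the reference whose presentation intensity
        # of the first voice attribute satisfies the condition (here: maximal).
        return ref_feats[max(range(len(ref_intensity)),
                             key=ref_intensity.__getitem__)]
    raise ValueError(f"unknown strategy: {strategy}")
```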
In some disclosed embodiments, the speech synthesis model includes a global prosody extraction network for extracting sample global prosody features, and the sample voice attributes include at least a sample first voice attribute. The speech synthesis apparatus 100 includes a global extraction network training module, which includes: a first sample global prosody extraction sub-module, configured to extract a first sample global prosody feature of the sample speech; a first sample synthesis sub-module, configured to synthesize a first sample synthesized speech based on the first sample global prosody feature and the sample text corresponding to the sample speech; a first prediction sub-module, configured to perform prediction based on the first sample global prosody feature to obtain a predicted first voice attribute of the sample speech; a second sample global prosody extraction sub-module, configured to extract a second sample global prosody feature of the first sample synthesized speech; and a first adjusting sub-module, configured to adjust at least the network parameters of the global prosody extraction network based on the difference between the predicted first voice attribute and the sample first voice attribute and the difference between the first sample global prosody feature and the second sample global prosody feature.
In this way, the speech synthesis model includes a global prosody extraction network for extracting sample global prosody features, and the sample voice attributes include at least a sample first voice attribute. During training, a first sample global prosody feature of the sample speech is extracted; a first sample synthesized speech is synthesized based on the first sample global prosody feature and the sample text corresponding to the sample speech; prediction is performed based on the first sample global prosody feature to obtain a predicted first voice attribute of the sample speech; and a second sample global prosody feature is extracted from the first sample synthesized speech. On this basis, at least the network parameters of the global prosody extraction network are adjusted based on the difference between the predicted first voice attribute and the sample first voice attribute and the difference between the first sample global prosody feature and the second sample global prosody feature. Through the classification-loss constraint on the voice attribute and the distribution-loss constraint on the prosody features, the global prosody extraction network can be decoupled from speaker and text information during feature extraction, so that the extracted global prosody feature represents only feature information such as emotion and style, which improves the accuracy of the global prosody feature.
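A hedged, PyTorch-style sketch of one such training step follows. The module interfaces are assumptions; cross-entropy stands in for the classification-loss constraint on the first voice attribute, and a simple MSE consistency term stands in for the distribution-loss constraint between the two global prosody features.

```python
import torch
import torch.nn.functional as F

def global_extractor_step(extractor, attr_classifier, synthesizer, optimizer,
                          sample_speech, sample_text, sample_attr1):
    """One training step; sample_attr1 is a class-index tensor (emotion/style)."""
    g1 = extractor(sample_speech)            # first sample global prosody feature
    synth = synthesizer(sample_text, g1)     # first sample synthesized speech
    attr_logits = attr_classifier(g1)        # predicted first voice attribute
    g2 = extractor(synth)                    # second sample global prosody feature

    # Classification constraint on emotion/style plus a consistency term
    # between the prosody extracted before and after synthesis.
    loss = F.cross_entropy(attr_logits, sample_attr1) + F.mse_loss(g1, g2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```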
In some disclosed embodiments, the synthesized speech is synthesized based on a speech synthesis model, and the speech synthesis model includes a local prosody prediction network for directly predicting local prosody features based on the text to be synthesized, the first speech attribute, and the second speech attribute.
In this way, the local prosody feature is directly predicted by the local prosody prediction network from the text to be synthesized, the first voice attribute and the second voice attribute, which improves the convenience of obtaining the local prosody feature.
In some disclosed embodiments, the speech synthesis model further includes a local prosody extraction network, the speech synthesis model is trained based on sample data, the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and the local prosody prediction network is trained based on the sample data after the local prosody extraction network training converges.
In this way, the speech synthesis model further includes a local prosody extraction network, and the speech synthesis model is trained based on sample data that includes sample speech labeled with sample voice attributes and the sample text corresponding to the sample speech. The local prosody prediction network is trained based on the sample data after the training of the local prosody extraction network has converged, so the converged local prosody extraction network can assist the training of the local prosody prediction network, which improves the accuracy of the local prosody prediction network.
In some disclosed embodiments, the sample voice attributes include at least a sample second voice attribute, and the speech synthesis apparatus 100 includes a local extraction network training module, which includes: a first sample local prosody extraction sub-module, configured to extract a first sample local prosody feature of the sample speech; a first text feature extraction sub-module, configured to extract a first text feature of the sample text; a second prediction sub-module, configured to perform prediction based on the first sample local prosody feature to obtain a predicted second voice attribute of the sample speech; a second sample synthesis sub-module, configured to synthesize a second sample synthesized speech based on the first sample local prosody feature and the sample text corresponding to the sample speech; a second sample local prosody extraction sub-module, configured to extract a second sample local prosody feature of the second sample synthesized speech; and a second adjusting sub-module, configured to adjust at least the network parameters of the local prosody extraction network based on the difference between the predicted second voice attribute and the sample second voice attribute, the difference between the first sample local prosody feature and the first text feature, and the difference between the first sample local prosody feature and the second sample local prosody feature.
In this way, the sample voice attributes include at least a sample second voice attribute. During training, a first sample local prosody feature of the sample speech and a first text feature of the sample text are extracted; prediction is performed based on the first sample local prosody feature to obtain a predicted second voice attribute of the sample speech; a second sample synthesized speech is synthesized based on the first sample local prosody feature and the sample text corresponding to the sample speech; and a second sample local prosody feature is extracted from the second sample synthesized speech. On this basis, at least the network parameters of the local prosody extraction network are adjusted based on the difference between the predicted second voice attribute and the sample second voice attribute, the difference between the first sample local prosody feature and the first text feature, and the difference between the first sample local prosody feature and the second sample local prosody feature. The local prosody features extracted by the local prosody extraction network can thus be decoupled from speaker information and text information, so that they contain only feature information related to prosody, such as fundamental-frequency fluctuation and pronunciation duration, which helps improve the accuracy of the local prosody features.
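The sketch below illustrates one possible training step. How each of the three differences enters the objective (for example, whether the speaker-attribute difference is applied adversarially to strip speaker information) is not pinned down by the text above, so the sketch simply sums three placeholder loss terms; the module interfaces and the aligned word-level feature shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def local_extractor_step(extractor, text_encoder, speaker_classifier,
                         synthesizer, optimizer,
                         sample_speech, sample_text, sample_attr2):
    """One training step; sample_attr2 is a speaker-index tensor."""
    l1 = extractor(sample_speech)            # first sample local prosody feature
    t1 = text_encoder(sample_text)           # first text feature (word level)
    spk_logits = speaker_classifier(l1)      # predicted second voice attribute
    synth = synthesizer(sample_text, l1)     # second sample synthesized speech
    l2 = extractor(synth)                    # second sample local prosody feature

    # Three placeholder terms mirroring the three differences described above;
    # l1 and t1 are assumed to share the same word-level shape.
    loss = (F.cross_entropy(spk_logits, sample_attr2)
            + F.mse_loss(l1, t1)
            + F.mse_loss(l1, l2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```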
In some disclosed embodiments, the speech synthesis apparatus 100 includes a local prediction network training module, which includes: a third sample local prosody extraction sub-module, configured to extract a third sample local prosody feature of the sample speech based on the local prosody extraction network; a local prosody prediction sub-module, configured to perform prediction with the local prosody prediction network on the sample text corresponding to the sample speech and the sample voice attributes labeled on the sample speech to obtain a predicted local prosody feature; and a third adjusting sub-module, configured to adjust the network parameters of the local prosody prediction network based on the difference between the third sample local prosody feature and the predicted local prosody feature.
In this way, the third sample local prosody feature of the sample speech is extracted based on the local prosody extraction network, and prediction is performed by the local prosody prediction network on the sample text corresponding to the sample speech and the sample voice attributes labeled on the sample speech to obtain a predicted local prosody feature. On this basis, the network parameters of the local prosody prediction network are adjusted based on the difference between the third sample local prosody feature and the predicted local prosody feature, which improves the accuracy of the local prosody prediction network.
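A minimal sketch of this distillation-style step, assuming the converged local prosody extraction network is kept frozen and an MSE loss is used to measure the difference between the extracted and predicted local prosody features:

```python
import torch
import torch.nn.functional as F

def local_predictor_step(frozen_extractor, predictor, optimizer,
                         sample_speech, sample_text, sample_attrs):
    """Train the local prosody prediction network against the frozen extractor."""
    with torch.no_grad():
        target = frozen_extractor(sample_speech)    # third sample local prosody
    pred = predictor(sample_text, sample_attrs)     # predicted local prosody
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```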
In some disclosed embodiments, the synthesized speech is synthesized based on a speech synthesis model, and the speech synthesis model includes a prosody coding network and a text coding network, the text coding network is configured to predict a speech attribute feature based on the text to be synthesized and the first speech attribute, the speech attribute feature is independent of the speaker, and the prosody coding network is configured to predict a local prosody feature based on the speech attribute feature, the text to be synthesized, and the second speech attribute.
In this way, the text coding network is used to predict the speech attribute feature from the text to be synthesized and the first voice attribute, and the prosody coding network is used to predict the local prosody feature from the speech attribute feature, the text to be synthesized and the second voice attribute. Since the speech attribute feature is speaker-independent, voice attributes such as emotion, style and speaker can be further decoupled through the prosody coding network, which helps improve the ability to control emotion and style and makes them more distinguishable.
In some disclosed embodiments, the speech synthesis model further includes a local prosody extraction network, the speech synthesis model is trained based on sample data, the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and the prosody coding network and the text coding network are trained based on the sample data after the local prosody extraction network training converges.
In this way, the prosody coding network and the text coding network are trained after the training of the local prosody extraction network has converged, so the converged local prosody extraction network can assist their training, which improves the accuracy of the prosody coding network and the text coding network.
In some disclosed embodiments, the sample voice attributes include a sample first voice attribute and a sample second voice attribute, and the speech synthesis apparatus 100 includes a coding network training module, which includes: a fourth sample local prosody extraction sub-module, configured to extract a fourth sample local prosody feature of the sample speech; a sample speech attribute extraction sub-module, configured to extract a sample speech attribute feature from the fourth sample local prosody feature; a speech attribute prediction sub-module, configured to predict a predicted speech attribute feature based on the sample text and the sample first voice attribute; a fifth sample local prosody prediction sub-module, configured to predict a fifth sample local prosody feature based on the sample text, the sample second voice attribute and the predicted speech attribute feature; and a fourth adjusting sub-module, configured to adjust the network parameters of the prosody coding network and the text coding network based on the difference between the sample speech attribute feature and the predicted speech attribute feature and the difference between the fourth sample local prosody feature and the fifth sample local prosody feature.
In this way, a fourth sample local prosody feature of the sample speech is extracted, and a sample speech attribute feature is extracted from the fourth sample local prosody feature. A predicted speech attribute feature is obtained based on the sample text and the sample first voice attribute, and a fifth sample local prosody feature is then predicted based on the sample text, the sample second voice attribute and the predicted speech attribute feature. On this basis, the network parameters of the prosody coding network and the text coding network are adjusted based on the difference between the sample speech attribute feature and the predicted speech attribute feature and the difference between the fourth sample local prosody feature and the fifth sample local prosody feature. Voice attributes such as emotion, style and speaker can thus be further decoupled, which improves the ability to control emotion and style and helps make them more distinguishable.
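The following sketch shows one way this joint training step could look. The helper that extracts the sample speech attribute feature from the fourth sample local prosody feature, the frozen treatment of the local prosody extraction network, and the two MSE losses are all assumptions of the sketch rather than details given by the patent.

```python
import torch
import torch.nn.functional as F

def coding_networks_step(frozen_local_extractor, attr_feat_extractor,
                         text_coding_net, prosody_coding_net, optimizer,
                         sample_speech, sample_text, sample_attr1, sample_attr2):
    """One joint training step for the text and prosody coding networks."""
    with torch.no_grad():
        l4 = frozen_local_extractor(sample_speech)   # fourth sample local prosody
        target_attr_feat = attr_feat_extractor(l4)   # sample speech attribute feature

    # Text coding network: speaker-independent speech attribute feature.
    pred_attr_feat = text_coding_net(sample_text, sample_attr1)
    # Prosody coding network: fifth sample local prosody feature.
    l5 = prosody_coding_net(sample_text, sample_attr2, pred_attr_feat)

    loss = (F.mse_loss(pred_attr_feat, target_attr_feat)
            + F.mse_loss(l5, l4))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```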
In some disclosed embodiments, the synthesized speech is synthesized based on a speech synthesis model, the speech synthesis model includes a global prosody extraction network, a local prosody extraction network, and a synthesis network, the speech synthesis model is trained based on sample data, the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and the synthesis network is trained based on the sample data after the global prosody extraction network and the local prosody extraction network are respectively trained and converged.
In this way, the synthesis network is further trained after the global prosody extraction network and the local prosody extraction network have each been trained to convergence, which improves the speech synthesis effect.
In some disclosed embodiments, the speech synthesis apparatus 100 includes a synthesis network training module, which includes: a sample prosody extraction sub-module, configured to extract a third sample global prosody feature and a sixth sample local prosody feature of the sample speech; a sample synthesis sub-module, configured to synthesize a sample synthesized speech based on the sample text corresponding to the sample speech, the third sample global prosody feature and the sixth sample local prosody feature; and a fifth adjusting sub-module, configured to adjust at least the network parameters of the synthesis network based on the difference between the sample speech and the sample synthesized speech.
Since the global prosody features and the local prosody features are decoupled from the speaker, the prosody of one speaker can be transferred to a speaker who lacks that prosody; a speaker without emotion data can thus reproduce the varied emotional expressions of an emotional speaker, which greatly reduces the demand for high-quality emotion and style data.
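A minimal sketch of one synthesis-network training step, assuming the two converged prosody extraction networks are frozen and an L1 reconstruction loss (for example over spectrograms) stands in for the difference between the sample speech and the sample synthesized speech:

```python
import torch
import torch.nn.functional as F

def synthesis_network_step(frozen_global_extractor, frozen_local_extractor,
                           synthesizer, optimizer, sample_speech, sample_text):
    """One training step for the synthesis network with frozen prosody extractors."""
    with torch.no_grad():
        g3 = frozen_global_extractor(sample_speech)  # third sample global prosody
        l6 = frozen_local_extractor(sample_speech)   # sixth sample local prosody
    sample_synth = synthesizer(sample_text, g3, l6)  # sample synthesized speech
    # Reconstruction loss; assumes the synthesized and reference utterances are
    # represented so they are directly comparable (e.g. mel spectrograms).
    loss = F.l1_loss(sample_synth, sample_speech)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```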
Referring to fig. 11, fig. 11 is a schematic diagram of a frame of an electronic device 110 according to an embodiment of the present application. The electronic device 110 comprises a memory 111 and a processor 112 coupled to each other, the memory 111 stores program instructions, and the processor 112 is configured to execute the program instructions to implement the steps in any of the above-described embodiments of the speech synthesis method. Specifically, electronic device 110 may include, but is not limited to: desktop computers, notebook computers, servers, mobile phones, tablet computers, and the like, without limitation.
In particular, the processor 112 is configured to control itself and the memory 111 to implement the steps in any of the above speech synthesis method embodiments. The processor 112 may also be referred to as a CPU (central processing unit). The processor 112 may be an integrated circuit chip having signal processing capabilities. The processor 112 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. In addition, the processor 112 may be jointly implemented by a plurality of integrated circuit chips.
According to the above scheme, on the one hand, sentence-level prosodic feature information is referenced in the synthesis process, so the synthesized speech as a whole can be controlled to conform to the first voice attribute; on the other hand, word-level prosodic features are further referenced in the synthesis process, so the subtle local prosodic variation of the synthesized speech can be controlled. Therefore, speech with different prosody can be synthesized accurately and freely, improving adaptability to different scenarios.
Referring to fig. 12, fig. 12 is a block diagram illustrating an embodiment of a computer-readable storage medium 120 according to the present application. The computer readable storage medium 120 stores program instructions 121 that can be executed by the processor, the program instructions 121 being for implementing the steps in any of the speech synthesis method embodiments described above.
According to the above scheme, on the one hand, sentence-level prosodic feature information is referenced in the synthesis process, so the synthesized speech as a whole can be controlled to conform to the first voice attribute; on the other hand, word-level prosodic features are further referenced in the synthesis process, so the subtle local prosodic variation of the synthesized speech can be controlled. Therefore, speech with different prosody can be synthesized accurately and freely, improving adaptability to different scenarios.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (16)

1. A method of speech synthesis, comprising:
acquiring a text to be synthesized, a first voice attribute and a second voice attribute; wherein the first voice attribute comprises at least one of an emotion category and a style category, and the second voice attribute comprises a speaker identifier;
acquiring a global prosodic feature with the first voice attribute, and predicting based on the text to be synthesized, the first voice attribute and the second voice attribute to obtain a local prosodic feature; wherein the global prosodic features comprise sentence-level prosodic feature information and the local prosodic features comprise word-level prosodic feature information;
and synthesizing based on the text to be synthesized, the global prosody feature and the local prosody feature to obtain synthesized voice.
2. The method according to claim 1, wherein the synthesized speech is synthesized based on a speech synthesis model, the speech synthesis model is trained based on sample data, the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and the global prosodic feature is obtained based on a sample global prosodic feature of reference sample speech, the reference sample speech being sample speech with the first speech attribute.
3. The method of claim 2, wherein the obtaining the global prosodic feature having the first voice attribute comprises any one of:
fusing sample global prosodic features of the reference sample voices to obtain the global prosodic features;
taking a sample global prosody feature of a first target sample voice as the global prosody feature;
taking a sample global prosodic feature of a second target sample voice as the global prosodic feature;
the first target sample voice and the second target sample voice are selected from the reference sample voice, the text similarity between the sample text corresponding to the first target sample voice and the text to be synthesized meets a first condition, and the presentation strength of the second target sample voice relative to the first voice attribute meets a second condition.
4. The method of claim 2, wherein the speech synthesis model comprises a global prosody extraction network, and wherein the global prosody extraction network is configured to extract sample global prosody features, wherein the sample speech attributes comprise at least a sample first speech attribute, and wherein the training of the global prosody extraction network comprises:
extracting a first sample global prosodic feature of the sample speech;
synthesizing to obtain first sample synthesized voice based on the first sample global prosody feature and a sample text corresponding to the sample voice, predicting based on the first sample global prosody feature to obtain a predicted first voice attribute of the sample voice, and extracting a second sample global prosody feature of the first sample synthesized voice;
adjusting at least network parameters of the global prosody extraction network based on differences between the predicted first speech attributes and the sample first speech attributes and differences between the first sample global prosody features and the second sample global prosody features.
5. The method of claim 1, wherein the synthesized speech is synthesized based on a speech synthesis model, and wherein the speech synthesis model comprises a local prosody prediction network configured to directly predict the local prosody features based on the text to be synthesized, the first speech attributes, and the second speech attributes.
6. The method of claim 5, wherein the speech synthesis model further comprises a local prosody extraction network, wherein the speech synthesis model is trained based on sample data, wherein the sample data comprises sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and wherein the local prosody prediction network is trained based on the sample data after the local prosody extraction network training converges.
7. The method of claim 6, wherein the sample speech attributes comprise at least a sample second speech attribute, and wherein the training of the local prosody extraction network comprises:
extracting a first sample local prosody feature of the sample voice and extracting a first text feature of the sample text;
predicting based on the first sample local prosody feature to obtain a predicted second voice attribute of the sample voice, synthesizing based on the first sample local prosody feature and a sample text corresponding to the sample voice to obtain a second sample synthesized voice, and extracting a second sample local prosody feature of the second sample synthesized voice;
adjusting at least network parameters of the local prosody extraction network based on differences between the predicted second speech attributes and the sample second speech attributes, differences between the first sample local prosody feature and the first text feature, and differences between the first sample local prosody feature and the second sample local prosody feature.
8. The method of claim 6, wherein the step of training the local prosody prediction network comprises:
extracting a third sample local prosody feature of the sample voice based on the local prosody extraction network, and predicting a sample text corresponding to the sample voice and a sample voice attribute labeled by the sample voice based on the local prosody prediction network to obtain a predicted local prosody feature;
adjusting network parameters of the local prosody prediction network based on a difference between the third sample local prosody feature and the predicted local prosody feature.
9. The method of claim 1, wherein the synthesized speech is synthesized based on a speech synthesis model, and the speech synthesis model comprises a prosody coding network and a text coding network, the text coding network is configured to predict speech attribute features based on the text to be synthesized and the first speech attribute, the speech attribute features are speaker-independent, and the prosody coding network is configured to predict the local prosody features based on the speech attribute features, the text to be synthesized, and the second speech attribute.
10. The method of claim 9, wherein the speech synthesis model further comprises a local prosody extraction network, wherein the speech synthesis model is trained based on sample data, wherein the sample data comprises sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and wherein the prosody coding network and the text coding network are trained based on the sample data after the local prosody extraction network training converges.
11. The method of claim 10, wherein the sample speech attributes comprise a sample first speech attribute and a sample second speech attribute, and wherein the training of the prosodic coding network and the text coding network comprises:
extracting a fourth sample local prosodic feature of the sample voice;
extracting sample voice attribute features of the fourth sample local prosody features, and predicting to obtain predicted voice attribute features based on the sample text and the sample first voice attributes;
predicting to obtain a fifth sample local prosody feature based on the sample text, the sample second voice attribute and the predicted voice attribute feature;
adjusting network parameters of the prosodic coding network and the text coding network based on a difference between the sample speech attribute features and the predicted speech attribute features and a difference between the fourth sample local prosodic features and the fifth sample local prosodic features.
12. The method of claim 1, wherein the synthesized speech is synthesized based on a speech synthesis model, the speech synthesis model includes a global prosody extraction network, a local prosody extraction network, and a synthesis network, the speech synthesis model is trained based on sample data, the sample data includes sample speech labeled with sample speech attributes and sample text corresponding to the sample speech, and the synthesis network is trained based on the sample data after the global prosody extraction network and the local prosody extraction network are respectively trained and converged.
13. The method of claim 12, wherein the step of training the synthetic network comprises:
extracting a third sample global prosodic feature and a sixth sample local prosodic feature of the sample speech;
synthesizing based on a sample text corresponding to the sample voice, the third sample global prosody feature and a sixth sample local prosody feature to obtain a sample synthesized voice;
adjusting at least a network parameter of the synthesis network based on a difference between the sample speech and the sample synthesized speech.
14. A speech synthesis apparatus, comprising:
the acquisition module is used for acquiring a text to be synthesized, a first voice attribute and a second voice attribute; wherein the first voice attribute comprises at least one of an emotion category and a style category, and the second voice attribute comprises a speaker identifier;
the global prosody module is used for acquiring global prosody features with the first voice attributes, and the local prosody module is used for predicting based on the text to be synthesized, the first voice attributes and the second voice attributes to obtain local prosody features; wherein the global prosodic features comprise sentence-level prosodic feature information and the local prosodic features comprise word-level prosodic feature information;
and the synthesis module is used for synthesizing based on the text to be synthesized, the global prosody feature and the local prosody feature to obtain synthesized voice.
15. An electronic device comprising a memory and a processor coupled to each other, the memory having stored therein program instructions, the processor being configured to execute the program instructions to implement the speech synthesis method of any one of claims 1 to 13.
16. A computer-readable storage medium, characterized in that program instructions executable by a processor for implementing the speech synthesis method of any one of claims 1 to 13 are stored.
CN202111650035.6A 2021-12-30 2021-12-30 Speech synthesis method and related device, electronic equipment and storage medium Pending CN114283781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111650035.6A CN114283781A (en) 2021-12-30 2021-12-30 Speech synthesis method and related device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111650035.6A CN114283781A (en) 2021-12-30 2021-12-30 Speech synthesis method and related device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114283781A (en) 2022-04-05

Family

ID=80878680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111650035.6A Pending CN114283781A (en) 2021-12-30 2021-12-30 Speech synthesis method and related device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114283781A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20230606
Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui
Applicant after: University of Science and Technology of China
Applicant after: IFLYTEK Co.,Ltd.
Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui
Applicant before: IFLYTEK Co.,Ltd.