CN114387946A - Training method of speech synthesis model and speech synthesis method - Google Patents

Training method of speech synthesis model and speech synthesis method

Info

Publication number
CN114387946A
Authority
CN
China
Prior art keywords
target
phoneme
data
text
speech synthesis
Prior art date
Legal status
Pending
Application number
CN202011128918.6A
Other languages
Chinese (zh)
Inventor
卢春晖
文学
刘若澜
陈萧
楼晓雁
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to CN202011128918.6A
Publication of CN114387946A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application provides a training method for a speech synthesis model and a speech synthesis method, wherein the model comprises a text encoding module, a text decoding module and a first predictive coding module. The training method comprises the following steps: acquiring a training data set, the training data set comprising voice data of a pronunciation object and text data corresponding to the voice data; obtaining a phoneme coding sequence based on the text data through the text encoding module; obtaining a first phoneme implicit representation based on the voice data through the first predictive coding module; obtaining speech synthesis data based on the phoneme coding sequence, the first phoneme implicit representation, the pronunciation object embedding and the emotion embedding through the text decoding module; and updating the speech synthesis model based on the voice data and the speech synthesis data. Implementing the application can effectively improve the naturalness and emotional expressiveness of the emotional speech corresponding to the pronunciation object.

Description

Training method of speech synthesis model and speech synthesis method
Technical Field
The application relates to the technical field of speech synthesis and artificial intelligence, in particular to a training method and a speech synthesis method of a speech synthesis model.
Background
Speech synthesis is a technology for converting input text into speech output, and emotion speech synthesis (ESS) is a technology for generating natural, expressive speech with different predefined emotion categories from the input text. Multi-speaker emotion speech synthesis builds on emotion speech synthesis to realize speech synthesis for multiple speakers and multiple emotions with a single model.
In the prior art, multi-speaker emotion speech synthesis technology can convert text in a given language into the speech of a specified speaker with a specified emotion. Although the prior art realizes speech synthesis for a plurality of emotions in this way, the synthesized emotional speech has low naturalness and insufficient emotional expressiveness.
Disclosure of Invention
The purpose of the present application is to provide a method for training a speech synthesis model and a speech synthesis method, so as to improve the naturalness and emotional expressive power of emotional speech corresponding to a pronunciation object. The scheme provided by the embodiment of the application is as follows:
in one aspect, the present application provides a method for training a speech synthesis model, where the speech synthesis model includes a text encoding module, a text decoding module, and a first predictive encoding module; the method comprises the following steps:
acquiring a training data set; the training data set comprises voice data of a pronunciation object and text data corresponding to the voice data;
obtaining a phoneme coding sequence based on the text data through the text coding module;
obtaining, by the first predictive coding module, a first phoneme implicit representation based on the speech data;
obtaining predicted speech synthesis data based on the phoneme coding sequence, the first phoneme implicit representation, and the embedding of the pronunciation object and the embedding of the emotion through the text decoding module;
updating the speech synthesis model based on the speech data and predicted speech synthesis data.
In another aspect, the present application provides a speech synthesis method, including:
acquiring a target synthetic text, target pronunciation object embedding and target emotion embedding;
when target voice data consistent with a target synthesized text and target emotion are stored, acquiring voice synthesized data of the target pronunciation object corresponding to the target emotion through a voice synthesis model based on the target synthesized text, target pronunciation object embedding, target emotion embedding, target voice data and prestored pronunciation object embedding corresponding to the target voice data; the speech synthesis model is obtained by training through the training method of the speech synthesis model provided by the application;
when target voice data consistent with the target synthesized text and the target emotion do not exist, acquiring voice synthesized data of the target pronunciation object corresponding to the target emotion through a voice synthesis model based on the target synthesized text, target pronunciation object embedding, target emotion embedding and any pre-stored pronunciation object embedding; the speech synthesis model is obtained by training through the training method of the speech synthesis model provided by the application.
In another aspect, the present application provides a training apparatus for a speech synthesis model, where the speech synthesis model includes a text encoding module, a text decoding module, and a first predictive encoding module; the training apparatus includes:
the training data acquisition module is used for acquiring a training data set; the training data set comprises voice data of a pronunciation object and text data corresponding to the voice data;
the text coding module is used for obtaining a phoneme coding sequence based on the text data through the text coding module;
a phoneme coding module, configured to obtain, by the first predictive coding module, a first phoneme implicit representation based on the speech data;
the text decoding module is used for obtaining predicted speech synthesis data based on the phoneme coding sequence, the first phoneme implicit representation, the pronunciation object embedding and the emotion embedding;
an update module to update the speech synthesis model based on the speech data and predicted speech synthesis data.
In yet another aspect, the present application provides a speech synthesis apparatus, including:
the target data acquisition module is used for acquiring a target synthetic text, target pronunciation object embedding and target emotion embedding;
the first speech synthesis module is used for acquiring, when target voice data consistent with the target synthesized text and the target emotion exists, speech synthesis data of the target pronunciation object corresponding to the target emotion through a speech synthesis model based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding, the target voice data and the pre-stored pronunciation object embedding corresponding to the target voice data; the speech synthesis model is obtained by training through the training method of the speech synthesis model provided by the application;
the second voice synthesis module is used for obtaining voice synthesis data of the target pronunciation object corresponding to the target emotion through a voice synthesis model based on the target synthesized text, target pronunciation object embedding, target emotion embedding and any pre-stored pronunciation object embedding when target voice data consistent with the target synthesized text and the target emotion does not exist; the speech synthesis model is obtained by training through the training method of the speech synthesis model provided by the application.
In another aspect, the present application provides an electronic device comprising a memory and a processor; the memory has a computer program stored therein; and the processor is used for executing the training method of the speech synthesis model provided by the application or executing the speech synthesis method provided by the embodiment of the application when the computer program is run.
In another aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program performs a training method of a speech synthesis model provided in the present application, or performs a speech synthesis method provided in an embodiment of the present application.
The advantageous effects brought by the technical solutions provided by the embodiments of the present application will be described in detail in the following description of the specific embodiments with reference to various alternative embodiments, and will not be further described herein.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1a is a block diagram of a prior art multi-pronouncing object speech synthesis system;
FIG. 1b is a schematic block diagram of a prior art multi-pronouncing object speech synthesis;
FIG. 2 is a schematic flow chart of a method for training a speech synthesis model according to the present application;
FIG. 3 is a block diagram of a speech synthesis model provided herein;
FIG. 4 is a block diagram of another embodiment of a speech synthesis model provided herein;
FIG. 5 is a block diagram of another embodiment of a speech synthesis model provided herein;
FIG. 6 is a schematic diagram illustrating a training method of a second predictive coding model in a speech synthesis model according to the present application;
FIG. 7 is a schematic flow chart of a speech synthesis method provided herein;
FIG. 8 is a schematic diagram illustrating a speech synthesis method according to the present application;
FIG. 9 is a block diagram of a training apparatus for a speech synthesis model according to the present application;
fig. 10 is a block diagram of a speech synthesis apparatus according to the present application;
fig. 11 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
For better understanding and description of the solutions provided by the embodiments of the present application, the related art to which the present application relates will be described first.
Speech synthesis is generally divided into a front-end module and a back-end module. The back-end module comprises an acoustic model and a vocoder, and the front-end module is responsible for analyzing text data and extracting the relevant information required by the back-end module. The present application is directed primarily to improvements in the acoustic model in the back-end processing of speech synthesis.
The emotional voice synthesis comprises emotional voice synthesis of a single pronunciation object and emotional voice synthesis of a multi-pronunciation object.
In emotion speech synthesis for a single pronunciation object, most prior-art emotion speech synthesis systems are built using emotional speech from a single pronunciation object; for example, emotional statistical parametric speech synthesis (SPSS) has been implemented with emotion codes, an end-to-end emotional speech synthesizer has been built by injecting a learned emotion embedding, and different emotions have been represented by trained Global Style Tokens (GST). In addition to research on emotion representations, some research focuses on controlling the emotional intensity of synthesized speech. For example, emotion intensity has been controlled by a simple continuous scalar learned through a ranking function based on relative attributes measured on the emotion data set, and the intensity of a target emotion has been gradually varied toward neutral emotion based on an interpolation technique.
In emotion speech synthesis for multiple pronunciation objects, referring to the prior-art emotion speech synthesis system shown in fig. 1a, the pronunciation object (which may also be referred to as the speaking object) of the synthesized speech and the emotion corresponding to that pronunciation object are controlled, in addition to the encoder and the decoder, by a pronunciation object embedding and an emotion embedding, respectively. Referring to fig. 1b, a schematic block diagram of prior-art multi-pronunciation-object speech synthesis, a text sequence is obtained after text analysis of the input text data and used as input data of an acoustic model; a target pronunciation object selected from pre-stored pronunciation objects 1-n and a target emotion selected from pre-stored emotions 1-n are also input into the acoustic model; the acoustic model processes the text sequence, the target pronunciation object and the target emotion to obtain a Mel spectrogram, and a vocoder then processes the Mel spectrogram to obtain audio data. However, when the combination of the input target pronunciation object and target emotion was never seen in the training data, the emotional speech represented by the Mel spectrogram produced by the acoustic model has low naturalness and insufficient emotional expressiveness.
In addition, the prior art does not handle the case of speakers who have only neutral-emotion voice data. Because the data of such neutral-emotion speakers is never combined with non-neutral emotions during training, i.e., combinations of neutral-emotion speaker embeddings with non-neutral emotion embeddings do not occur, representing different emotions solely by emotion embeddings causes the non-neutral emotion embeddings to be bound to the emotional speakers and the neutral-emotion speaker embeddings to be bound only to the neutral emotion; as a result, the synthesized emotional speech corresponding to the neutral-emotion speakers has low naturalness and insufficient emotional expressiveness.
In addition, to realize emotion speech synthesis for multiple pronunciation objects, the prior art has studied different combinations of pronunciation object representations and emotion representations based on a convolutional neural network (CNN), using training data with multiple pronunciation objects each having multiple emotion types. Since some pronunciation objects have no corresponding emotional speech, emotional speech migration has also been studied; for example, the differences between neutral emotion and other emotions have been modeled with an Emotion Additive Model (EAM), and migrated emotion representations have been studied based on deep neural network (DNN) architectures. On this basis, one may consider learning emotion-related prosodic features by extracting a sentence-level implicit representation (coarse-grained prosody modeling); this approach can learn some prosodic features, but it lacks controllability and robustness.
To solve at least one of the above problems, the present application provides a training method for a speech synthesis model and a speech synthesis method based on multi-pronunciation-object emotion speech synthesis and fine-grained prosody modeling: phoneme-level, emotion-related prosodic features are learned through fine-grained prosody modeling, and when emotion speech synthesis is performed with a speech synthesis model trained by this training method, the naturalness and emotional expressiveness of the emotional speech corresponding to the pronunciation object can be effectively improved.
In order to make the objects, technical solutions and advantages of the present application clearer, various alternative embodiments of the present application, and how the technical solutions of the embodiments solve the above technical problems, are described in detail below with reference to specific embodiments and the drawings. The following specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 2 shows a training method of a speech synthesis model provided in this embodiment. The training method may be executed by the electronic device provided in this embodiment, which may be a terminal or a server. The terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like; the server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), big data and artificial intelligence platforms, but is not limited thereto.

As shown in fig. 3, the speech synthesis model includes a text encoding module, a text decoding module, and a first predictive coding module. In the present application, the text encoding module and the text decoding module are respectively formed by a text encoder and a text decoder whose structures are consistent with FastSpeech (an end-to-end text-to-speech, E2E TTS, deep learning model based on self-attention and convolution). The first predictive coding module may be a reference encoder and may specifically be formed by a variational auto-encoder (VAE); it may adopt a structure of 6 layers of 2-dimensional convolution followed by 1 gated recurrent unit (GRU) layer, although the application is not limited thereto and other network structures may also be adopted. As shown in fig. 2, the training method may include the following steps S101 to S105:
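For illustration, the following is a minimal PyTorch-style sketch of a phoneme-level reference encoder with 6 layers of 2-D convolution and one GRU layer that outputs a 3-dimensional mean and log-variance per phoneme. It is a sketch under assumptions (channel counts, conditioning by a concatenated speaker/emotion vector, pooling frames into phonemes using given durations, and a uniform phoneme count per batch item), not the patent's actual implementation.

```python
import torch
import torch.nn as nn


class PhonemeReferenceEncoder(nn.Module):
    """Hypothetical interface: returns a (mu, logvar) pair of shape (B, U, 3)."""

    def __init__(self, n_mels=80, cond_dim=64, latent_dim=3, channels=32):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(6):                                   # 6 layers of 2-D convolution
            layers += [nn.Conv2d(in_ch, channels, kernel_size=3, stride=(1, 2), padding=1),
                       nn.BatchNorm2d(channels), nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        freq = n_mels
        for _ in range(6):
            freq = (freq + 1) // 2                           # mel axis is halved per layer
        self.gru = nn.GRU(channels * freq + cond_dim, 128, batch_first=True)  # 1 GRU layer
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)

    def forward(self, mel, cond, durations):
        # mel: (B, T, n_mels); cond: (B, cond_dim) speaker/emotion(/tone) conditioning;
        # durations: (B, U) frames per phoneme (>= 1), summing to T; U assumed equal in the batch
        x = self.convs(mel.unsqueeze(1))                     # (B, C, T, F')
        B, C, T, F = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * F)
        x = torch.cat([x, cond.unsqueeze(1).expand(B, T, -1)], dim=-1)
        x, _ = self.gru(x)                                   # frame-level prosody features
        phoneme_feats = []
        for b in range(B):                                   # average the frames inside each phoneme
            start, feats = 0, []
            for d in durations[b].tolist():
                feats.append(x[b, start:start + int(d)].mean(dim=0))
                start += int(d)
            phoneme_feats.append(torch.stack(feats))
        h = torch.stack(phoneme_feats)                       # (B, U, 128)
        return self.to_mu(h), self.to_logvar(h)              # per-phoneme mean and log-variance
```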
step S101: a training data set is obtained.
The training data set comprises voice data of a pronunciation object and text data corresponding to the voice data. Optionally, the voice data may correspond to different predefined emotion types; the predefined emotion types may include four emotion types, such as angry, happy, neutral, and sad. The pronunciation object can be a human, an animal, or the like, and the model can be trained with the voice data and text data of different pronunciation objects for different application scenarios; the following embodiments mainly describe the case where the pronunciation object is a person. When adapting to emotional speech synthesis for multiple pronunciation objects, the training data set for training the speech synthesis model may include voice data of a plurality of pronunciation objects and the text data corresponding to that voice data. For example, it may include data for 10 pronunciation objects, 2 of which (one male, one female) have a small amount of speech data with the predefined emotions (angry, happy, neutral, sad; e.g., 500 sentences of speech data per emotion), while the other 8 (4 male, 4 female) have only speech data of neutral emotion (1000 sentences each). Specifically, the training data set includes the data shown in table 1 below:
TABLE 1
Pronunciation object type | Number of speakers | Emotion categories | Speech data
Emotion pronunciation object | 2 (1 male, 1 female) | angry, happy, neutral, sad | 500 sentences per emotion
Neutral pronunciation object | 8 (4 male, 4 female) | neutral only | 1000 sentences per speaker
Optionally, in the embodiment of the present application, a pronunciation object whose voice data covers all predefined emotion categories is referred to as an emotion pronunciation object, and a pronunciation object with only neutral-emotion voice data is referred to as a neutral pronunciation object; that is, the pronunciation object types include emotion pronunciation objects and neutral pronunciation objects. The content of the text data corresponding to the voice data is not limited.
Optionally, the speech data has phoneme boundary time information, which may be manually labeled or obtained by forced alignment with the text data using a recognition model.
Step S102: and obtaining a phoneme coding sequence based on the text data through a text coding module.
As shown in fig. 3, step S102 can be understood as a process of text encoding by the text encoder, whose output is a phoneme coding sequence (also referred to as the text encoding); specifically, it can be realized by the following steps A1-A2:
step A1: obtaining a text sequence through front-end processing of voice synthesis based on text data; step A2: and inputting the text sequence into a text coding module to obtain a phoneme coding sequence.
Specifically, the front-end processing of speech synthesis includes a plurality of sequence labeling tasks, such as word segmentation, prosodic boundary prediction, and polyphonic disambiguation; the text sequence obtained after the front-end processing is a phoneme sequence and serves as input data for the back-end processing of speech synthesis. A phoneme is the minimum speech unit divided according to the natural attributes of speech; for Chinese it is divided into the two categories of vowels and consonants. Phonemes are analyzed according to the pronunciation actions within a syllable, one action constituting one phoneme; for example, among Chinese syllables, 'a' corresponds to one phoneme, 'ai' corresponds to one phoneme, and 'd ai' corresponds to two phonemes.
Optionally, in the embodiment of the present application, the text encoding module takes the text sequence y_1, …, y_U as input data, and the output data obtained is the phoneme coding sequence h_1, …, h_U, where U is the number of phonemes.
For Chinese, the text sequence is a sequence with tonal finals; for example, 'h e2 ch eng2' corresponds to the word 合成 ('synthesize'). In the front-end processing of speech synthesis, the different tones can be represented by the numbers 1, 2, 3 and 4, respectively; alternatively, 0 or null can be used to represent the toneless light tone.
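Purely as an illustration of this kind of representation (not the patent's actual front end), the following toy Python sketch splits tonal pinyin syllables into a phoneme sequence plus a parallel tone sequence; the initial table and the split rule are hypothetical.

```python
INITIALS = {"b", "d", "h", "ch", "sh", "zh"}          # assumed subset of Mandarin initials


def syllable_to_phonemes(syllable: str):
    """Split one tonal pinyin syllable, e.g. 'he2', into (phoneme, tone) pairs."""
    tone = int(syllable[-1]) if syllable[-1].isdigit() else 0   # 0 = toneless light tone
    base = syllable.rstrip("012345")
    for init in sorted(INITIALS, key=len, reverse=True):
        if base.startswith(init) and len(base) > len(init):
            return [(init, tone), (base[len(init):], tone)]     # initial + tonal final
    return [(base, tone)]                                       # syllable without an initial


# "合成" ("synthesize") -> pinyin "he2 cheng2"
pairs = [p for s in "he2 cheng2".split() for p in syllable_to_phonemes(s)]
print([p for p, _ in pairs])   # ['h', 'e', 'ch', 'eng']  phoneme sequence
print([t for _, t in pairs])   # [2, 2, 2, 2]             parallel tone sequence
```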
Step S103: a first phoneme implicit representation is derived based on the speech data by a first predictive coding module.
In the embodiment of the present application, the first predictive coding module is introduced into the speech synthesis model on the basis of fine-grained prosody modeling. Specifically, phoneme-level variation that cannot be predicted from the input text data is captured by a VAE-like predictive encoder in order to extract the relative variation features of prosody related to emotion (such as fundamental frequency, energy, etc.), referred to in the embodiments of the present application as the phoneme implicit representation. As shown in fig. 3, step S103 may be understood as the first predictive coding module performing phoneme implicit encoding on the first Mel spectrogram, the output being the first phoneme implicit representation (which may also be referred to as the first phoneme implicit coding).
Step S104: and obtaining predicted speech synthesis data based on the phoneme coding sequence, the first phoneme implicit expression, the pronunciation object embedding and the emotion embedding through a text decoding module.
Specifically, the first phoneme implicit representation is spliced onto the phoneme coding sequence to obtain a new phoneme coding sequence, and the new phoneme coding sequence is then combined with the pronunciation object embedding and the emotion embedding to obtain an extended phoneme coding sequence as the input data of the text decoding module. As shown in fig. 3, after the input data is processed by the text decoder, the predicted speech synthesis data obtained is a second Mel spectrogram.
Step S105: the speech synthesis model is updated based on the speech data and the predicted speech synthesis data.
Specifically, a loss function may be constructed from the speech data and the predicted speech synthesis data, and then a value of the loss function may be calculated to update the network parameters of the speech synthesis model.
Compared with the prior art, the embodiment of the present application adds a first predictive coding module on the basis of fine-grained prosody modeling, so that, when neutral pronunciation object embeddings are included in the training combinations, the naturalness and emotional expressiveness of the emotional speech corresponding to a pronunciation object are improved when the model is applied to speech synthesis.
Various alternative embodiments provided by the present application are described in detail below.
In an embodiment, in order to ensure that the implicit representation of the first phoneme obtained based on the first predictive coding module is independent of the pronunciation object, in the embodiment of the present application, the speech synthesis model further includes a first classification module based on the pronunciation object and a Gradient Reversal Layer (GRL) connecting the first classification module and the first predictive coding module, so as to implement unbinding of emotion and pronunciation object and improve naturalness and emotional expressiveness of the synthesized emotion speech. Specifically, the training method further includes step B1: and inputting the first phoneme implicit representation into a first classification module to obtain a first classification result.
Specifically, the first predictive coding module and the first classification module realize domain confrontation training through a gradient inversion layer. Alternatively, the first classification result may be understood as a posterior distribution of the pronunciation object.
Accordingly, step S105 updates the speech synthesis model based on the speech data and the predicted speech synthesis data, including the following steps S1051-S1054:
step S1051: a first loss function is calculated based on the speech data and the predicted speech synthesis data.
Specifically, a specific calculation process regarding the first loss function in step S1051 will be explained in the subsequent embodiments, which will not be explained here.
Step S1052: a second loss function is calculated based on the cross-entropy loss of the first classification result and the prior distribution corresponding to the first classification result.
Specifically, when the constructed adversarial network includes only the first classification module for the pronunciation object, the second loss function is constructed from the cross-entropy loss between the first classification result and its corresponding prior distribution.
Step S1053: and processing the second loss function through the gradient inversion layer to obtain a processed second loss function.
Specifically, in the embodiment of the present application the gradient reversal layer acts during back propagation: the sign of the corresponding weight gradients is inverted, thereby achieving the adversarial objective; that is, the second loss function processed by the gradient reversal layer is characterized as a sign-inverted function.
Step S1054: and updating the speech synthesis model based on the first loss function and the processed second loss function.
Specifically, a total loss function is constructed through a first loss function and a second loss function, so that the network parameters of the speech synthesis model are updated based on the total loss function. In the total loss function, the first loss function corresponds to a positive value, and the processed second loss function corresponds to a negative value.
In one embodiment, when speech synthesis is performed for chinese, the input data of the first predictive coding module also includes tone embedding at the time of training. Accordingly, as shown in fig. 4, in order to ensure that the implicit representation of the first phoneme obtained based on the first predictive coding module is independent of the pronunciation object and the tone, in the embodiment of the present application, the speech synthesis model further includes a second classification module based on the tone in addition to the first classification module described in the above embodiment; the Gradient Reverse Layer (GRL) is also connected with the second classification module and the first prediction coding module, so that the emotion and pronunciation object, emotion and tone are unbound, and the naturalness and emotional expressive power of the synthesized emotion voice are improved.
Specifically, the training method further includes the following step B2: and the first phoneme is implicitly expressed and input into a second classification module to obtain a second classification result.
Specifically, the first predictive coding module and the second classification module realize domain confrontation training through a gradient inversion layer. Alternatively, the second classification result may be understood as a posterior distribution of the tones.
Accordingly, in step S1052, calculating the second loss function based on the cross-entropy loss of the first classification result includes: calculating a first cross-entropy loss between the first classification result and its corresponding prior distribution, calculating a second cross-entropy loss between the second classification result and its corresponding prior distribution, and obtaining the second loss function based on the sum of the first cross-entropy loss and the second cross-entropy loss.
Specifically, when the first classification module for the pronunciation object and the second classification module for the tone are included in the constructed confrontation network, the second loss function is constructed by the sum of the first cross-entropy loss and the second cross-entropy loss.
In the embodiment of the present application, to ensure that the first phoneme implicit representation z_u encoded by the first predictive coding module captures only the relative variation of prosody, the pronunciation object information and the tone information (Chinese tone is related to the absolute fundamental frequency F0) are removed by adding adversarial networks for the pronunciation object and the tone, respectively. Optionally, a gradient reversal layer is inserted before the inputs of the first and second classification modules; the gradient reversal layer scales the gradient by -λ_adv (i.e., inverts its sign), realizing the adversarial relationship between the first and second classification modules and the first predictive coding module.
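A minimal sketch of such a gradient reversal layer in PyTorch is shown below, under the standard domain-adversarial assumption that the forward pass is the identity and the backward pass multiplies gradients by -λ_adv; the classifier names in the usage comment are illustrative, not taken from the patent.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda_adv in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_adv):
        ctx.lambda_adv = lambda_adv
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # the gradient flowing back into the first predictive coding module is inverted and scaled
        return -ctx.lambda_adv * grad_output, None


def grad_reverse(x, lambda_adv=0.1):
    return GradReverse.apply(x, lambda_adv)


# usage sketch (classifier modules are illustrative):
#   speaker_logits = speaker_classifier(grad_reverse(z_u, lambda_adv))
#   tone_logits    = tone_classifier(grad_reverse(z_u, lambda_adv))
```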
In a possible embodiment, the first classification module and the second classification module described in the above embodiments may exist independently, and there is no dependency relationship between them. Accordingly, when there is only the first classification module or only the second classification module, the second loss function may be determined based only on the cross-entropy loss of the classification result of either classification module.
The first predictive coding module can be composed of a variational auto-encoder, and the adversarial network constructed with the first classification module and/or the second classification module may comprise a fully connected layer and a softmax activation function.
In an embodiment, the step S103 of obtaining, by the first predictive coding module, the first phoneme implicit representation based on the speech data includes: and inputting a first Mel spectrogram extracted based on the voice data, and a pronunciation object embedding and emotion embedding corresponding to the first Mel spectrogram into a first predictive coding module to obtain a first phoneme implicit expression.
Alternatively, the first mel spectrum extracted based on the voice data may be an output target of the voice synthesis model. The pronunciation object and emotion inputted into the first predictive coding module may correspond to the first mel-frequency spectrum.
In a possible embodiment, when the speech synthesis model is applied to a speech synthesis process of chinese, the data input to the first predictive coding module during training may further include tone embedding, that is, embedding the first mel spectrogram and the pronunciation object corresponding to the first mel spectrogram, emotion embedding, and tone embedding into the first predictive coding module to obtain the first phoneme implicit representation. With respect to tone embedding, four tones and a "no tone" light tone are used in the embodiment of the present application.
Specifically, the first Mel spectrogram, which is phoneme-aligned with the text sequence, is used as the input of the first predictive coding module, and the pronunciation object, emotion and tone corresponding to the first Mel spectrogram are added as supplementary input information of the first predictive coding module; the VAE predicts a 3-dimensional mean μ_u and variance σ_u, and the 3-dimensional first phoneme implicit representation z_u is resampled from them (the resampling process is indicated by the dashed line in fig. 4).
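The resampling step indicated by the dashed line is typically implemented with the VAE reparameterization trick; the short sketch below illustrates it, under the assumption that the module predicts a log-variance rather than the variance directly.

```python
import torch


def sample_phoneme_latent(mu, logvar):
    # mu, logvar: (batch, U, 3) predicted per phoneme by the first predictive coding module
    std = torch.exp(0.5 * logvar)          # sigma_u
    eps = torch.randn_like(std)            # noise keeps sampling differentiable w.r.t. mu and sigma
    return mu + eps * std                  # z_u, the first phoneme implicit representation
```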
In one embodiment, the speech synthesis model shown in FIG. 4 further includes a length adjustment module; step S104, obtaining predicted speech synthesis data through a text decoding module based on the phoneme coding sequence, the first phoneme implicit expression, and the pronunciation object embedding and emotion embedding, and includes the following steps S1041-S1042:
step S1041: and splicing the first phoneme implicit expression with a phoneme coding sequence, inputting the spliced phoneme coding sequence into a length adjusting module, and obtaining the phoneme coding sequence which is expanded based on the duration of each phoneme.
Specifically, the first phoneme implicit representation z_u is spliced with the corresponding phoneme coding h_u to obtain a new phoneme coding sequence (the concatenated phoneme codings), which is fed into the length adjustment module; the length adjustment module can be composed of a length adjuster.
Optionally, the number of Mel-spectrogram frames aligned with each phoneme may be referred to as the duration of that phoneme. The length adjustment module mainly expands the spliced phoneme coding sequence according to the duration of each phoneme: in the process of obtaining the expanded phoneme coding sequence, the length adjuster repeats each concatenated phoneme coding according to the phoneme duration (number of frames); for example, if the phoneme h2 is 3 frames long, the corresponding phoneme coding is repeated 3 times. In the embodiment of the present application, the purpose of the length adjustment module is to ensure that the phoneme coding sequence obtained from the text encoding module has the same length as the first Mel spectrogram sequence corresponding to the speech data.
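A minimal sketch of this expansion step (assuming a single utterance, integer frame durations, and illustrative tensor shapes):

```python
import torch


def length_regulate(phoneme_encodings, durations):
    # phoneme_encodings: (U, D) concatenated phoneme codings; durations: (U,) frames per phoneme
    return torch.repeat_interleave(phoneme_encodings, durations, dim=0)   # (sum(durations), D)


encodings = torch.randn(4, 8)                       # e.g. the 4 phonemes of "he2 cheng2"
durations = torch.tensor([3, 5, 2, 6])              # frame counts per phoneme
expanded = length_regulate(encodings, durations)    # shape (16, 8): matches the 16 Mel frames
```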
Step S1042: and splicing the expanded phoneme coding sequence with the pronunciation object embedding and emotion embedding corresponding to the first Mel spectrogram, and inputting the result into a text decoding module to obtain a predicted second Mel spectrogram.
Specifically, the expanded phoneme coding sequence is spliced with the pronunciation object embedding s and the emotion embedding e, and the spliced data is used as the input data of the text decoding module; the text decoding module processes the input data to obtain the predicted second Mel spectrogram. The second Mel spectrogram may also serve as the output of the speech synthesis model (i.e., in the embodiment of the present application, the speech synthesis data output by the model is specifically the second Mel spectrogram).
From the length-adjusted (expanded) phoneme coding sequence, together with the pronunciation object embedding and the emotion embedding, the text decoder can learn to predict the Mel spectrogram. Optionally, to represent different emotion types, the emotion-related data may be processed using a learnable emotion embedding, or using a GST-style weighted sum.
Optionally, the length adjustment module includes a duration prediction unit, and step S1041 inputs the spliced phoneme coding sequence into the length adjustment module to obtain a phoneme coding sequence extended based on the duration of each phoneme, and further includes the following steps S10411 to S10413:
step S10411: and determining the prediction duration of each phoneme in the phoneme coding sequence by a duration prediction unit based on the pronunciation object embedding and the emotion embedding corresponding to the first Mel spectrogram.
Step S10412: a target duration for each phoneme in the phoneme coding sequence is determined based on the first mel-spectrum.
Specifically, the target duration belongs to real information, and an additional TTS model can be used to extract attention alignment information between encoders and decoders. The target duration characterizes a real duration of each phoneme in the first mel-frequency spectrum corresponding to the speech data.
Step S10413: the duration prediction unit is updated based on the prediction duration and the target duration.
Alternatively, a loss function may be constructed by predicting the duration and the target duration, for example, a mean square error may be used as the loss function, and the duration prediction unit may be updated based on the loss function.
The above steps S10411 to S10413 are used for training the duration prediction unit, and when training the speech synthesis model, the target duration is specifically adopted for training. It is understood that the duration prediction unit is trained in conjunction with the speech synthesis model.
Alternatively, the training of the duration prediction unit may be performed while training the speech synthesis model. Specifically, in the training phase, the true duration of each phoneme may be extracted from the speech data in the training data set; and in the application stage (when speech synthesis is carried out online), the prediction duration of each phoneme can be determined by combining the target pronunciation object embedding and the target emotion embedding through the duration prediction unit. That is, the predicted duration of each phoneme depends on the target pronunciation object and the target emotion.
In a possible embodiment, in the duration prediction unit, besides determining the predicted duration of each phoneme based on the target pronunciation object and the target emotion, the predicted duration of each phoneme can be adjusted according to the given information of speech rate, pitch, etc.
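For illustration, the following is a FastSpeech-style sketch of such a duration prediction unit, conditioned on the pronunciation object and emotion embeddings and trained with a mean-squared error against the target durations; the layer sizes, embedding dimensions and module interface are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DurationPredictor(nn.Module):
    def __init__(self, enc_dim=256, emb_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(enc_dim + 2 * emb_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, h, spk_emb, emo_emb):
        # h: (B, U, enc_dim) phoneme codings; spk_emb, emo_emb: (B, emb_dim) each
        cond = torch.cat([spk_emb, emo_emb], dim=-1)                    # (B, 2*emb_dim)
        x = torch.cat([h, cond.unsqueeze(1).expand(-1, h.size(1), -1)], dim=-1)
        x = self.net(x.transpose(1, 2)).transpose(1, 2)                 # (B, U, hidden)
        return self.proj(x).squeeze(-1)                                 # (B, U) predicted durations


# training-step sketch: target_durations (B, U) come from forced alignment / attention alignment
# loss = F.mse_loss(predictor(h, spk_emb, emo_emb), target_durations.float())
```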
In one embodiment, step S105, updating the speech synthesis model based on the speech data and the predicted speech synthesis data, includes the following steps S1051-S1053:
step S1051: and determining the reconstruction error of the Mel spectrogram based on the first Mel spectrogram and the second Mel spectrogram.
Specifically, the first mel spectrum may be obtained by extracting the voice data in the training data, that is, in the training stage, the first mel spectrum may be regarded as a real mel spectrum; the second mel spectrogram is output data of the text decoding module and can also be used as a predicted mel spectrogram output by the speech synthesis model; on this basis, the reconstruction error between the real mel-frequency spectrum and the predicted mel-frequency spectrum is first determined before the first loss function is calculated.
Step S1052: a relative entropy loss is determined based on the a priori distributions of the first phoneme implicit representation corresponding to the first phoneme implicit representation.
Optionally, for the first predictive coding module added in this embodiment of the present application, the output data of the module is the first phoneme implicit representation; when the speech synthesis model includes the first predictive coding module, the first loss function includes, in addition to the reconstruction error obtained in step S1051, the relative entropy loss between the first phoneme implicit representation and its corresponding prior distribution. In the embodiment of the present application, the relative entropy can be understood as the KL divergence.
Step S1053: the value of the first loss function is calculated based on the reconstruction error and the relative entropy loss.
Specifically, the first loss function may be denoted L_ELBO, which is a β-VAE objective under a standard Gaussian prior; L_ELBO can be expressed as shown in the following formula (1):

L_ELBO = E_{q(z|x)} [ log p(x | y, z, s, e) ] - λ_KL · Σ_{u=1}^{U} D_KL( q(z_u | x, s, e) || N(0, I) )    (1)

wherein the first term in formula (1) is the reconstruction error obtained in step S1051 and the second term is the relative entropy loss obtained in step S1052; x is the Mel spectrogram, y is the text sequence, z denotes the first phoneme implicit representation, z_u corresponds to the implicit representation of the u-th phoneme, U is the number of phonemes, s is the pronunciation object embedding, e is the emotion embedding, D_KL denotes the KL divergence, and N(0, I) denotes the standard normal distribution. Optionally, 0 < λ_KL < 1.
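For reference, when the posterior q(z_u | x, s, e) is the 3-dimensional diagonal Gaussian N(μ_u, diag(σ_u²)) predicted by the first predictive coding module, the relative entropy term has the standard closed form below (a general VAE identity stated here for convenience, not quoted from the patent):

```latex
D_{\mathrm{KL}}\left(\mathcal{N}(\mu_u,\operatorname{diag}(\sigma_u^2)) \,\|\, \mathcal{N}(0,I)\right)
  = \frac{1}{2}\sum_{d=1}^{3}\left(\mu_{u,d}^{2}+\sigma_{u,d}^{2}-\log\sigma_{u,d}^{2}-1\right)
```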
Optionally, when the speech synthesis model is updated based on the first loss function and the second loss function, the total loss function L may be constructed from the first loss function and the second loss function, and may be expressed as shown in the following formula (2):

L = L_ELBO - λ_adv · ( L_spk + L_tone )    (2)

The total loss function can be seen as combining the training objectives of the first loss function and the second loss function (i.e., the evidence lower bound, ELBO, and domain adversarial training). Here L_spk + L_tone denotes the sum of the cross-entropy losses of the two adversarial networks, and λ_adv is the loss weight (expressed with a negative sign, which can be understood as the adversarial relationship between the first predictive coding module and the adversarial networks).
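As an implementation-level illustration (an assumption about how such an objective is commonly coded, not the patent's code): because the gradient reversal layer already realizes the negative sign of formula (2) during back propagation, the cross-entropy terms can simply be added to the loss. L1 is used here for the Mel reconstruction error, and the classifier logits/labels are assumed pre-flattened to (N, classes)/(N,).

```python
import torch
import torch.nn.functional as F


def total_loss(mel_pred, mel_target, mu, logvar,
               spk_logits, spk_labels, tone_logits, tone_labels,
               lambda_kl=0.1, lambda_adv=0.1):
    # reconstruction error between the predicted and real Mel spectrograms (L1 is an assumption)
    recon = F.l1_loss(mel_pred, mel_target)
    # relative entropy of each q(z_u|x,s,e) against N(0, I), averaged over phonemes and batch
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1).mean()
    # the classifier logits are computed through the gradient reversal layer, so simply adding the
    # cross-entropies trains the classifiers while the encoder is pushed to remove speaker/tone info
    ce = F.cross_entropy(spk_logits, spk_labels) + F.cross_entropy(tone_logits, tone_labels)
    return recon + lambda_kl * kl + lambda_adv * ce   # formula (2), with the GRL handling the sign
```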
In an embodiment, as shown in fig. 5 and 6, considering that in the case of online speech synthesis no Mel spectrogram of real speech data is available as input to the first predictive coding module, the speech synthesis model of the embodiment of the present application further includes a second predictive coding module, which, once trained, predicts the implicit representation of each phoneme under the target emotion, replacing the implicit representation extracted from real speech data.
Specifically, the training method of the speech synthesis model further includes the following steps S201 to S203:
step S201: and obtaining the mean value of the implicit representation of the second phoneme based on the speech data through a first predictive coding module in the trained speech synthesis model.
Specifically, the first predictive coding module (reference encoder) in the speech synthesis model trained through steps S101-S105 is used to extract the means μ_1, …, μ_U of the phoneme-level implicit representations of each piece of speech data in the training data, which serve as the output targets of the second predictive coding module. The second phoneme implicit representation may also be referred to as the second phoneme implicit coding.
Step S202: and inputting a text sequence obtained by performing front-end processing on the text data through speech synthesis into a second predictive coding module to obtain a third phoneme implicit expression.
Specifically, obtaining the text sequence by subjecting the text data to the front-end processing of speech synthesis can be understood with reference to step A1 and will not be described in detail here. The third phoneme implicit representation (also called the third phoneme implicit coding) can be understood as the predicted value output by the second predictive coding module.
Step S203: updating the second predictive coding module with the mean of the third phonetic implicit representation and the second phonetic implicit representation.
Optionally, unlike the training of the speech synthesis model itself, in the embodiment of the present application the second predictive coding module is trained independently: the first predictive coding module of the trained speech synthesis model extracts the phoneme implicit representation from the speech data, and this output is used as the target value; the second predictive coding module extracts the phoneme implicit representation from the text data, and this output is used as the predicted value; a third loss function (the mean square error between the predicted value and the target value) is constructed from the two, and the network parameters of the second predictive coding module are updated based on the third loss function. Thus the input data of the first predictive coding module is mainly the Mel spectrogram corresponding to the speech data, while the input data of the second predictive coding module is mainly the text sequence (phoneme sequence) corresponding to the text data.
In the embodiment of the present application, in order to directly predict the Phoneme implicit representation in the text data to be synthesized, a second predictive coding module (Phoneme implicit coding predictor) is additionally trained for the speech synthesis model in the embodiment of the present application in consideration that there may be no real speech data corresponding to the text data to be synthesized at the online speech synthesis stage. In addition to the text data, the predictor variables (implicit representation of the third phoneme) depend on the pronunciation object and the emotion. For each piece of speech data in the training data for training the second predictive coding module, a mean value of the phoneme implicit representations in the speech data is extracted from the first predictive coding module and used as a target value for the third implicit representation. The second predictive coding module is trained with a Mean Square Error (MSE) loss.
Specifically, when training the second predictive coding module, the data input to the second predictive coding module includes: the text sequence obtained through speech-synthesis front-end processing of the text data, and the emotion embedding and pronunciation object embedding corresponding to the speech data input into the first predictive coding module.
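A short sketch of this separate training step, reusing the reference-encoder interface sketched earlier and assuming hypothetical predictor/optimizer objects (illustrative only, not the patent's code):

```python
import torch
import torch.nn.functional as F


def train_predictor_step(predictor, reference_encoder, optimizer,
                         phoneme_seq, mel, spk_emb, emo_emb, durations):
    with torch.no_grad():                                   # the trained first module is frozen here
        mu_target, _ = reference_encoder(mel, torch.cat([spk_emb, emo_emb], dim=-1), durations)
    mu_pred = predictor(phoneme_seq, spk_emb, emo_emb)      # third phoneme implicit representation
    loss = F.mse_loss(mu_pred, mu_target)                   # MSE against the extracted means
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```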
The present application further provides a speech synthesis method, as shown in fig. 7 and 8, which may include the following steps S301 to S303:
step S301: acquiring a target synthetic text, target pronunciation object embedding and target emotion embedding.
Optionally, in a specific application scenario, the target synthetic text may be text data input by the user or selected from pre-stored texts; the target pronunciation object may be a pronunciation object selected by the user from pre-stored pronunciation objects (for example, if the speech synthesis system includes three pronunciation objects A, B, C, the user may select any one of the three pronunciation objects as the target pronunciation object); the target emotion can be the emotion selected by the user according to the requirement of the user. In one embodiment, the target synthesized text, the target pronunciation object, and the target emotion may also be determined by the speech synthesis system according to default settings, data processing results, and the like. The embodiments of the present application do not limit this.
Step S302: when target voice data consistent with the target synthesized text and the target emotion is stored, acquiring, through a speech synthesis model, speech synthesis data of the target pronunciation object corresponding to the target emotion based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding, the target voice data, and the pre-stored pronunciation object embedding corresponding to the target voice data; the speech synthesis model is obtained by training with the training method corresponding to steps S101 to S105.
Specifically, step S302 may be applied to offline speech synthesis. In an offline speech synthesis scenario, a speech synthesis system generally pre-stores originally downloaded voice data (one pronunciation object usually includes several pieces of voice data; for example, when the speech synthesis system is configured, voice data corresponding to several pronunciation objects may be downloaded online and used as a sound library for speech synthesis when the system is offline). After the target synthesized text y and the target emotion e_t are determined based on step S301, it can be further determined whether target voice data consistent with the target synthesized text y and the target emotion e_t exists. If so, speech synthesis processing is performed based on the target voice data through the speech synthesis model to obtain the speech synthesis data of the target pronunciation object s_t with the target emotion e_t. Here the speech synthesis model includes at least the text encoding module, the text decoding module, and the first predictive coding module.
In the speech synthesis system, the target speech data corresponds to the text data and also has a binding relationship with the pronunciation object, that is, when step S302 of the speech synthesis method is executed, the input data of the first predictive coding module includes the target speech data, the pre-stored pronunciation object embedding corresponding to the target speech data, and the target emotion embedding.
Alternatively, the processing in step S302 may be understood as voice migration, that is, the target voice data is migrated according to the target pronunciation object and the target emotion, and the target emotion voice data corresponding to the target pronunciation object is obtained after migration (in the embodiment of the present application, the voice data is represented as a mel spectrum).
Step S303: when target voice data consistent with the target synthesized text and the target emotion do not exist, acquiring voice synthesized data of the target emotion corresponding to the target pronunciation object through a voice synthesis model based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding and any pre-stored pronunciation object embedding; the speech synthesis model is obtained by training the speech synthesis model corresponding to steps S201 to S203 by the training method.
Specifically, step S303 may be applied to online speech synthesis. In an online speech synthesis scenario, the user generally inputs a target synthesized text and selects a target emotion according to actual requirements, and at this time target voice data consistent with the target synthesized text and the target emotion generally does not exist in the speech synthesis system, so a speech synthesis model including the second predictive coding module is called to perform the speech synthesis processing. After the target synthesized text y and the target emotion e_t are determined based on step S301, it can be further determined whether target voice data consistent with the target synthesized text y and the target emotion e_t exists. If not, speech synthesis processing is performed based on the target synthesized text through the speech synthesis model to obtain the speech synthesis data of the target pronunciation object s_t with the target emotion e_t. Here the speech synthesis model includes at least the text encoding module, the text decoding module, and the second predictive coding module.
Alternatively, the processing in step S303 may be understood as speech synthesis, that is, speech synthesis is performed based on the target synthesized text, the target pronunciation object embedding, and the target emotion embedding, so as to obtain target emotion speech data corresponding to the target pronunciation object (in the embodiment of the present application, the speech data is represented as a mel spectrum).
In one embodiment, the step S302 of obtaining the speech synthesis data of the target emotion corresponding to the target pronunciation object based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding, the target speech data and the pre-stored pronunciation object embedding corresponding to the target speech data by the speech synthesis model includes the following steps S3021 to S3023:
step S3021: and inputting a first Mel spectrogram extracted based on the target voice data, and a pre-stored pronunciation object embedding and target emotion embedding corresponding to the first Mel spectrogram into a first predictive coding module to obtain a first phoneme implicit expression.
Optionally, the input data of the first predictive coding module may further comprise tone embedding.
Step S3022: and inputting a target text sequence corresponding to the target synthetic text into a text coding module to obtain a phoneme coding sequence.
Step S3023: and obtaining a second Mel spectrogram of the target pronunciation object corresponding to the target emotion through the text decoding module based on the phoneme coding sequence, the first phoneme implicit representation, the target pronunciation object embedding and the target emotion embedding.
Specifically, the Mel spectrogram x_t corresponding to the target speech data of the pre-stored pronunciation object, the pre-stored pronunciation object embedding, the target emotion embedding e_t and the tone sequence are used as the input data of the first predictive coding module (reference encoder); the first phoneme implicit representation extracted by the first predictive coding module and the phoneme coding sequence h_u obtained by the text coding module are used as the input data of the length adjustment module to obtain an extended phoneme coding sequence; and the extended phoneme coding sequence, the target pronunciation object embedding and the target emotion embedding are used as the input data of the text decoding module to obtain the speech synthesis data of the target emotion corresponding to the target pronunciation object. Optionally, the speech synthesis data predicted by the speech synthesis model is a Mel spectrogram.
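For illustration only, the data flow of step S302 can be summarized as a single inference function. The following Python sketch is not part of the original disclosure; the module interfaces (reference_encoder, text_encoder, length_regulator, decoder), tensor shapes and the use of concatenation are assumptions made for readability.

```python
import torch
from torch import nn, Tensor

@torch.no_grad()
def migrate_speech(
    reference_encoder: nn.Module,  # first predictive coding module (reference encoder)
    text_encoder: nn.Module,       # text coding module
    length_regulator: nn.Module,   # length adjustment module
    decoder: nn.Module,            # text decoding module
    mel_ref: Tensor,               # x_t: pre-stored target speech data, [T_mel, n_mels]
    phonemes: Tensor,              # target text sequence as phoneme ids, [T_ph]
    spk_prestored: Tensor,         # pre-stored pronunciation object embedding
    spk_target: Tensor,            # target pronunciation object embedding
    emo_target: Tensor,            # target emotion embedding e_t
) -> Tensor:
    # 1) phoneme-level prosody latent extracted from the reference Mel spectrogram
    z_phoneme = reference_encoder(mel_ref, spk_prestored, emo_target)  # [T_ph, d_z]
    # 2) phoneme coding sequence h_u from the target synthesized text
    h_u = text_encoder(phonemes)                                       # [T_ph, d_h]
    # 3) expand the combined sequence according to predicted phoneme durations
    expanded = length_regulator(torch.cat([h_u, z_phoneme], dim=-1),
                                spk_target, emo_target)
    # 4) decode to the Mel spectrogram of the target speaker under the target emotion
    return decoder(expanded, spk_target, emo_target)
```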
In one embodiment, in step S303, the obtaining of the speech synthesis data of the target emotion corresponding to the target pronunciation object through the speech synthesis model based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding and any pre-stored pronunciation object embedding includes the following steps S3031-S3033:
step S3031: and inputting a target text sequence corresponding to the target synthetic text, any pre-stored pronunciation object embedding and target emotion embedding into a second predictive coding module to obtain the implicit representation of the third phoneme.
Optionally, the input data of the second predictive coding module may further comprise tone embedding.
Step S3032: and inputting the target text sequence into a text coding module to obtain a phoneme coding sequence.
Step S3033: the text decoding module obtains a second Meier spectrogram of the target emotion corresponding to the target pronunciation object based on the phoneme coding sequence, the third phoneme implicit representation, the target pronunciation object embedding and the target emotion embedding.
Specifically, any pre-stored pronunciation object embedding (when the target pronunciation object is a pronunciation object having only neutral-emotion speech data, the target pronunciation object is different from this pre-stored pronunciation object), the target emotion embedding and the text sequence corresponding to the target synthesized text are used as the input data of the second predictive coding module to obtain the third phoneme implicit representation corresponding to each phoneme. The third phoneme implicit representation and the phoneme coding sequence h_u obtained by the text encoder are then used as the input data of the length adjustment module to obtain an extended phoneme coding sequence, and the extended phoneme coding sequence, the target pronunciation object embedding and the target emotion embedding are used as the input data of the decoder to obtain the speech synthesis data of the target emotion corresponding to the target pronunciation object. Optionally, the speech synthesis data predicted by the speech synthesis model is a Mel spectrogram.
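For comparison, a similarly hedged sketch of the step S303 path is given below; the only structural difference from the previous sketch is that the second predictive coding module predicts the phoneme-level latent from the text sequence and the embeddings rather than from a reference Mel spectrogram. Module interfaces are again assumptions.

```python
import torch
from torch import nn, Tensor

@torch.no_grad()
def synthesize_from_text(
    latent_predictor: nn.Module,  # second predictive coding module
    text_encoder: nn.Module,
    length_regulator: nn.Module,
    decoder: nn.Module,
    phonemes: Tensor,             # target text sequence as phoneme ids
    spk_prestored: Tensor,        # any pre-stored pronunciation object embedding
    spk_target: Tensor,           # target pronunciation object embedding
    emo_target: Tensor,           # target emotion embedding
) -> Tensor:
    # third phoneme implicit representation predicted directly from text
    z_phoneme = latent_predictor(phonemes, spk_prestored, emo_target)
    h_u = text_encoder(phonemes)
    expanded = length_regulator(torch.cat([h_u, z_phoneme], dim=-1),
                                spk_target, emo_target)
    return decoder(expanded, spk_target, emo_target)
```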
Optionally, in the speech synthesis method provided in this embodiment of the present application, the pre-stored pronunciation object may include several pronunciation objects in a training data set used in training a speech synthesis model, a pronunciation object including only speech data with neutral emotion may be regarded as a neutral pronunciation object, and a pronunciation object including speech data with all predefined emotion categories may be regarded as an emotion pronunciation object.
In one embodiment, the target text sequence corresponding to the target synthesized text can be obtained by performing the following steps: and performing front-end processing of voice synthesis on the basis of the target synthesized text to obtain a target text sequence.
Specifically, this can be understood as the front-end processing performed before speech synthesis is carried out with the speech synthesis model provided in the present application; that is, the result of the speech synthesis front-end processing is used as the input data of the text coding module of the speech synthesis model in the embodiment of the present application. Optionally, the target text sequence is a target phoneme sequence.
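As a hedged illustration of such front-end processing for Chinese text (the patent does not name a specific tool), a minimal grapheme-to-phoneme step could look like the following; pypinyin and the tone-numbered output format are assumptions for this sketch, and a production front-end would also handle text normalization, polyphone disambiguation and prosodic boundary prediction.

```python
from typing import List

from pypinyin import Style, lazy_pinyin

def text_to_phoneme_sequence(text: str) -> List[str]:
    """Toy front-end: convert raw Chinese text into tone-numbered syllables."""
    return lazy_pinyin(text, style=Style.TONE3)

# Example (illustrative): text_to_phoneme_sequence("小白兔白又白")
# -> ['xiao3', 'bai2', 'tu4', 'bai2', 'you4', 'bai2']
```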
Based on the speech synthesis model of the above embodiment, a post-processing network (vocoder) may be further added in the present application to convert the Mel spectrogram into a linear spectrogram, and the Griffin-Lim algorithm is then used to reconstruct audio data for output.
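A minimal sketch of this post-processing step using librosa is shown below; the sampling rate, FFT size and the assumption that the predicted Mel spectrogram is a magnitude-scale matrix of shape [n_mels, T] are illustrative choices, not values specified by the patent.

```python
import numpy as np
import librosa

def mel_to_audio(mel: np.ndarray, sr: int = 22050,
                 n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Approximate waveform reconstruction: Mel -> linear spectrogram -> Griffin-Lim."""
    # invert the Mel filterbank to recover a linear-frequency magnitude spectrogram
    linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft, power=1.0)
    # iterative phase reconstruction with the Griffin-Lim algorithm
    return librosa.griffinlim(linear, n_iter=60, hop_length=hop_length,
                              win_length=n_fft)
```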
In the embodiment of the present application, a speech synthesis model covering multiple pronunciation objects is constructed based on emotion pronunciation objects and neutral pronunciation objects; the model uses a fine-grained VAE (reference encoder) to model prosodic variation at the phoneme level and uses a sentence-level representation to model global prosodic variation. Further, a second predictive coding module (phoneme latent prediction encoder) is trained to predict the phoneme implicit representation directly from the text sequence, instead of relying on the first predictive coding module.
The voice synthesis method provided by the embodiment of the application can be applied to various fields, such as vehicle-mounted voice navigation, voice assistants of intelligent terminals, robot question answering, vision-impaired reading and the like. A feasible application example is given below for the speech synthesis method provided by the embodiment of the present application:
Assume that the user wishes to use an audiobook app in a no-network or weak-network environment. On the display interface of the audiobook app, corresponding speech data is pre-stored for different reading contents or different speakers (pre-stored pronunciation objects, generally at least two, depending on the offline sound library configuration of the app). The user can select a target speaker and a target emotion according to his or her own needs. Suppose the currently selectable speakers include Reader 1, Reader 2 and Reader 3, and the selectable emotions include neutral, happy, sad, angry and the like. If the user selects Reader 1 as the target speaker for the content A to be read and selects happy as the target emotion, and if target speech data corresponding to a pre-stored pronunciation object exists for the content A and the target emotion, speech migration can be performed through the speech synthesis model provided by the application: speech migration is performed based on the target speech data to obtain the speech synthesis data of the target emotion corresponding to the target speaker, and the speech synthesis data is output. The user then hears the reading content A output by the terminal, read by the target speaker with the happy emotion. Specifically, the Mel spectrogram corresponding to the target speech data is the input data of the first predictive coding module in the speech synthesis model, and the text sequence corresponding to the content A to be read is the input data of the text coding module.
Assume that the current network state is good (online). When the user wants to obtain speech data of a target speaker (such as Girl 1) under a target emotion (happy) for an input target synthesized text "Little white rabbit, white and white, with two ears standing up", speech synthesis can be performed through the speech synthesis model provided by the embodiment of the present application: speech synthesis is performed based on the target synthesized text to obtain the speech synthesis data of the target emotion corresponding to the target speaker, and the speech synthesis data is output. The user then hears the target synthesized text output by the terminal, read by Girl 1 with the happy emotion. Specifically, the text sequence corresponding to the target synthesized text is the input data of the second predictive coding module in the speech synthesis model, and it is also the input data of the text coding module. Optionally, the target synthesized text is input by the user, the target speaker may be any one of the speakers currently supported by the system, and the target emotion may be any one of the emotion types currently supported by the system.
Optionally, the execution device of the speech synthesis method provided in the embodiment of the present application may be the electronic device provided in the present application, and the electronic device may be a terminal or a server. When the method is executed on a terminal, the data acquisition in step S301 may be understood as a process of the user inputting data, and when the speech synthesis data is output in step S302 or step S303, the terminal may play audio directly based on the speech synthesis data. When the method is executed on a server, the data acquisition in step S301 may be understood as a process in which the terminal transmits data to the server; after the speech synthesis data is output in step S302 or step S303, the server sends the data to the terminal, and the terminal plays audio based on the received speech synthesis data.
To better show the effect of the speech synthesis model provided in the embodiment of the present application in performing emotion speech synthesis, the following experimental data shown in table 2 are given:
TABLE 2
(Table 2 is provided as an image in the original publication; it contains MOS scores comparing the proposed method with Conventional 1 and Conventional 2, summarized in the following paragraphs.)
In the data shown in Table 2, Conventional 1 represents the experimental data corresponding to speech synthesis processing performed with the method that uses emotion embedding as the emotion representation (BASE-EMB); Conventional 2 represents the experimental data corresponding to speech synthesis processing based on the method that uses a weighted sum of trained GSTs as different emotion representations (BASE-GST).
The data shown in Table 2 are Mean Opinion Scores (MOS, with 5 as the highest score) for naturalness, similarity to the pronunciation object, and emotional expressiveness, evaluated for both emotion pronunciation objects and pronunciation objects having only neutral emotion, in both scenarios of the speech synthesis process (speech migration and speech synthesis). Specifically, it can be seen from Table 2 that: (1) whether in speech migration or speech synthesis, the naturalness and emotional expressiveness of the speech obtained by the present application are significantly superior to those of the prior art; (2) when the speech synthesis model is used for speech synthesis, whether the data of a neutral pronunciation object or of an emotion pronunciation object is processed, the naturalness and emotional expressiveness of the obtained speech are superior to those of the prior art.
Corresponding to the training method provided by the application, the embodiment of the application also provides a training apparatus for the speech synthesis model, wherein the speech synthesis model comprises a text coding module, a text decoding module and a first predictive coding module; as shown in fig. 9, the training apparatus includes: a training data acquisition module 901, a text coding module 902, a phoneme coding module 903, a text decoding module 904 and an update module 905.
A training data obtaining module 901, configured to obtain a training data set; wherein the training data set includes speech data of the pronunciation object and text data corresponding to the speech data.
And a text coding module 902, configured to obtain, by the text coding module, a phoneme coding sequence based on the text data.
A phoneme coding module 903, configured to obtain, by the first predictive coding module, a first phoneme implicit representation based on the speech data.
And a text decoding module 904, configured to obtain predicted speech synthesis data based on the phoneme coding sequence, the first phoneme implicit representation, and the pronunciation object embedding and emotion embedding through the text decoding module.
An update module 905 for updating the speech synthesis model based on the speech data and the predicted speech synthesis data.
Optionally, the speech synthesis model further comprises a first classification module based on the pronunciation object and a gradient inversion layer connecting the first classification module and the first predictive coding module; the training device 900 further comprises:
and the first countermeasure module is used for inputting the first phoneme implicit representation into the first classification module to obtain a first classification result.
The update module 905, includes the following elements:
a first calculation unit for calculating a first loss function based on the speech data and the predicted speech synthesis data.
And the second calculation unit is used for calculating a second loss function based on the first classification result and the cross entropy loss of the prior distribution corresponding to the first classification result.
And the gradient inversion unit is used for processing the second loss function through the gradient inversion layer to obtain the processed second loss function.
And the updating unit is used for updating the speech synthesis model based on the first loss function and the processed second loss function.
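To make the role of the gradient inversion layer concrete, a common PyTorch-style sketch is given below. It is an assumption rather than the patent's implementation: the forward pass is the identity, the backward pass negates (and scales) the gradient flowing from the classifier back into the first predictive coding module, and class labels are used here as an illustrative stand-in for the prior distribution described above.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in backward."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, lambd: float) -> torch.Tensor:
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return -ctx.lambd * grad_output, None

def adversarial_classification_loss(z_phoneme: torch.Tensor,
                                    classifier: torch.nn.Module,
                                    labels: torch.Tensor,
                                    lambd: float = 1.0) -> torch.Tensor:
    # the classifier tries to recover the label from the phoneme latent, while the
    # reversed gradient pushes the latent to discard that information
    logits = classifier(GradReverse.apply(z_phoneme, lambd))  # [N, n_classes]
    return F.cross_entropy(logits, labels)                    # labels: [N] long
```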
Optionally, the speech synthesis model further comprises a second classification module based on pitch; the gradient inversion layer is also connected with the second classification module and the first predictive coding module; the training device 900 further comprises:
and the second countermeasure module is used for implicitly representing the first phoneme and inputting the first phoneme into the second classification module to obtain a second classification result.
The second computing unit is further configured to: calculating first cross entropy loss of prior distribution corresponding to the first classification result and the first classification result, and calculating second cross entropy loss of prior distribution corresponding to the second classification result and the second classification result; and obtaining a second loss function based on the sum of the first cross entropy loss and the second cross entropy loss.
The first predictive coding module is composed of a variational autoencoder.
Optionally, the text encoding module 902 is further configured to obtain a text sequence through front-end processing of speech synthesis based on the text data; and inputting the text sequence into a text coding module to obtain a phoneme coding sequence.
Optionally, the phoneme coding module 903 is further configured to input a first mel spectrogram extracted based on the speech data and a pronunciation object embedding and emotion embedding corresponding to the first mel spectrogram into the first predictive coding module, so as to obtain a first phoneme implicit representation.
Optionally, the speech synthesis model further comprises a length adjustment module; the text decoding module 904 is further configured to: splicing the first phoneme implicit representation and the phoneme coding sequence, inputting the spliced phoneme coding sequence into a length adjusting module, and obtaining a phoneme coding sequence which is expanded based on the duration of each phoneme; and splicing the expanded phoneme coding sequence with the pronunciation object embedding and emotion embedding corresponding to the first Mel spectrogram, and inputting the result into a text decoding module to obtain a predicted second Mel spectrogram.
Optionally, the length adjustment module includes a duration prediction unit, and the text decoding module 904 is further configured to: determine, by the duration prediction unit, a predicted duration of each phoneme in the phoneme coding sequence based on the pronunciation object embedding and emotion embedding corresponding to the first Mel spectrogram; determine a target duration of each phoneme in the phoneme coding sequence based on the first Mel spectrogram; and update the duration prediction unit based on the predicted duration and the target duration.
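A hedged sketch of the length adjustment and duration supervision described above is given below for a single (unbatched) sequence; the log-domain MSE is an illustrative choice rather than the patent's stated formula.

```python
import torch
import torch.nn.functional as F

def expand_by_duration(phoneme_enc: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding durations[i] times along the time axis.

    phoneme_enc: [T_ph, d]; durations: [T_ph] integer frame counts (long tensor).
    """
    return torch.repeat_interleave(phoneme_enc, durations, dim=0)

def duration_loss(pred_dur: torch.Tensor, target_dur: torch.Tensor) -> torch.Tensor:
    # compare durations in the log domain so long phonemes do not dominate the loss
    return F.mse_loss(torch.log(pred_dur.float() + 1.0),
                      torch.log(target_dur.float() + 1.0))
```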
Optionally, the update module 905 is further configured to: determine a reconstruction error of the Mel spectrogram based on the first Mel spectrogram and the second Mel spectrogram; determine a relative entropy loss based on the first phoneme implicit representation and the prior distribution corresponding to the first phoneme implicit representation; and calculate the value of the first loss function based on the reconstruction error and the relative entropy loss.
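Under the usual VAE formulation (assuming the first predictive coding module outputs a diagonal Gaussian posterior with a standard normal prior, an L1 reconstruction term and a fixed KL weight, none of which are fixed by the patent), the first loss function could be sketched as:

```python
import torch
import torch.nn.functional as F

def first_loss(mel_pred: torch.Tensor, mel_target: torch.Tensor,
               mu: torch.Tensor, logvar: torch.Tensor,
               kl_weight: float = 1e-2) -> torch.Tensor:
    # reconstruction error between the predicted and ground-truth Mel spectrograms
    recon = F.l1_loss(mel_pred, mel_target)
    # relative entropy (KL divergence) between N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```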
Optionally, the speech synthesis model further includes a second predictive coding module, and the training apparatus further includes:
and the target output module is used for obtaining the mean value of the implicit representation of the second phoneme based on the voice data through the first predictive coding module in the trained voice synthesis model.
The phoneme implicit prediction module is used for inputting a text sequence obtained by the front-end processing of the text data through the speech synthesis into the second predictive coding module to obtain a third phoneme implicit representation;
and the phoneme coding updating module is used for updating the second prediction coding module based on the mean value of the second phoneme implicit representation and the third phoneme implicit representation.
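A minimal sketch of this update step is shown below: the posterior mean produced by the frozen first predictive coding module serves as the regression target for the text-based predictor. The use of an MSE objective and the function signature are assumptions.

```python
import torch
import torch.nn.functional as F

def latent_predictor_loss(latent_predictor: torch.nn.Module,
                          phonemes: torch.Tensor,
                          spk_emb: torch.Tensor,
                          emo_emb: torch.Tensor,
                          posterior_mean: torch.Tensor) -> torch.Tensor:
    # third phoneme implicit representation predicted from text only
    z_pred = latent_predictor(phonemes, spk_emb, emo_emb)
    # regress onto the mean of the second phoneme implicit representation,
    # detached so the already-trained first module is not updated
    return F.mse_loss(z_pred, posterior_mean.detach())
```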
Corresponding to the speech synthesis method provided by the present application, an embodiment of the present application further provides a speech synthesis apparatus, as shown in fig. 10, including a target data obtaining module 101, a first speech synthesis module 102, and a second speech synthesis module 103.
The target data acquisition module 101 is used for acquiring a target synthetic text, target pronunciation object embedding and target emotion embedding;
the first voice synthesis module 102 is configured to, when target voice data consistent with a target synthesized text and a target emotion exists, obtain voice synthesis data of the target emotion corresponding to a target pronunciation object by a voice synthesis model based on the target synthesized text, target pronunciation object embedding, target emotion embedding, target voice data and pre-stored pronunciation object embedding corresponding to the target voice data; wherein, the speech synthesis model is obtained by training through the training method of the speech synthesis model corresponding to the steps S101-S105;
the second speech synthesis module 103 is configured to, when there is no target speech data consistent with the target synthesized text and the target emotion, obtain speech synthesis data of the target emotion corresponding to the target pronunciation object based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding, and any pre-stored pronunciation object embedding through the speech synthesis model; wherein, the speech synthesis model is obtained by training through the training method of the speech synthesis model corresponding to the steps S201-203.
Optionally, when the first speech synthesis module 102 is configured to perform the step of obtaining speech synthesis data of the target emotion corresponding to the target pronunciation object based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding, the target speech data and the pre-stored pronunciation object embedding corresponding to the target speech data through the speech synthesis model, the following steps are further performed: inputting a first Mel spectrogram extracted based on the target speech data, and the pre-stored pronunciation object embedding and target emotion embedding corresponding to the first Mel spectrogram, into the first predictive coding module to obtain a first phoneme implicit representation; inputting a target text sequence corresponding to the target synthesized text into the text coding module to obtain a phoneme coding sequence; and obtaining a second Mel spectrogram of the target pronunciation object corresponding to the target emotion through the text decoding module based on the phoneme coding sequence, the first phoneme implicit representation, the target pronunciation object embedding and the target emotion embedding.
Optionally, when the second speech synthesis module 103 is configured to perform the step of obtaining speech synthesis data of the target emotion corresponding to the target pronunciation object based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding and any pre-stored pronunciation object embedding through the speech synthesis model, the following steps are further performed: inputting a target text sequence corresponding to the target synthesized text, any pre-stored pronunciation object embedding and the target emotion embedding into the second predictive coding module to obtain a third phoneme implicit representation; inputting the target text sequence into the text coding module to obtain a phoneme coding sequence; and obtaining a second Mel spectrogram of the target pronunciation object corresponding to the target emotion through the text decoding module based on the phoneme coding sequence, the third phoneme implicit representation, the target pronunciation object embedding and the target emotion embedding.
The apparatus according to the embodiment of the present application may execute the method provided by the embodiment of the present application, and the implementation principle is similar, the actions executed by the modules in the apparatus according to the embodiments of the present application correspond to the steps in the method according to the embodiments of the present application, and for the detailed functional description of the modules in the apparatus, reference may be specifically made to the description in the corresponding method shown in the foregoing, and details are not repeated here.
The present application further provides an electronic device comprising a memory and a processor; wherein the memory has stored therein a computer program; the processor is adapted to perform the method provided in any of the alternative embodiments of the present application when running the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method provided in any of the alternative embodiments of the present application.
As an alternative, fig. 11 shows a schematic structural diagram of an electronic device to which the embodiment of the present application is applicable, and as shown in fig. 11, the electronic device 1100 may include a processor 1101 and a memory 1103. The processor 1101 is coupled to the memory 1103, such as by a bus 1102. Optionally, the electronic device 1100 may also include a transceiver 1104. It should be noted that the transceiver 1104 is not limited to one in practical applications, and the structure of the electronic device 1100 is not limited to the embodiment of the present application.
The Processor 1101 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules and circuits described in connection with the disclosure. The Processor 1101 may also be a combination of computing functions, e.g., a combination of one or more microprocessors, or a combination of a DSP and a microprocessor, and the like.
Bus 1102 may include a path that transfers information between the above components. The bus 1102 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1102 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 11, but this is not intended to represent only one bus or type of bus.
The Memory 1103 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 1103 is used for storing application program codes for executing the present application, and the execution is controlled by the processor 1101. The processor 1101 is configured to execute application program code (computer program) stored in the memory 1103 to implement the content shown in any one of the foregoing method embodiments.
In the embodiments provided herein, the above speech synthesis method performed by the electronic device can be performed using an artificial intelligence model.
According to an embodiment of the application, the method performed in the electronic device may obtain output data identifying an image or image content features in the image by using the image data or video data as input data for an artificial intelligence model. The artificial intelligence model may be obtained through training. Here, "obtained by training" means that a basic artificial intelligence model is trained with a plurality of pieces of training data by a training algorithm to obtain a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose). The artificial intelligence model can include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the neural network calculation is performed by a calculation between a calculation result of a previous layer and the plurality of weight values.
Visual understanding is a technique for recognizing and processing things like human vision, and includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
In embodiments provided herein, at least one of the plurality of modules may be implemented by an AI model. The functions associated with the AI may be performed by the non-volatile memory, the volatile memory, and the processor.
The processor may include one or more processors. At this time, the one or more processors may be general-purpose processors (e.g., a Central Processing Unit (CPU), an Application Processor (AP), etc.), graphics-only processors (e.g., a Graphics Processing Unit (GPU) or a Vision Processing Unit (VPU)), and/or AI-specific processors (e.g., a Neural Processing Unit (NPU)).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operating rules or artificial intelligence models are provided through training or learning.
Here, the provision by learning means that a predefined operation rule or an AI model having a desired characteristic is obtained by applying a learning algorithm to a plurality of learning data. This learning may be performed in the device itself in which the AI according to the embodiment is performed, and/or may be implemented by a separate server/system.
The AI model may be comprised of layers including multiple neural networks. Each layer has a plurality of weight values, and the calculation of one layer is performed by the calculation result of the previous layer and the plurality of weights of the current layer. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and deep Q networks.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to make, allow, or control the target device to make a determination or prediction. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps need not be performed in the exact order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a part of the embodiments of the present invention, and it should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (15)

1. A training method of a speech synthesis model, characterized in that the speech synthesis model comprises a text coding module, a text decoding module and a first predictive coding module; the method comprises the following steps:
acquiring a training data set; the training data set comprises voice data of a pronunciation object and text data corresponding to the voice data;
obtaining a phoneme coding sequence based on the text data through the text coding module;
obtaining, by the first predictive coding module, a first phoneme implicit representation based on the speech data;
obtaining predicted speech synthesis data based on the phoneme coding sequence, the first phoneme implicit representation, and the embedding of the pronunciation object and the embedding of the emotion through the text decoding module;
updating the speech synthesis model based on the speech data and predicted speech synthesis data.
2. The method of claim 1, wherein the speech synthesis model further comprises a first classification module based on pronunciation objects, and a gradient inversion layer connecting the first classification module and the first predictive coding module; the method further comprises the following steps:
inputting the first phoneme implicit representation into the first classification module to obtain a first classification result;
said updating the speech synthesis model based on the speech data and predicted speech synthesis data comprises:
calculating a first loss function based on the speech data and predicted speech synthesis data;
calculating a second loss function based on the first classification result and the cross entropy loss of the prior distribution corresponding to the first classification result;
processing the second loss function through the gradient inversion layer to obtain a processed second loss function;
updating the speech synthesis model based on the first loss function and the processed second loss function.
3. The method of claim 2, wherein the speech synthesis model further comprises a second classification module based on pitch; the gradient inversion layer is also connected with the second classification module and the first prediction coding module; the method further comprises the following steps:
inputting the first phoneme implicit representation into the second classification module to obtain a second classification result;
the calculating a second loss function based on the first classification result and the cross entropy loss of the prior distribution corresponding to the first classification result comprises:
calculating first cross entropy loss of prior distribution corresponding to the first classification result and the first classification result, and calculating second cross entropy loss of prior distribution corresponding to the second classification result and the second classification result;
and obtaining a second loss function based on the sum of the first cross entropy loss and the second cross entropy loss.
4. The method of claim 1, wherein the first predictive coding module is composed of a variational autoencoder.
5. The method of claim 1, wherein said deriving, by said first predictive coding module, a first phoneme implicit representation based on said speech data comprises:
and inputting a first Mel spectrogram extracted based on the voice data, and pronunciation object embedding and emotion embedding corresponding to the first Mel spectrogram into the first predictive coding module to obtain a first phoneme implicit expression.
6. The method of claim 5, wherein the speech synthesis model further comprises a length adjustment module; the obtaining, by the text decoding module, predicted speech synthesis data based on the phoneme coding sequence, the first phoneme implicit representation, and the pronunciation object embedding and emotion embedding, includes:
splicing the first phoneme implicit representation and the phoneme coding sequence, inputting the spliced phoneme coding sequence into the length adjusting module, and obtaining a phoneme coding sequence which is expanded based on the duration of each phoneme;
and splicing the expanded phoneme coding sequence with the pronunciation object embedding and emotion embedding corresponding to the first Mel spectrogram, and inputting the result into the text decoding module to obtain a predicted second Mel spectrogram.
7. The method of claim 6, wherein the length adjustment module comprises a duration prediction unit, and the step of inputting the spliced phoneme coding sequence into the length adjustment module to obtain the phoneme coding sequence expanded based on the duration of each phoneme comprises:
determining, by the duration prediction unit, a predicted duration of each phoneme in the phoneme coding sequence based on the pronunciation object embedding and emotion embedding corresponding to the first mel spectrogram;
determining a target duration for each phoneme in the phoneme encoding sequence based on the first mel-spectrum;
updating the duration prediction unit based on the prediction duration and a target duration.
8. The method of claim 6, wherein updating the speech synthesis model based on the speech data and predicted speech synthesis data comprises:
determining a reconstruction error of the mel spectrogram based on the first mel spectrogram and the second mel spectrogram;
determining a relative entropy loss based on the first phoneme implicit representation and a prior distribution corresponding to the first phoneme implicit representation;
a first loss function is calculated based on the reconstruction error and the relative entropy loss, and the speech synthesis model is updated based on the first loss function.
9. The method according to any of claims 1-8, wherein the speech synthesis model further comprises a second predictive coding module; the method further comprises the following steps:
obtaining a mean value of the implicit representation of the second phoneme based on the voice data through a first predictive coding module in the trained voice synthesis model;
inputting a text sequence obtained by performing front-end processing on the text data through speech synthesis into the second predictive coding module to obtain a third phoneme implicit expression;
updating the second predictive coding module based on the mean of the second phoneme implicit representation and the third phoneme implicit representation.
10. A method of speech synthesis, comprising:
acquiring a target synthetic text, target pronunciation object embedding and target emotion embedding;
when target voice data consistent with a target synthesized text and a target emotion are stored, acquiring voice synthesized data of the target pronunciation object corresponding to the target emotion through a voice synthesis model based on the target synthesized text, target pronunciation object embedding, target emotion embedding, target voice data and prestored pronunciation object embedding corresponding to the target voice data; wherein the speech synthesis model is obtained by training through the training method of the speech synthesis model according to any one of claims 1-8;
when target voice data consistent with the target synthesized text and the target emotion do not exist, acquiring voice synthesized data of the target pronunciation object corresponding to the target emotion through a voice synthesis model based on the target synthesized text, target pronunciation object embedding, target emotion embedding and any pre-stored pronunciation object embedding; wherein the speech synthesis model is trained by the training method of the speech synthesis model according to claim 9.
11. The method of claim 10,
the obtaining of the speech synthesis data of the target pronunciation object corresponding to the target emotion through the speech synthesis model based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding, the target speech data and the pre-stored pronunciation object embedding corresponding to the target speech data includes:
inputting a first Mel spectrogram extracted based on the target voice data, and pre-stored pronunciation object embedding and target emotion embedding corresponding to the first Mel spectrogram into the first predictive coding module to obtain a first phoneme implicit expression;
inputting a target text sequence corresponding to the target synthetic text into the text coding module to obtain a phoneme coding sequence;
obtaining, through the text decoding module, a second Mel spectrogram of the target pronunciation object corresponding to the target emotion based on the phoneme coding sequence, the first phoneme implicit representation, the target pronunciation object embedding and the target emotion embedding;
the obtaining of the speech synthesis data of the target pronunciation object corresponding to the target emotion through the speech synthesis model based on the target synthesized text, the target pronunciation object embedding, the target emotion embedding and any pre-stored pronunciation object embedding comprises:
inputting a target text sequence corresponding to the target synthetic text, any pre-stored pronunciation object embedding and target emotion embedding into the second predictive coding module to obtain a third phoneme implicit expression;
inputting the target text sequence into the text coding module to obtain a phoneme coding sequence;
and obtaining, through the text decoding module, a second Mel spectrogram of the target pronunciation object corresponding to the target emotion based on the phoneme coding sequence, the third phoneme implicit representation, the target pronunciation object embedding and the target emotion embedding.
12. A training apparatus of a speech synthesis model, characterized in that the speech synthesis model comprises a text coding module, a text decoding module and a first predictive coding module; the training apparatus includes:
the training data acquisition module is used for acquiring a training data set; the training data set comprises voice data of a pronunciation object and text data corresponding to the voice data;
the text coding module is used for obtaining a phoneme coding sequence based on the text data through the text coding module;
a phoneme coding module, configured to obtain, by the first predictive coding module, a first phoneme implicit representation based on the speech data;
the text decoding module is used for obtaining predicted speech synthesis data based on the phoneme coding sequence, the first phoneme implicit representation, the pronunciation object embedding and the emotion embedding;
an update module to update the speech synthesis model based on the speech data and predicted speech synthesis data.
13. A speech synthesis apparatus, comprising:
the target data acquisition module is used for acquiring a target synthetic text, target pronunciation object embedding and target emotion embedding;
the first voice synthesis module is used for obtaining voice synthesis data of a target pronunciation object corresponding to a target emotion through a voice synthesis model based on the target synthesized text, target pronunciation object embedding, target emotion embedding, target voice data and prestored pronunciation object embedding corresponding to the target voice data when the target voice data consistent with the target synthesized text and the target emotion exists; wherein the speech synthesis model is obtained by training through the training method of the speech synthesis model according to any one of claims 1-8;
the second voice synthesis module is used for obtaining voice synthesis data of the target pronunciation object corresponding to the target emotion through a voice synthesis model based on the target synthesized text, target pronunciation object embedding, target emotion embedding and any pre-stored pronunciation object embedding when target voice data consistent with the target synthesized text and the target emotion does not exist; wherein the speech synthesis model is trained by the training method of the speech synthesis model according to claim 9.
14. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor, when running the computer program, is configured to perform the method of any one of claims 1 to 9, or to perform the method of any one of claims 10 to 11.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 9 or carries out the method of any one of claims 10 to 11.
CN202011128918.6A 2020-10-20 2020-10-20 Training method of speech synthesis model and speech synthesis method Pending CN114387946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011128918.6A CN114387946A (en) 2020-10-20 2020-10-20 Training method of speech synthesis model and speech synthesis method


Publications (1)

Publication Number Publication Date
CN114387946A true CN114387946A (en) 2022-04-22

Family

ID=81194026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011128918.6A Pending CN114387946A (en) 2020-10-20 2020-10-20 Training method of speech synthesis model and speech synthesis method

Country Status (1)

Country Link
CN (1) CN114387946A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115064148A (en) * 2022-05-09 2022-09-16 北京百度网讯科技有限公司 Method and device for training speech synthesis and speech synthesis model
CN115457931A (en) * 2022-11-04 2022-12-09 之江实验室 Speech synthesis method, device, equipment and storage medium
CN117292672A (en) * 2023-11-27 2023-12-26 厦门大学 High-quality speech synthesis method based on correction flow model
CN117292672B (en) * 2023-11-27 2024-01-30 厦门大学 High-quality speech synthesis method based on correction flow model
CN118116363A (en) * 2024-04-26 2024-05-31 厦门蝉羽网络科技有限公司 Speech synthesis method based on time perception position coding and model training method thereof
CN118197277A (en) * 2024-05-15 2024-06-14 国家超级计算天津中心 Speech synthesis method, device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20220422