CN113035169B - Voice synthesis method and system capable of training personalized tone library on line - Google Patents


Info

Publication number
CN113035169B
Authority
CN
China
Prior art keywords
training
synthesis model
speech synthesis
speech
tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110271444.9A
Other languages
Chinese (zh)
Other versions
CN113035169A (en)
Inventor
牛歌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dipai Intelligent Technology Co ltd
Original Assignee
Beijing Dipai Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dipai Intelligent Technology Co ltd filed Critical Beijing Dipai Intelligent Technology Co ltd
Priority to CN202110271444.9A priority Critical patent/CN113035169B/en
Publication of CN113035169A publication Critical patent/CN113035169A/en
Application granted granted Critical
Publication of CN113035169B publication Critical patent/CN113035169B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application provides a speech synthesis method and system capable of training a personalized tone library online. The method comprises the following steps: training a pre-training speech synthesis model by using at least two groups of linguistic data, wherein each group of linguistic data comprises a text and its recorded speech, the recorded speech of each group has one tone, and the tones of the recorded speech in different groups are different; training a speech synthesis model based on the pre-training speech synthesis model by using the corpus of a target speaker, wherein the corpus of the target speaker comprises at least one sentence of text and the recorded speech of the target speaker; and deploying the speech synthesis model in a speech synthesis system, so that the speech synthesis system synthesizes speech of a target timbre from input text, the target timbre being the timbre of the target speaker. With the technical scheme of the application, a speech synthesis model can be trained rapidly and accurately using only a small amount of linguistic data of the target speaker, and speech consistent with the timbre of the target speaker can be accurately synthesized by the model.

Description

Voice synthesis method and system capable of training personalized tone library on line
Technical Field
The application relates to the technical field of natural language processing, in particular to a voice synthesis method and system capable of training a personalized tone library on line.
Background
Based on speech synthesis technology, a user (human) and a machine (such as a robot, a mobile phone, a smart speaker and the like) can realize human-computer interaction functions such as voice conversation. When the user speaks to the machine, the machine replies to the user with preset dialog text. Generally, the preset dialog text may include fixed text and variable text. The fixed text refers to text that does not change in any situation, and the variable text refers to text that changes in real time according to the specific situation.
In current human-computer interaction scenarios, the fixed text uses speech recorded by the recording speaker, the variable text is synthesized by speech synthesis technology into speech with the same tone as that recorded speech (hereinafter referred to as same-tone speech), and the recorded speech is then spliced with the same-tone speech to obtain the speech content replied to the user.
At present, same-tone speech is mainly generated by a speech synthesis model. However, because training such models is costly, current speech synthesis models generally suffer from deviation between the synthesized timbre and the target timbre.
Disclosure of Invention
The embodiment of the application provides a voice synthesis method and a voice synthesis system capable of training a personalized tone library online. Using only a small amount of linguistic data of a target speaker, a voice synthesis model can be trained rapidly and accurately, and speech consistent with the timbre of the target speaker can be accurately synthesized by the model.
In a first aspect, an embodiment of the present application provides a speech synthesis method capable of training a personalized tone library on line, including: training a pre-training speech synthesis model by using at least two groups of linguistic data, wherein each group of linguistic data comprises a text and a recorded speech thereof, the recorded speech of each group of linguistic data has a tone, and the tones of the recorded speech in different groups of linguistic data are different; training a voice synthesis model based on a pre-training voice synthesis model by using a corpus of a target speaker, wherein the corpus of the target speaker comprises at least one sentence of text and voice of the at least one sentence of text recorded by the pronunciation of the target speaker; deploying the voice synthesis model in a voice synthesis system, so that the voice synthesis system is used for synthesizing voice with a target tone according to the input text, wherein the target tone is the tone of a target speaker; wherein the pre-trained speech synthesis model and the speech synthesis model have the same model structure.
In one implementation, the speech synthesis model includes, in order from input to output: the system comprises a first word embedding layer, an encoder, a repeated coding layer, a decoder and a post-processing network; the first word embedding layer, the encoder, the repeated coding layer, the decoder and the post-processing network are coupled in sequence to form a data stream; the speech synthesis model further comprises a pronunciation unit embedding layer, an output of which is coupled to the data stream.
In one implementation, the output of the pronunciation unit embedding layer is coupled into the data stream in any one or more of the following ways: the output of the pronunciation unit embedding layer is coupled to the input of the encoder; the output of the pronunciation unit embedding layer is coupled to the input of the decoder; the output of the pronunciation unit embedding layer is coupled to the input of the post-processing network.
In one implementation, training a pre-trained speech synthesis model using at least two sets of corpora includes: and training a pre-training speech synthesis model by taking texts in at least two groups of linguistic data as input signals of a first word embedding layer and taking frequency spectrum signals corresponding to recorded speech as supervision signals output by a post-processing network.
In one implementation, training a pre-trained speech synthesis model using at least two sets of corpora further includes: taking the tone mark corresponding to each group of linguistic data as an input signal of the pronunciation unit embedding layer, wherein the tone marks corresponding to different groups of linguistic data are different; and reserving at least one tone mark as the tone mark of the target speaker.
In one implementation, training a speech synthesis model based on a pre-trained speech synthesis model using a corpus of a target speaker includes: and training a voice synthesis model on the basis of the pre-training voice synthesis model by taking at least one sentence of text of a target speaker as an input signal of the first word embedding layer, taking a tone mark of the target speaker as an input signal of the pronunciation unit embedding layer and taking voice recorded by the target speaker as a supervision signal output by a post-processing network.
In one implementation, the method further comprises: the number of iterations when training the voice synthesis model using the corpus of the target speaker is less than or equal to a preset threshold value.
In one implementation, when the model loss of the speech synthesis model on the verification data of two consecutive iterations in the training is not lower than the lowest loss of the past iterations, the training of the speech synthesis model is finished; the verification data includes at least one sentence of text in the corpus of the target speaker.
In one implementation, training a speech synthesis model using a corpus of target speakers further comprises: solidifying or freezing part of parameters of the voice synthesis model, wherein the parameters are not adjusted in the training process of the voice synthesis model and are not subjected to gradient calculation, and the parameters of the pronunciation unit embedding layer are not included in the part of parameters; and/or setting the training priority of the pronunciation unit embedding layer to be higher than the priority of other parts of the speech synthesis model.
In a second aspect, an embodiment of the present application provides a speech synthesis system, including: the pre-training module is used for training a pre-training voice synthesis model by using at least two groups of linguistic data, wherein each group of linguistic data comprises a text and a recorded voice thereof, the recorded voice of each group of linguistic data has a tone, and the tones of the recorded voices in different groups of linguistic data are different; the training module is used for training the voice synthesis model based on the pre-training voice synthesis model by using the corpus of the target speaker, wherein the corpus of the target speaker comprises at least one sentence of text and voice of the at least one sentence of text recorded by the pronunciation of the target speaker; the deployment module is used for deploying the voice synthesis model in the voice synthesis system so that the voice synthesis system is used for synthesizing voice with target tone according to the input text, and the target tone is the tone of the target speaker; wherein the pre-trained speech synthesis model and the speech synthesis model have the same model structure.
According to the technical scheme of the embodiment of the application, a pre-training and training mode is adopted, in the training stage, the voice synthesis model can be trained only by a small amount of corpora of the target speaker, the efficiency is high, the speed is high, and the training of the voice synthesis model corresponding to the target speaker can be completed on line in real time or in quasi real time; in addition, in the pre-training stage, the voice synthesis model can be pre-trained by using a small amount of linguistic data of a plurality of common timbres, so that the requirement on the number of speakers is low, and the implementation is easy; in addition, because the speech synthesis model of the embodiment of the application is additionally provided with the pronunciation unit embedding layer corresponding to the training target speaker on the basis of the traditional model, the model can more accurately learn the tone characteristics of the target speaker, and therefore, the model has better effect of restoring the tone of the target speaker; in addition, because the pre-training stage and the training stage are carried out relatively independently, the embodiment of the application can freely select the corpora of different target speakers in the training stage, thereby obtaining the voice synthesis models of different target speakers and improving the tone coverage degree of the voice synthesis models.
Drawings
FIG. 1 is a logical block diagram of a speech synthesis model according to an embodiment of the present application;
FIG. 2 is a flowchart of a speech synthesis method for on-line training a personalized tone library according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a speech synthesis system provided in an embodiment of the present application;
FIG. 4 is a schematic block diagram of a speech synthesis system according to an embodiment of the present application.
Detailed Description
Speech synthesis refers to a technique of artificially synthesizing human speech. In the field of computers, speech synthesis may be implemented by a speech synthesis system composed of software programs and/or hardware. A speech synthesis system generally takes text as input and outputs speech corresponding to the text. Colloquially, a speech synthesis system may be implemented to make a computer read words like a human being.
Based on speech synthesis technology, a user (human) and a machine (such as a robot, a mobile phone, a smart speaker and the like) can realize human-computer interaction functions such as voice conversation. When the user speaks to the machine, the machine replies to the user with preset dialog text. Generally, the preset dialog text may include fixed text and variable text. The fixed text refers to text that does not change in any situation, and the variable text refers to text that changes in real time according to the specific situation; for example, the variable text may change according to personal information of the user, such as name, sex or an amount of money, or according to the date, such as holiday information. As a further example, in "Your credit card ending in 0081", "Your credit card ending in" belongs to the fixed text, and "0081" is variable text related to the user's bank card number.
In current human-computer interaction scenarios, fixed text can use speech recorded by the recording speaker, variable text can be synthesized by speech synthesis technology into speech with the same tone as that recorded speech (hereinafter referred to as same-tone speech), and the recorded speech and the same-tone speech are then spliced together to obtain the speech content replied to the user.
Currently, same-tone speech can be synthesized, for example, by the following two schemes:
the first scheme is as follows: firstly, recording a fixed text by adopting a specific tone characteristic to obtain fixed voice; then, carrying out voice synthesis on the variable text according to the tone characteristic to obtain variable voice with the same tone characteristic as the fixed voice; finally, the fixed voice and the variable voice are spliced to obtain the voice with the tone characteristic. The timbre features used herein include one or more of fundamental frequency, speech rate, pitch, and symbol interval duration of the sound.
In the first scheme, the timbres of the fixed speech and the variable speech are specified simultaneously by setting the timbre features, so that the timbre features of the fixed speech and the variable speech are consistent or close to each other. However, those skilled in the art will understand that the timbre of a voice is determined by many factors, not limited to fundamental frequency, speech rate, pitch, symbol interval duration and the like, but also including waveform, sound pressure, frequency spectrum, vibration characteristics of the vocal cords and so on. Timbre features based on fundamental frequency, speech rate, pitch and symbol interval duration mainly represent the pronunciation style or habit of the recording speaker and are not an accurate timbre, so this scheme cannot effectively obtain speech consistent with or close to the timbre of the fixed speech.
Scheme II: first, a universal speech synthesis system model covering a very large number of speakers is trained; the system model not only accepts text as input, but also takes a short speech segment of the target speaker, or its spectrum features, as additional input to extract the voice characteristics of that segment, and speech synthesis with the target tone (namely the tone of the target speaker) can then be completed by the system model.
The second scheme has the following defects: a speech synthesis system model covering a very large number of speakers needs to be prepared in advance, and the number of speakers and the timbre coverage determine the applicability and effect of the model; to obtain good effect and applicability, a large amount of speech from many speakers needs to be prepared as corpus, so the cost of building the model is extremely high. In addition, in the use stage, this scheme cannot compensate when there is an obvious difference between the timbre of the synthesized speech and the target timbre.
The embodiment of the application provides a voice synthesis method, which can use a small amount of recording of a target speaker to quickly and accurately train to obtain a voice synthesis model, and can accurately synthesize voice with the same tone as the target tone through the voice synthesis model.
The speech synthesis model provided by the embodiment of the application is realized based on a multilayer neural network. FIG. 1 is a logical block diagram of the speech synthesis model, in which the data stream from input to output flows from bottom to top. Specifically, the speech synthesis model includes, in order from the input side to the output side, a first word embedding layer embedding 1, an encoder, a repetition layer repeat, a decoder, and a post-processing network Postnet, where the output of each layer is coupled to the input of the next layer. The speech synthesis model may further comprise a pronunciation unit embedding layer embedding 2, whose output is coupled into the data stream of the speech synthesis model, for example to the input of the encoder, the input of the decoder, or the input of the post-processing network Postnet. The specific coupling mode may be vector concatenation or direct vector addition (when the dimensions are the same) between the output of the pronunciation unit embedding layer embedding 2 and the signal in the data stream, or another coupling mode available to neural networks, which is not limited in this embodiment of the application. It should be noted, however, that the coupled vector dimensions need to satisfy the input dimension requirements of the neural network layer they are fed into.
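To make the structure in FIG. 1 concrete, the following is a minimal PyTorch sketch. The layer sizes, the GRU and linear internals chosen for the encoder, decoder and post-processing network, and the choice of additive coupling at the encoder input are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class SpeechSynthesisModel(nn.Module):
    """Sketch of the logical structure in FIG. 1; sizes and internals are assumptions."""
    def __init__(self, vocab_size=100, n_timbres=6, dim=256, n_mels=80):
        super().__init__()
        self.embedding1 = nn.Embedding(vocab_size, dim)    # first word embedding layer (embedding 1)
        self.embedding2 = nn.Embedding(n_timbres, dim)     # pronunciation unit embedding layer (embedding 2)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # placeholder encoder
        self.decoder = nn.GRU(dim, dim, batch_first=True)  # placeholder decoder
        self.postnet = nn.Linear(dim, n_mels)              # placeholder post-processing network

    def repeat_layer(self, enc_out, n_frames):
        """'Repeated coding layer': upsample encoder states to the target frame length."""
        reps = n_frames // enc_out.size(1) + 1
        return enc_out.repeat_interleave(reps, dim=1)[:, :n_frames]

    def forward(self, unit_ids, timbre_id, n_frames):
        x = self.embedding1(unit_ids)                      # (batch, text_len, dim)
        spk = self.embedding2(timbre_id).unsqueeze(1)      # (batch, 1, dim)
        x = x + spk                                        # couple embedding 2 at the encoder input
        enc_out, _ = self.encoder(x)
        dec_out, _ = self.decoder(self.repeat_layer(enc_out, n_frames))
        return self.postnet(dec_out)                       # predicted spectrogram frames
```

In this sketch the output of embedding 2 is added at the encoder input; concatenating it before the decoder or before the post-processing network would realize the other coupling options described above.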
The following describes a speech synthesis method provided in the embodiment of the present application in detail with reference to the logical structure of the speech synthesis model shown in fig. 1.
Fig. 2 is a flowchart of a speech synthesis method capable of training a personalized tone library online according to an embodiment of the present application. In one embodiment, the method, as shown in FIG. 2, may include the steps of:
step S101, at least two sets of linguistic data are used for training a multi-tone pre-training voice synthesis model, wherein each set of linguistic data comprises texts and recorded voices of the texts, the recorded voices of each set of linguistic data have one tone, and the tones of the recorded voices of different sets of linguistic data are different.
Considering common user conversation scenarios, when the corpora include two groups, one group can be recorded by a male speaker, i.e., recorded speech with a male tone, and the other group can be recorded by a female speaker, i.e., recorded speech with a female tone.
As a preferred implementation, the corpora may include four groups, which respectively record speech of four tones corresponding to a young female voice, a young male voice, a mature female voice, and a mature male voice.
In some other implementations, the corpora may also be grouped in other ways, for example with finer divisions of timbre such as a sweet female voice, a cool female voice, a broadcast-style female voice, and the like, which is not limited in the embodiments of the present application.
As a preferred implementation, the amount of text in each set (each tone) of corpus may be above 1000 sentences.
Based on these corpora, the pre-training speech synthesis model can take the text as an input signal during training and the recorded speech as an output supervision signal.
For example, in the case of Chinese, a text may be represented as a pronunciation unit sequence composed of a plurality of pronunciation units, where each pronunciation unit corresponds to the pronunciation of one Chinese character and may include pinyin text and a tone symbol.
For example, wo3 is a pronunciation unit, where wo is the pinyin text and 3 is the tone symbol.
For example, wo3 ai4 ni3 is a pronunciation unit sequence comprising the three pronunciation units "wo3", "ai4" and "ni3", corresponding to the pronunciations of the three Chinese characters for "I", "love" and "you"; the recorded speech corresponding to this pronunciation unit sequence is therefore "I love you".
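As a small illustration of this representation (a sketch only; the parsing rule and the helper name parse_units are assumptions), a pronunciation unit sequence can be split into pinyin text and tone symbols before being mapped to ids for the first word embedding layer:

```python
def parse_units(sequence: str):
    """Split a pronunciation unit sequence into (pinyin, tone) pairs, e.g. 'wo3' -> ('wo', 3)."""
    units = []
    for token in sequence.split():
        pinyin = token.rstrip("012345")        # pinyin text part
        tone = int(token[len(pinyin):] or 0)   # tone symbol part (0 if absent)
        units.append((pinyin, tone))
    return units

print(parse_units("wo3 ai4 ni3"))  # [('wo', 3), ('ai', 4), ('ni', 3)]
```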
In specific implementation, the text may be input to the first word embedding layer embedding 1 of the pre-training speech synthesis model, and a loss distance between the spectrum signal corresponding to the recorded speech and the output signal of the post-processing network Postnet may be calculated by a dynamic programming (DP) algorithm; this loss is used as the supervision signal for model training.
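The dynamic-programming loss distance mentioned above can be illustrated with a plain dynamic time warping (DTW) sketch over two spectrogram sequences; the patent only names a DP algorithm, so treating it as DTW with a Euclidean frame distance is an assumption.

```python
import numpy as np

def dtw_distance(pred: np.ndarray, target: np.ndarray) -> float:
    """DTW distance between predicted and recorded spectrograms, each shaped (frames, mel_bins)."""
    n, m = len(pred), len(target)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(pred[i - 1] - target[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```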
It should be particularly noted that, in the embodiment of the present application, the input signal of the pre-training speech synthesis model may further include a tone identifier (also referred to as a speaker identifier, speaker ID) corresponding to each group of linguistic data, which may specifically be input to the pronunciation unit embedding layer embedding 2. For example, when there are four groups of linguistic data corresponding to four tones, the tone identifiers may be 1, 2, 3 and 4 accordingly. In addition, the embodiment of the application may reserve at least one additional tone identifier for the target speaker, such as 5 and/or 6.
As a preferred implementation, the embodiment of the present application preferably reserves an additional timbre identification, and if there are multiple target speakers, the multiple target speakers may share the additional timbre identification.
It should be noted that, although the pre-training speech synthesis model supports the tone identifiers 1 to 4 corresponding to the corpora and the tone identifier 5 of the target speaker during the pre-training stage, the pre-trained model can only predict synthesis results corresponding to tone identifiers 1 to 4; the synthesis result corresponding to tone identifier 5 is unknown at this stage.
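Combining the pieces above, one pre-training iteration might look like the sketch below. It assumes the model class from the earlier sketch, uses a simple L1 spectrogram loss merely as a stand-in for the DP distance above, and follows the identifier scheme described here (tone identifiers 1 to 4 for the corpora, identifier 5 reserved for the target speaker); none of these choices are prescribed by the patent.

```python
import torch.nn.functional as F

RESERVED_TIMBRE_ID = 5   # reserved for the target speaker; never appears in pre-training data

def pretrain_step(model, optimizer, unit_ids, timbre_id, target_spec):
    """One pre-training iteration: text units feed embedding 1, the tone identifier feeds
    embedding 2, and the recorded-speech spectrogram supervises the Postnet output."""
    optimizer.zero_grad()
    pred_spec = model(unit_ids, timbre_id, target_spec.size(1))
    loss = F.l1_loss(pred_spec, target_spec)   # stand-in for the DP/DTW supervision signal
    loss.backward()
    optimizer.step()
    return loss.item()
```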
Step S102, a voice synthesis model is trained on the basis of the pre-training voice synthesis model using the corpus of the target speaker, wherein the corpus of the target speaker includes at least one sentence of text and the corresponding speech recorded in the target speaker's own voice.
In the concrete implementation, at least one sentence of text of a target speaker is used as an input signal of a first word embedding layer, a tone mark of the target speaker is used as an input signal of a pronunciation unit embedding layer, a recorded voice of the target speaker is used as a supervision signal output by a post-processing network, and at least one round of training iteration is carried out on a voice synthesis model. The training target may specifically be a voice feature of the recorded voice, such as a spectrum feature, and the like, which is not specifically limited in this embodiment of the application.
In step S102, the number of training iterations of the speech synthesis model may be limited, so as to improve the real-time performance of the speech synthesis method. Specifically, in order to achieve both real-time performance and training effect, the maximum number of training iterations of the speech synthesis model is preferably set to 10 (or other values may be used, which are not strictly limited here).
In some implementations, the embodiment of the present application may further set an early-stopping strategy for training of the speech synthesis model. This strategy can be implemented by configuring a rule for ending training early, for example, ending training when the model loss on the validation data of two consecutive iterations is not lower than the lowest loss of past iterations.
It should be added that, when the early termination strategy is set, the total data amount of model training is not less than two sentences of texts of the target speaker and the corresponding recorded voice, wherein at least 1 sentence of text is used as verification data, and the rest of texts are used as training data. For example: when the total data volume comprises 10 sentences of texts, 9 sentences of texts can be configured as training data, and the other 1 sentence of texts can be configured as verification data; 8 sentences of texts can be configured as training data, and the other 2 sentences of texts can be configured as verification data; by analogy, the embodiments of the present application are not specifically limited herein.
In step S102, the embodiment of the present application may further limit the total data amount of the model training to improve the real-time performance of the speech synthesis method, and it is generally expected that a smaller amount of text and recorded speech of the target speaker are used to implement the fast training of the speech synthesis model. For example, the total amount of text of the target speaker may be limited to be not higher than 10 sentences, 50 sentences, 100 sentences, etc., and the embodiment of the present application is not particularly limited herein.
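A minimal sketch of this training stage, reusing the model and pretrain_step sketches above: the iteration cap of 10, the rule of stopping when the validation loss fails to improve on two consecutive iterations, and the train/validation split come from the text, while the data format and the L1 validation loss are assumptions.

```python
import torch
import torch.nn.functional as F

def finetune(model, optimizer, train_set, val_set, max_iters=10):
    """Train the speech synthesis model on the target speaker's corpus with early stopping."""
    best_val, bad_rounds = float("inf"), 0
    for _ in range(max_iters):                       # at most `max_iters` iterations (step S102)
        for unit_ids, timbre_id, target_spec in train_set:
            pretrain_step(model, optimizer, unit_ids, timbre_id, target_spec)
        with torch.no_grad():                        # validation on held-out sentences
            val_loss = sum(F.l1_loss(model(u, t, s.size(1)), s).item()
                           for u, t, s in val_set) / len(val_set)
        if val_loss < best_val:
            best_val, bad_rounds = val_loss, 0
        else:
            bad_rounds += 1
            if bad_rounds >= 2:                      # not below the past minimum twice in a row
                break
    return model
```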
In step S102, considering the complexity of the speech synthesis model and the limit that the real-time requirement places on training time, some strategies may be adopted to speed up model convergence. The optimization strategies that may be adopted include, but are not limited to, the following (a code sketch of both strategies is given after the list):
1. Solidify or freeze part of the parameters of the speech synthesis model, such as the number of neural network layers, parameter values, model precision and/or loss function values, so that these parameters are not adjusted during model training and require no gradient calculation. In specific implementation, when solidifying or freezing part of the parameters of the speech synthesis model, at least the parameters of the pronunciation unit embedding layer embedding 2 should remain trainable and adjustable, so that the speech synthesis model can learn the timbre characteristics of the target speaker.
2. If parts other than the parameters of the pronunciation unit embedding layer embedding 2 can also be adjusted by training, it should be ensured that the training priority of the pronunciation unit embedding layer embedding 2 is higher than the priority of the other parts, so that the speech synthesis model preferentially learns the timbre characteristics of the target speaker. Ensuring that the training priority of the pronunciation unit embedding layer embedding 2 is higher than that of the other parts can be achieved in various ways, including but not limited to:
a. The learning rate of the pronunciation unit embedding layer embedding 2 is set to be larger than that of the other parts. As a preferred implementation, in order to highlight the training priority of the pronunciation unit embedding layer embedding 2, its learning rate may be significantly greater than that of the other parts, for example 100 times the learning rate of the other parts, which is not limited in the embodiment of the present application.
b. In the first N iterations, only the parameters of the pronunciation unit embedding layer embedding 2 are trained; after the first N iterations are completed, the other parameters can also be trained.
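Under the assumptions of the earlier PyTorch sketches (the attribute name embedding2 comes from that sketch, not from the patent), the parameter-freezing strategy and the learning-rate priority strategy might be expressed as follows; strategy b could be added by toggling requires_grad back on after the first N iterations.

```python
import torch

def freeze_except_embedding2(model):
    """Strategy 1: solidify/freeze parameters outside embedding 2 (no adjustment, no gradients)."""
    for name, param in model.named_parameters():
        param.requires_grad_(name.startswith("embedding2"))

def prioritized_optimizer(model, base_lr=1e-4, scale=100.0):
    """Strategy 2a: give the pronunciation unit embedding layer a much larger learning rate."""
    emb2_params = list(model.embedding2.parameters())
    other_params = [p for n, p in model.named_parameters() if not n.startswith("embedding2")]
    return torch.optim.Adam([
        {"params": emb2_params, "lr": base_lr * scale},   # e.g. 100x the rate of the other parts
        {"params": other_params, "lr": base_lr},
    ])
```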
It can be understood that, since the embodiment of the present application trains the speech synthesis model using the text and the recorded speech of the target speaker in step S102, the obtained trained speech synthesis model has the capability of synthesizing the input text with the speech having the same tone as the target tone, and it is seen that the trained speech synthesis model has a corresponding relationship with the target speaker. To facilitate the description of this correspondence, an identification of the target speaker, e.g., x, can be associated with the trained speech synthesis model, and then the speech synthesis model associated with the target speaker identification x can be represented as Mx, where x is used to find the corresponding Mx model in the speech synthesis system.
Step S103, deploying the voice synthesis model in the voice synthesis system, so that the voice synthesis system is used for synthesizing the voice of the target tone according to the input text.
Fig. 3 is a schematic structural diagram of a speech synthesis system according to an embodiment of the present application. As shown in fig. 3, at least one speech synthesis model 100 provided by the embodiment of the present application can be deployed in the speech synthesis system. When a speech synthesis model 100 is deployed in a speech synthesis system, the speech synthesis system has the capability of synthesizing speech of a target timbre for a text; when a plurality of speech synthesis models are deployed in the speech synthesis system, the plurality of speech synthesis models may be speech synthesis models obtained by training the corpora of different target speakers, so that the speech synthesis system has the capability of synthesizing speech of a plurality of target timbres for the text. Where different target speakers have different identifications, e.g., x, y, z, the corresponding speech synthesis models may be expressed as Mx, My, Mz, etc., and the timbre of the different target speakers is different.
In addition, in the speech synthesis system, each speech synthesis model can exist relatively independently, and parameters of each speech synthesis model are relatively isolated and can be called by the speech synthesis system independently.
Based on the voice synthesis system, when the voice of a target speaker needs to be synthesized, the text and the identification of the target speaker can be transmitted into the voice synthesis system; next, the speech synthesis system determines a target speech synthesis model to be used from the incoming identification, and then inputs the text and the identification of the target speaker into the target speech synthesis model; next, the target speech synthesis model executes a speech synthesis process to output speech of the target tone; finally, the voice can be played outwards through a playing device such as a loudspeaker, or the voice is spliced with other prerecorded voices of the target speaker and then played outwards through the playing device such as the loudspeaker.
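A hedged sketch of this runtime flow, assuming the model sketch above: the registry keyed by target speaker identifier, the deploy/synthesize method names, and the toy preprocess front end are illustrative assumptions, not the patent's interface.

```python
import torch

def preprocess(text: str):
    """Hypothetical front end: map text to pronunciation unit ids and a frame count (toy stub)."""
    unit_ids = torch.tensor([[hash(tok) % 100 for tok in text.split()]])   # placeholder ids
    return unit_ids, unit_ids.size(1) * 20                                 # assume ~20 frames per unit

class SpeechSynthesisSystem:
    """Holds independently trained models keyed by target speaker identifier (Mx, My, Mz, ...)."""
    def __init__(self):
        self.models = {}       # e.g. {"x": Mx, "y": My, "z": Mz}; parameters stay isolated per model
        self.timbre_ids = {}   # tone identifier reserved for each target speaker, e.g. {"x": 5}

    def deploy(self, speaker_id, model, timbre_id=5):
        """Step S103: deploy a trained speech synthesis model for one target speaker."""
        self.models[speaker_id] = model
        self.timbre_ids[speaker_id] = timbre_id

    def synthesize(self, text, speaker_id):
        """Select the model from the incoming identifier and synthesize speech of the target tone."""
        model = self.models[speaker_id]
        unit_ids, n_frames = preprocess(text)
        timbre = torch.tensor([self.timbre_ids[speaker_id]])
        with torch.no_grad():
            return model(unit_ids, timbre, n_frames)   # spectrogram in the target speaker's timbre
```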
It should be noted that, in one implementation, when the number of target speakers is large, which results in an excessive number of speech synthesis models in the speech synthesis system, or when the requirement on the timbre accuracy of the speech synthesis system is low or the amount of speech material of the target speakers is large, the parameters of a plurality of target speakers may be arranged in the same speech synthesis model. In the specific implementation, a plurality of identifiers of target speakers can be reserved in a speech synthesis model, each target speaker corresponds to one identifier, and the identifiers corresponding to different target speakers are different; then, training a speech synthesis model by using the linguistic data of each target speaker; thus, parameters corresponding to a plurality of target speakers can exist in the voice synthesis model, so that the voice synthesis model has the capability of synthesizing voice with a plurality of target timbres.
The voice synthesis method provided by the embodiment of the application uses at least two groups of linguistic data to pre-train a voice synthesis model, each group of linguistic data comprises a text and a recorded voice of the text, the recorded voice of each group of linguistic data has a tone, and the tones of the recorded voices of different groups of linguistic data are different; then, further training a voice synthesis model by using target speaker data, wherein the target speaker data comprises at least one text of a target speaker and a recorded voice corresponding to the text; finally, the speech synthesis model is deployed in any speech synthesis system, so that the speech synthesis system is used for synthesizing the speech of the target tone according to the text and the identification of the target speaker. According to the technical scheme of the embodiment of the application, a pre-training and training mode is adopted, in the training stage, the voice synthesis model can be trained only by a small amount of corpora of the target speaker, the efficiency is high, the speed is high, and the training of the voice synthesis model corresponding to the target speaker can be completed on line in real time or in quasi real time; in addition, in the pre-training stage, the voice synthesis model can be pre-trained by using a small amount of linguistic data of a plurality of common timbres, so that the requirement on the number of speakers is low, and the implementation is easy; in addition, because the speech synthesis model of the embodiment of the application is additionally provided with the pronunciation unit embedding layer corresponding to the training target speaker on the basis of the traditional model, the model can more accurately learn the tone characteristics of the target speaker, and therefore, the model has better effect of restoring the tone of the target speaker; in addition, because the pre-training stage and the training stage are carried out relatively independently, the embodiment of the application can freely select the corpora of different target speakers in the training stage, thereby obtaining the voice synthesis models of different target speakers and improving the tone coverage degree of the voice synthesis models.
The above embodiments describe various aspects of the speech synthesis method provided in the present application. It is to be understood that the elements and algorithm steps of the various examples described in connection with the embodiments disclosed herein may be embodied in hardware, software, or a combination of hardware and software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
FIG. 4 is a schematic block diagram of a speech synthesis system according to an embodiment of the present application. In one embodiment, the system implements the corresponding functions through software modules. As shown in FIG. 4, the system may include:
a pre-training module 201, configured to train a pre-training speech synthesis model using at least two sets of corpora, where each set of corpora includes a text and a recorded speech thereof, the recorded speech of each set of corpora has a tone, and the tones of the recorded speech in different sets of corpora are different;
a training module 202, configured to train a speech synthesis model based on a pre-trained speech synthesis model using a corpus of a target speaker, where the corpus of the target speaker includes at least one text and a speech of the at least one text recorded by a pronunciation of the target speaker;
the deployment module 203 is configured to deploy the speech synthesis model in the speech synthesis system, so that the speech synthesis system is configured to synthesize speech of a target timbre according to the input text, where the target timbre is a timbre of the target speaker.
Wherein the pre-trained speech synthesis model and the speech synthesis model have the same model structure.
Embodiments of the present application also provide a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-mentioned aspects.
Embodiments of the present application also provide a computer program product containing instructions which, when executed on a computer, cause the computer to perform the method of the above aspects.
Embodiments of the present application further provide a chip system, which includes a processor and is configured to enable the system to implement the functions referred to in the foregoing aspects, for example, to generate or process information referred to in the foregoing methods. In one possible design, the chip system further includes a memory for storing the computer instructions and data necessary for the chip system. The chip system may consist of a chip, or may include a chip and other discrete devices.
The above embodiments are only intended to be specific embodiments of the present application, and are not intended to limit the scope of the embodiments of the present application, and any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the embodiments of the present application should be included in the scope of the embodiments of the present application.

Claims (6)

1. A speech synthesis method capable of training a personalized tone library on line is characterized by comprising the following steps:
training a pre-training speech synthesis model by using at least two groups of linguistic data, wherein each group of linguistic data comprises a text and a recorded speech thereof, the recorded speech of each group of linguistic data has a tone, and the tones of the recorded speech in different groups of linguistic data are different;
training a voice synthesis model based on the pre-training voice synthesis model using a corpus of a target speaker, the corpus of the target speaker including at least one sentence of text and a voice of the at least one sentence of text recorded by a pronunciation of the target speaker;
deploying the speech synthesis model in a speech synthesis system, so that the speech synthesis system is used for synthesizing speech with a target tone according to input text, wherein the target tone is the tone of a target speaker;
wherein the pre-trained speech synthesis model and the speech synthesis model have the same model structure;
the speech synthesis model comprises the following steps from input to output in sequence: the system comprises a first word embedding layer, an encoder, a repeated coding layer, a decoder and a post-processing network; the first word embedding layer, the encoder, the repetition coding layer, the decoder, and the post-processing network are coupled in sequence to form a data stream; the speech synthesis model further comprises a pronunciation unit embedding layer, an output of the pronunciation unit embedding layer being coupled into the data stream;
the training of the pre-trained speech synthesis model using at least two sets of corpora includes: training the pre-training voice synthesis model by taking texts in the at least two groups of linguistic data as input signals of the first word embedding layer, taking the tone mark corresponding to each group of linguistic data as an input signal of the pronunciation unit embedding layer, wherein the tone marks corresponding to different groups of linguistic data are different, and taking frequency spectrum signals corresponding to the recorded voice as supervision signals output by the post-processing network; at least one tone mark is reserved as the tone mark of the target speaker;
the training of the speech synthesis model based on the pre-trained speech synthesis model using the corpus of the target speaker comprises: and training the voice synthesis model on the basis of the pre-training voice synthesis model by taking at least one sentence of text of the target speaker as an input signal of the first word embedding layer, taking the tone mark of the target speaker as an input signal of the pronunciation unit embedding layer, taking the voice recorded by the target speaker as a supervision signal output by the post-processing network.
2. The method of claim 1, wherein the output of the pronunciation unit embedding layer is coupled into the data stream in a manner that includes any one or more of:
an output of the pronunciation unit embedding layer is coupled to an input of the encoder;
an output of the pronunciation unit embedding layer is coupled to an input of the decoder;
an output of the pronunciation unit embedding layer is coupled to an input of the post-processing network.
3. The method of claim 1, further comprising: the number of iterations when the voice synthesis model is trained using the corpus of the target speaker is less than or equal to a preset threshold value.
4. The method of claim 1, wherein training the speech synthesis model is terminated when model loss of the speech synthesis model on validation data of two consecutive iterations in training is not below a lowest loss of a past iteration; the verification data includes at least one sentence of text in the corpus of the target speaker.
5. The method according to any one of claims 1-4, wherein the training of the speech synthesis model using the corpus of the target speaker further comprises:
solidifying or freezing part of parameters of the speech synthesis model, wherein the part of parameters are not adjusted in the training process of the speech synthesis model and are not subjected to gradient calculation, and the part of parameters do not comprise the parameters of the pronunciation unit embedding layer;
and/or the like, and/or,
the training priority of the pronunciation unit embedding layer is set to be greater than the priority of other parts of the speech synthesis model.
6. A speech synthesis system, comprising:
the pre-training module is used for training a pre-training voice synthesis model by using at least two groups of linguistic data, wherein each group of linguistic data comprises a text and a recorded voice thereof, the recorded voice of each group of linguistic data has a tone, and the tones of the recorded voices in different groups of linguistic data are different;
a training module for training a speech synthesis model based on the pre-trained speech synthesis model using a corpus of a target speaker, the corpus of the target speaker including at least one text and a speech of the at least one text recorded by a pronunciation of the target speaker;
the deployment module is used for deploying the voice synthesis model in a voice synthesis system so that the voice synthesis system is used for synthesizing voice with a target tone according to input text, and the target tone is the tone of a target speaker;
wherein the pre-trained speech synthesis model and the speech synthesis model have the same model structure;
the speech synthesis model comprises the following steps from input to output in sequence: the system comprises a first word embedding layer, an encoder, a repeated coding layer, a decoder and a post-processing network; the first word embedding layer, the encoder, the repetition coding layer, the decoder, and the post-processing network are coupled in sequence to form a data stream; the speech synthesis model further comprises a pronunciation unit embedding layer, an output of the pronunciation unit embedding layer being coupled into the data stream;
the pre-training module is specifically configured to train the pre-training speech synthesis model by using texts in the at least two sets of corpora as input signals of the first word embedding layer, using spectrum signals corresponding to the recorded speech as supervision signals output by the post-processing network, using tone identifiers corresponding to each set of corpora as input signals of a pronunciation unit embedding layer, where the tone identifiers corresponding to different sets of corpora are different; at least one tone mark is reserved as the tone mark of the target speaker;
the training module is specifically configured to train the speech synthesis model on the basis of the pre-trained speech synthesis model, with at least one text of the target speaker as an input signal of the first word embedding layer, with a tone mark of the target speaker as an input signal of the pronunciation unit embedding layer, and with speech recorded by the target speaker as a supervisory signal output by the post-processing network.
CN202110271444.9A 2021-03-12 2021-03-12 Voice synthesis method and system capable of training personalized tone library on line Active CN113035169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110271444.9A CN113035169B (en) 2021-03-12 2021-03-12 Voice synthesis method and system capable of training personalized tone library on line

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110271444.9A CN113035169B (en) 2021-03-12 2021-03-12 Voice synthesis method and system capable of training personalized tone library on line

Publications (2)

Publication Number Publication Date
CN113035169A CN113035169A (en) 2021-06-25
CN113035169B true CN113035169B (en) 2021-12-07

Family

ID=76470483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110271444.9A Active CN113035169B (en) 2021-03-12 2021-03-12 Voice synthesis method and system capable of training personalized tone library on line

Country Status (1)

Country Link
CN (1) CN113035169B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114141228B (en) * 2021-12-07 2022-11-08 北京百度网讯科技有限公司 Training method of speech synthesis model, speech synthesis method and device
CN114566143B (en) * 2022-03-31 2022-10-11 北京帝派智能科技有限公司 Voice synthesis method and voice synthesis system capable of locally modifying content

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102057926B1 (en) * 2019-03-19 2019-12-20 휴멜로 주식회사 Apparatus for synthesizing speech and method thereof
CN111243620A (en) * 2020-01-07 2020-06-05 腾讯科技(深圳)有限公司 Voice separation model training method and device, storage medium and computer equipment
CN111383627A (en) * 2018-12-28 2020-07-07 北京猎户星空科技有限公司 Voice data processing method, device, equipment and medium
CN111899716A (en) * 2020-08-03 2020-11-06 北京帝派智能科技有限公司 Speech synthesis method and system
CN112309366A (en) * 2020-11-03 2021-02-02 北京有竹居网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112349273A (en) * 2020-11-05 2021-02-09 携程计算机技术(上海)有限公司 Speech synthesis method based on speaker, model training method and related equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100363027B1 (en) * 2000-07-12 2002-12-05 (주) 보이스웨어 Method of Composing Song Using Voice Synchronization or Timbre Conversion
CN104766603B (en) * 2014-01-06 2019-03-19 科大讯飞股份有限公司 Construct the method and device of personalized singing style Spectrum synthesizing model
CN112365880B (en) * 2020-11-05 2024-03-26 北京百度网讯科技有限公司 Speech synthesis method, device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111383627A (en) * 2018-12-28 2020-07-07 北京猎户星空科技有限公司 Voice data processing method, device, equipment and medium
KR102057926B1 (en) * 2019-03-19 2019-12-20 휴멜로 주식회사 Apparatus for synthesizing speech and method thereof
CN111243620A (en) * 2020-01-07 2020-06-05 腾讯科技(深圳)有限公司 Voice separation model training method and device, storage medium and computer equipment
CN111899716A (en) * 2020-08-03 2020-11-06 北京帝派智能科技有限公司 Speech synthesis method and system
CN112309366A (en) * 2020-11-03 2021-02-02 北京有竹居网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112349273A (en) * 2020-11-05 2021-02-09 携程计算机技术(上海)有限公司 Speech synthesis method based on speaker, model training method and related equipment

Also Published As

Publication number Publication date
CN113035169A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
JP5768093B2 (en) Speech processing system
US10410621B2 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
CN105551481B (en) The prosodic labeling method and device of voice data
CN103065620B (en) Method with which text input by user is received on mobile phone or webpage and synthetized to personalized voice in real time
CN113035169B (en) Voice synthesis method and system capable of training personalized tone library on line
CN101578659A (en) Voice tone converting device and voice tone converting method
CN108615525A (en) A kind of audio recognition method and device
CN104157285A (en) Voice recognition method and device, and electronic equipment
CN108573694A (en) Language material expansion and speech synthesis system construction method based on artificial intelligence and device
CN107851436A (en) Voice interactive method and interactive voice equipment
CN106057192A (en) Real-time voice conversion method and apparatus
CN109545194A (en) Wake up word pre-training method, apparatus, equipment and storage medium
CN110853616A (en) Speech synthesis method, system and storage medium based on neural network
WO2018175892A1 (en) System providing expressive and emotive text-to-speech
CN111627420A (en) Specific-speaker emotion voice synthesis method and device under extremely low resources
CN110390928A (en) It is a kind of to open up the speech synthesis model training method and system for increasing corpus automatically
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN116798405B (en) Speech synthesis method, device, storage medium and electronic equipment
CN108877835A (en) Evaluate the method and system of voice signal
CN107910005A (en) The target service localization method and device of interaction text
CN116312471A (en) Voice migration and voice interaction method and device, electronic equipment and storage medium
Zhang et al. AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents
JP6330069B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
KR20220070979A (en) Style speech synthesis apparatus and speech synthesis method using style encoding network
KR20210149608A (en) Method and apparatus for emotional text-to-speech synthesis based on learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant