CN112365876A - Method, device and equipment for training speech synthesis model and storage medium - Google Patents


Info

Publication number
CN112365876A
Authority
CN
China
Prior art keywords
training
synthesis model
speech synthesis
voice
user
Prior art date
Legal status
Granted
Application number
CN202011364603.1A
Other languages
Chinese (zh)
Other versions
CN112365876B (en)
Inventor
刘龙飞
陈昌滨
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011364603.1A
Publication of CN112365876A
Application granted
Publication of CN112365876B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Abstract

The application discloses a training method, apparatus, device and storage medium for a speech synthesis model, relating to the technical fields of deep learning and speech. The specific implementation scheme is as follows: acquire user sample data, an initial speech synthesis model and corresponding pre-training data; divide the user sample data into first sample data and second sample data; train the initial speech synthesis model with the first sample data and the pre-training data, and collect a plurality of first speech synthesis models produced during training; select a target speech synthesis model from the plurality of first speech synthesis models; and perform fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model. In this way, a speech synthesis model capable of outputting high-quality speech synthesis results can be trained from only a small amount of user sample data, and the speech synthesis process is fast and inexpensive.

Description

Method, device and equipment for training speech synthesis model and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to the field of deep learning and speech technologies, and in particular, to a method and an apparatus for training a speech synthesis model, an electronic device, and a storage medium.
Background
With the continuous development of artificial intelligence and multimedia technology, speech synthesis is applied ever more widely, for example in map voice-packet customization, celebrity customer service, smart-speaker broadcasting, novel narration and other scenarios.
In the related art, methods that realize personalized speech synthesis from a small amount of user speech data produce voice packets of mediocre naturalness and fluency, so the quality of the speech synthesis result is low. Methods that rely on a large amount of user speech data require the user to record extensive speech in a professional recording studio, which is costly, and the speech synthesis process is time-consuming. A speech synthesis method that is fast, inexpensive and yields high-quality speech synthesis results is therefore needed.
Disclosure of Invention
The disclosure provides a training method, a device, equipment and a storage medium of a speech synthesis model.
According to an aspect of the present disclosure, there is provided a method for training a speech synthesis model, including: acquiring user sample data, an initial speech synthesis model and corresponding pre-training data, wherein the user sample data comprises a plurality of user voices and the text corresponding to each user voice; dividing the user sample data to obtain first sample data and second sample data, wherein the number of user voices in the first sample data is larger than that in the second sample data; training the initial speech synthesis model with the first sample data and the pre-training data, and acquiring a plurality of first speech synthesis models obtained during training; selecting a target speech synthesis model from the plurality of first speech synthesis models; and performing fine-tuning training on the target speech synthesis model with the second sample data to obtain a trained speech synthesis model.
According to another aspect of the present disclosure, there is provided a training apparatus for a speech synthesis model, including: a first obtaining module, configured to obtain user sample data, an initial speech synthesis model and corresponding pre-training data, where the user sample data comprises a plurality of user voices and the text corresponding to each user voice; a dividing module, configured to divide the user sample data to obtain first sample data and second sample data, where the number of user voices in the first sample data is larger than that in the second sample data; a first training module, configured to train the initial speech synthesis model with the first sample data and the pre-training data and to acquire a plurality of first speech synthesis models obtained during training; a selection module, configured to select a target speech synthesis model from the plurality of first speech synthesis models; and a second training module, configured to perform fine-tuning training on the target speech synthesis model with the second sample data to obtain a trained speech synthesis model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of training a speech synthesis model as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the training method of a speech synthesis model as described above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the method of training a speech synthesis model as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart of a method of training a speech synthesis model according to a first embodiment of the present application;
FIG. 2 is a schematic flow chart of a method of training a speech synthesis model according to a second embodiment of the present application;
FIG. 3 is a flow chart illustrating a method for training a speech synthesis model according to a third embodiment of the present application;
FIG. 4 is a schematic structural diagram of a training apparatus for a speech synthesis model according to a fourth embodiment of the present application;
FIG. 5 is a schematic structural diagram of a training apparatus for a speech synthesis model according to a fifth embodiment of the present application;
FIG. 6 is a block diagram of an electronic device for implementing a method for training a speech synthesis model according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, methods that realize personalized speech synthesis from a small amount of user speech data produce voice packets of mediocre naturalness and fluency, so the quality of the speech synthesis result is low. Methods that rely on a large amount of user speech data require the user to record extensive speech in a professional recording studio, which is costly, and the speech synthesis process is time-consuming. A speech synthesis method that is fast, inexpensive and yields high-quality speech synthesis results is therefore needed.
To address these problems, the embodiments of the application provide a training method for a speech synthesis model. First, user sample data, an initial speech synthesis model and corresponding pre-training data are acquired, where the user sample data comprises a plurality of user voices and the text corresponding to each user voice. The user sample data is then divided into first sample data and second sample data, with the number of user voices in the first sample data larger than that in the second. The initial speech synthesis model is trained with the first sample data and the pre-training data, and a plurality of first speech synthesis models produced during training are collected. A target speech synthesis model is selected from the plurality of first speech synthesis models, and fine-tuning training is performed on it with the second sample data to obtain the trained speech synthesis model. In this way, a speech synthesis model capable of outputting high-quality speech synthesis results can be trained from only a small amount of user sample data, and the speech synthesis process is fast and inexpensive.
A method, an apparatus, an electronic device, and a non-transitory computer-readable storage medium for training a speech synthesis model according to embodiments of the present application are described below with reference to the drawings.
First, referring to fig. 1, a method for training a speech synthesis model provided in the present application is described in detail.
Fig. 1 is a flowchart illustrating a method for training a speech synthesis model according to a first embodiment of the present application. It should be noted that the training method provided in this embodiment is executed by a training apparatus for the speech synthesis model, hereinafter referred to simply as the training apparatus. The training apparatus may be an electronic device or may be configured in an electronic device, so that a speech synthesis model capable of outputting high-quality speech synthesis results can be trained from only a small amount of user sample data.
The electronic device may be any stationary or mobile computing device capable of data processing, for example a mobile computing device such as a laptop, smartphone or wearable device, a stationary computing device such as a desktop computer, a server, or another type of computing device. The training apparatus may be an application installed in the electronic device for training the speech synthesis model, or a web page or application used by the managers or developers of that application to manage and maintain it; the present application does not limit this.
As shown in fig. 1, the method for training a speech synthesis model may include the following steps:
Step 101, obtaining user sample data, an initial speech synthesis model and corresponding pre-training data, wherein the user sample data comprises a plurality of user voices and the text corresponding to each user voice.
The user sample data is sample data of the specific target user whose voice is to be synthesized from the text to be synthesized. For example, when synthesized speech resembling the timbre and speaking style of Zhang San is required, the user sample data consists of a plurality of Zhang San's voices and the text corresponding to each voice.
In the embodiment of the present application, the user sample data may be small in scale; for example, it may contain only 300 voices of the specific target user and the text corresponding to each voice.
In an exemplary embodiment, the user sample data may be obtained in various ways, selected as needed in practical applications. For example, the user sample data may be recorded on site, or a plurality of the user's existing voices and the text corresponding to each voice may be used directly as the user sample data.
The pre-training data includes a large number of voices and texts corresponding to each voice, and the pre-training data and the user sample data may be from different users.
The initial speech synthesis model may be any model usable for speech synthesis, such as a neural network model, which is not limited in this application. The speech synthesis model may be a combination of an acoustic feature model and an audio decoder; the input of the initial speech synthesis model is the text to be synthesized, and the output is synthesized speech.
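For concreteness, the following is a minimal PyTorch-style sketch of the interface such a model might expose, assuming an acoustic-model-plus-audio-decoder design; the class and argument names are illustrative, not taken from the patent.

    import torch
    import torch.nn as nn

    class SpeechSynthesisModel(nn.Module):
        """Illustrative speech synthesis model: acoustic feature model + audio decoder."""

        def __init__(self, acoustic_model: nn.Module, vocoder: nn.Module):
            super().__init__()
            self.acoustic_model = acoustic_model  # text ids -> acoustic features (e.g. mel)
            self.vocoder = vocoder                # acoustic features -> speech waveform

        def forward(self, text_ids: torch.Tensor) -> torch.Tensor:
            features = self.acoustic_model(text_ids)
            return self.vocoder(features)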
Step 102, dividing the user sample data to obtain first sample data and second sample data, wherein the number of user voices in the first sample data is larger than that in the second sample data.
The number of the user voices in the first sample data and the number of the user voices in the second sample data may be set according to needs, which is not limited in this application.
In an exemplary embodiment, when the user sample data is divided, various manners may be adopted, and in practical application, the user sample data may be selected according to needs.
For example, a majority of the user voices in the user sample data, together with the text corresponding to each of those voices, may be randomly selected as the first sample data, with the remaining user voices and their corresponding texts serving as the second sample data. For example, if the user sample data contains 300 voices and the text corresponding to each voice, 280 randomly chosen voices and their texts may serve as the first sample data, and the other 20 voices and their texts as the second sample data.
Alternatively, the user voices in the user sample data may be sorted by acquisition time, such as the recording time; the earlier majority of the voices and their corresponding texts then serve as the first sample data, and the remaining voices and their texts as the second sample data. For example, if the user sample data contains 300 voices of Zhang San and the text corresponding to each voice, the 300 voices may be sorted by recording time, with the 280 voices recorded first and their texts as the first sample data, and the 20 voices recorded later and their texts as the second sample data.
It should be noted that these ways of dividing the user sample data are only illustrative; those skilled in the art may divide the user sample data by any method as needed, provided that the number of user voices in the first sample data is greater than the number in the second sample data. A minimal sketch of such a split follows.
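The sketch below implements the random division described above, assuming each sample is a (voice, text) pair; the 280/20 split and the function name are illustrative.

    import random

    def split_user_samples(samples, first_count=280, seed=0):
        """Randomly divide user samples into first and second sample data.

        The larger first set later joins the pre-training data; the smaller
        second set is held out for fine-tuning the target model.
        """
        shuffled = list(samples)
        random.Random(seed).shuffle(shuffled)
        return shuffled[:first_count], shuffled[first_count:]

    # e.g. 300 recordings: 280 pairs for joint training, 20 for fine-tuning
    # first_data, second_data = split_user_samples(user_samples)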
Step 103, training the initial speech synthesis model with the first sample data and the pre-training data, and acquiring a plurality of first speech synthesis models obtained during training.
In an exemplary embodiment, the initial speech synthesis model may be trained by deep learning, which performs better on large data sets than other machine learning methods. When training in a deep-learning manner, the first sample data can be merged into the pre-training data; the text in the updated pre-training data serves as input and the speech corresponding to that text as the target output. The initial speech synthesis model is then trained iteratively by continually adjusting its model parameters until the accuracy of its output meets a preset threshold, at which point training ends.
During training of the initial speech synthesis model, every adjustment of the model parameters yields a speech synthesis model with the adjusted parameters, so a plurality of trained first speech synthesis models with different model parameters can be obtained.
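As a concrete illustration, here is a minimal PyTorch-style sketch of such a loop that keeps a snapshot after each parameter update; it assumes the merged dataset yields (text_ids, target_speech) tensor pairs and a suitable loss_fn, and all names are illustrative.

    import copy
    import torch

    def train_and_snapshot(model, merged_dataset, loss_fn, epochs=1, lr=1e-4):
        """Train on the pre-training data merged with the first sample data,
        keeping a snapshot of the model parameters after every update."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        snapshots = []
        for _ in range(epochs):
            for text_ids, target_speech in merged_dataset:
                optimizer.zero_grad()
                loss = loss_fn(model(text_ids), target_speech)
                loss.backward()
                optimizer.step()
                # each adjusted-parameter model is a candidate first model
                snapshots.append(copy.deepcopy(model.state_dict()))
        return snapshots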
Because the plurality of first speech synthesis models are trained on the first sample data together with the pre-training data, their parameters adapt to the vocal characteristics of the specific target user, and their outputs are close to the real speech of that user.
Step 104, selecting a target speech synthesis model from the plurality of first speech synthesis models.
The target speech synthesis model may be the model among the plurality of first speech synthesis models whose output is closest to the real speech of the specific target user.
In one possible implementation, voices of the specific target user and the corresponding texts may be obtained as test data. Each text is input into the plurality of first speech synthesis models, and the output of each model is compared with the real voice corresponding to that text to determine their similarity; the first speech synthesis model with the highest similarity is then determined as the target speech synthesis model.
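In code, such a selection might look like the following sketch; similarity_fn is an assumed scoring function, for instance the negative of the dynamic time warping distance detailed in the third embodiment below.

    def select_target_model(first_models, test_pairs, similarity_fn):
        """Pick the first speech synthesis model whose outputs are most similar
        to the target user's real voices on held-out (voice, text) test pairs."""
        def average_similarity(model):
            scores = [similarity_fn(model(text), voice) for voice, text in test_pairs]
            return sum(scores) / len(scores)
        return max(first_models, key=average_similarity)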
Step 105, performing fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model.
In an exemplary embodiment, the model parameters of the target speech synthesis model may be fine-tuned by few-shot learning on the second sample data, yielding a trained speech synthesis model whose output is even closer to the real speech of the specific target user.
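The following is a minimal sketch of this fine-tuning stage, using the same tensor-pair convention as the training sketch above; the small learning rate and step count are illustrative few-shot settings, not values given in the patent.

    import torch

    def fine_tune(target_model, second_sample_data, loss_fn, steps=200, lr=1e-5):
        """Few-shot fine-tuning: a small learning rate and few steps over the
        held-out second sample data of the specific target user."""
        optimizer = torch.optim.Adam(target_model.parameters(), lr=lr)
        pairs = list(second_sample_data)
        for step in range(steps):
            text_ids, target_speech = pairs[step % len(pairs)]  # cycle the small set
            optimizer.zero_grad()
            loss = loss_fn(target_model(text_ids), target_speech)
            loss.backward()
            optimizer.step()
        return target_model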
In an exemplary embodiment, before model training, that is, before training the initial speech synthesis model and fine-tuning the target speech synthesis model, the training data (the pre-training data, the first sample data and the second sample data) may be preprocessed by denoising, detection, screening, segmentation and so on, for example by filtering out blank sections in the user voices, so as to improve the quality of the training data.
It should be noted that, for the training process of the initial speech synthesis model and the training process of the target speech synthesis model, reference may be made to a model training method in the related art, which is not described in detail in this embodiment of the present application.
It can be understood that, apart from the pre-training data, only a small amount of first sample data is needed to train the initial speech synthesis model and obtain the plurality of first speech synthesis models, so these models adapt to the vocal characteristics of the specific target user and output speech close to that user's real speech. The target speech synthesis model is then selected from the plurality of first speech synthesis models, and a small amount of second sample data is used to fine-tune it. This further adjusts its model parameters so that the trained speech synthesis model fits the target user's vocal characteristics even better and can output high-quality speech synthesis results close to the user's real speech. Moreover, because only small-scale user sample data is needed for training, the user does not have to record a large amount of speech data in a professional recording studio, which saves cost and shortens the time needed to obtain a speech synthesis result.
The training method of the speech synthesis model provided by the embodiment of the application first acquires user sample data, an initial speech synthesis model and corresponding pre-training data, the user sample data comprising a plurality of user voices and the text corresponding to each voice. It then divides the user sample data into first sample data and second sample data, with more user voices in the first than in the second; trains the initial speech synthesis model with the first sample data and the pre-training data, collecting a plurality of first speech synthesis models during training; selects a target speech synthesis model from them; and performs fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model. In this way, a speech synthesis model capable of outputting high-quality speech synthesis results can be trained from only small-scale user sample data, and the speech synthesis process is fast and inexpensive.
As can be seen from the above analysis, in the embodiment of the present application the initial speech synthesis model may be trained with the first sample data and the pre-training data, with a plurality of first speech synthesis models collected during training. This process is further described below with reference to fig. 2.
Fig. 2 is a flowchart illustrating a method for training a speech synthesis model according to a second embodiment of the present application. As shown in fig. 2, the method for training a speech synthesis model may include the following steps:
Step 201, obtaining user sample data, an initial speech synthesis model and corresponding pre-training data, wherein the user sample data includes: a plurality of user voices, and text corresponding to each user voice.
Step 202, dividing the user sample data to obtain first sample data and second sample data, wherein the number of the user voices in the first sample data is larger than that in the second sample data.
The specific implementation process and principle of the steps 201-202 may refer to the description of the above embodiments, and are not described herein again.
Step 203, merging the first sample data into the pre-training data to obtain updated pre-training data.
Step 204, training the initial speech synthesis model with the updated pre-training data.
In an exemplary embodiment, the first sample data may be merged into the pre-training data to obtain updated pre-training data. The text in the updated pre-training data then serves as input and the speech corresponding to that text as the target output, and the initial speech synthesis model is trained iteratively by continually adjusting its model parameters until the accuracy of its output meets a preset threshold, at which point training ends.
Step 205, during training of the initial speech synthesis model, extracting a trained speech synthesis model every preset number of steps as a first speech synthesis model.
Step 206, a target speech synthesis model is selected from the plurality of first speech synthesis models.
Step 207, performing fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model.
It can be understood that, while training the initial speech synthesis model with the updated pre-training data, its parameters, such as the connection weights of neurons in a neural network, may be updated many times, yielding a plurality of trained speech synthesis models.
For example, assume the updated pre-training data includes speech a with corresponding text a, speech b with text b, speech c with text c, and so on. When training the initial speech synthesis model with the updated pre-training data, text a is first input into the initial model to obtain output a'; the parameters of the initial model are then updated according to the difference between a' and speech a, giving trained speech synthesis model 1. Text b is then input into model 1 to obtain output b', and model 1's parameters are updated according to the difference between b' and speech b, giving model 2. Likewise, text c input into model 2 gives output c', and updating by the difference between c' and speech c gives model 3. By analogy, a plurality of trained speech synthesis models are obtained through successive parameter updates.
In the embodiment of the present application, the trained speech synthesis model obtained after each parameter update may be used as a first speech synthesis model. Alternatively, a trained speech synthesis model may be extracted every preset number of steps; for example, with a preset step count of 2, the models obtained after the first, third, fifth and subsequent alternate updates are extracted as first speech synthesis models.
The preset number of steps can be set arbitrarily as needed; the embodiment of the application does not limit it. A minimal sketch of this extraction follows.
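The sketch below assumes the snapshots collected during training, as in the loop sketched for the first embodiment, are held in a list; the function name is illustrative.

    def extract_first_models(snapshots, preset_steps=2):
        """Keep every `preset_steps`-th trained snapshot as a first speech
        synthesis model; preset_steps=2 keeps the models after the 1st, 3rd,
        5th, ... parameter updates, matching the example above."""
        return snapshots[::preset_steps]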
Further, after obtaining the plurality of first speech synthesis models, a target speech synthesis model can be selected from the plurality of first speech synthesis models, and then fine tuning training is performed on the target speech synthesis model by using second sample data, so that a trained speech synthesis model is obtained.
The specific implementation process and principle of the step 206-207 can refer to the description of the above embodiments, and are not described herein again.
Extracting a trained speech synthesis model every preset number of steps, instead of keeping the model obtained after every parameter update as a first speech synthesis model, reduces the number of first speech synthesis models. This in turn reduces the amount of data to be processed when selecting the target speech synthesis model from the plurality of first speech synthesis models, shortening both the selection time and the total time needed to obtain the trained speech synthesis model.
Step 208, the text to be synthesized is obtained.
Step 209, the text to be synthesized is input into the trained speech synthesis model, and the synthesized speech corresponding to the text to be synthesized is obtained.
It can be understood that the trained speech synthesis model adapts to the vocal characteristics of the specific target user, so inputting the text to be synthesized into it yields synthesized speech that corresponds to the text and is close to the target user's real speech. A minimal inference sketch follows.
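The sketch uses the same assumptions as the earlier training sketches; text_frontend is an assumed helper that turns raw text into the model's input ids.

    import torch

    def synthesize(trained_model, text_to_synthesize, text_frontend):
        """Run the trained speech synthesis model on a text to be synthesized."""
        trained_model.eval()
        with torch.no_grad():
            text_ids = text_frontend(text_to_synthesize)  # e.g. grapheme/phoneme ids
            return trained_model(text_ids)  # synthesized speech for the target user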
The training method of the speech synthesis model provided by the embodiment of the application first acquires user sample data, an initial speech synthesis model and corresponding pre-training data, the user sample data comprising a plurality of user voices and the text corresponding to each voice. It divides the user sample data into first sample data and second sample data, with more user voices in the first; merges the first sample data into the pre-training data; trains the initial speech synthesis model with the updated pre-training data, extracting a trained speech synthesis model every preset number of steps as a first speech synthesis model; selects a target speech synthesis model from the plurality of first speech synthesis models; performs fine-tuning training on it with the second sample data to obtain the trained speech synthesis model; and finally inputs the text to be synthesized into the trained model to obtain the corresponding synthesized speech. In this way, the speech synthesis model can be trained from only small-scale user sample data, high-quality synthesized speech corresponding to the text to be synthesized can be obtained through it, the speech synthesis process is fast and inexpensive, the user's need for personalized speech is met, and user experience is improved.
Through the above analysis, in the embodiment of the application a plurality of first speech synthesis models can be obtained first, a target speech synthesis model selected from them, and that model fine-tuned to obtain the trained speech synthesis model. The process of selecting the target speech synthesis model from the plurality of first speech synthesis models is further described below with reference to fig. 3.
Fig. 3 is a flowchart illustrating a method for training a speech synthesis model according to a third embodiment of the present application. As shown in fig. 3, the method for training a speech synthesis model may include the following steps:
Step 301, obtaining user sample data, an initial speech synthesis model and corresponding pre-training data, where the user sample data includes: a plurality of user voices, and text corresponding to each user voice.
Step 302, dividing the user sample data to obtain first sample data and second sample data, wherein the number of the user voices in the first sample data is greater than the number of the user voices in the second sample data.
Step 303, training the initial speech synthesis model by using the first sample data and the pre-training data, and acquiring a plurality of first speech synthesis models obtained by training in the training process.
The specific implementation process and principle of the steps 301-303 can refer to the description of the above embodiments, and are not described herein again.
Step 304, obtaining loss function values of a plurality of first speech synthesis models.
Step 305, selecting a second speech synthesis model with a corresponding loss function value within a preset numerical range from the plurality of first speech synthesis models.
The loss function value of a first speech synthesis model characterizes the degree of difference between the speech synthesis result it outputs for an input text and the real speech corresponding to that text. The larger the loss function value, the greater that difference; the smaller the loss function value, the smaller the difference.
The preset value range can be set as needed, for example around the loss function value desired for the final speech synthesis model.
For example, if the desired loss function value of the final speech synthesis model is L, the minimum of the preset range can be set to L-1 and the maximum to L+1. After obtaining the loss function values of the first speech synthesis models, those whose loss function values lie between L-1 and L+1 are selected as second speech synthesis models, as in the sketch below.
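A minimal sketch of this filtering step; models_with_losses is an assumed list of (model, loss) pairs, and the margin of 1 mirrors the example above.

    def select_second_models(models_with_losses, target_loss, margin=1.0):
        """Keep the first speech synthesis models whose loss function value lies
        in the preset range [target_loss - margin, target_loss + margin]."""
        return [model for model, loss in models_with_losses
                if target_loss - margin <= loss <= target_loss + margin]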
Step 306, obtaining the inference accuracy of each second speech synthesis model on the user data to be inferred.
Step 307, determining the second speech synthesis model with the highest inference accuracy as the target speech synthesis model.
Specifically, after selecting the second speech synthesis models whose loss function values lie within the preset range, three cases arise. If exactly one second speech synthesis model is selected, it is determined as the target speech synthesis model. If none is selected, the preset range may be adjusted until at least one first speech synthesis model falls within the adjusted range. If several are selected, the inference accuracy of each on the user data to be inferred is obtained, and the second speech synthesis model with the highest inference accuracy is determined as the target speech synthesis model.
Selecting the second speech synthesis models by the loss function values of the plurality of first speech synthesis models, and then determining the target model by inference accuracy on user data, ensures that the first speech synthesis model whose output is closest to the real speech of the specific target user is chosen as the target speech synthesis model, improving the accuracy of the selection.
The user data to be inferred may include a plurality of user voices to be inferred and the corresponding texts. It should be noted that the user voices to be inferred and the user sample data originate from the same user.
In an exemplary embodiment, for each second speech synthesis model, its inference accuracy on the user data to be inferred may be obtained through the following steps 306a-306c.
Step 306a, for each user voice to be inferred, inputting the corresponding text into the second speech synthesis model to obtain a speech inference result.
Step 306b, performing a dynamic time warping calculation on each user voice to be inferred and the corresponding speech inference result to obtain the distance between them.
Dynamic time warping (DTW) measures the similarity between two time sequences.
In the embodiment of the application, for each user voice to be inferred, the corresponding text is input into the second speech synthesis model; once the speech inference result is obtained, a dynamic time warping calculation between the user voice and that result yields the distance between them. For the dynamic time warping calculation itself, refer to the descriptions in the related art; it is not repeated here.
It can be understood that the greater the distance between a user voice to be inferred and the corresponding speech inference result, the lower their similarity; the smaller the distance, the higher the similarity.
Step 306c, summing and averaging the distances between the user voices to be inferred and the corresponding speech inference results, and determining the inference accuracy of the second speech synthesis model on the user data to be inferred from the result.
It can be understood that the distance between each user voice to be inferred and its speech inference result can be obtained as above, giving the distances for all of the user voices to be inferred; these distances can then be summed and averaged, and the inference accuracy of the second speech synthesis model determined from the result.
In an exemplary embodiment, the larger the averaged result, the lower the inference accuracy of the second speech synthesis model on the user data to be inferred can be considered; the smaller the averaged result, the higher its inference accuracy.
Using the dynamic time warping algorithm to obtain the distances between the user voices to be inferred and the corresponding speech inference results, and then averaging those distances, allows the inference accuracy of each second speech synthesis model to be determined accurately. A minimal sketch of steps 306a-306c follows.
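The sketch assumes each voice and inference result is a 1-D feature sequence (for instance a pitch or energy track) so the textbook DTW recursion applies; in practice a library DTW over frame-level acoustic features would serve the same purpose.

    import numpy as np

    def dtw_distance(a, b):
        """Classic dynamic time warping distance between two 1-D sequences."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return float(cost[n, m])

    def inference_accuracy(model, pairs):
        """Sum-average the DTW distances between the real voices to be inferred
        and the model's inference results; smaller means more accurate."""
        distances = [dtw_distance(voice, model(text)) for voice, text in pairs]
        return sum(distances) / len(distances)

    # Step 307: the target model is the second model with the smallest average distance
    # target_model = min(second_models, key=lambda m: inference_accuracy(m, pairs))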
Step 308, performing fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model.
For a specific implementation process and principle of the step 308, reference may be made to the description of the foregoing embodiments, and details are not described here.
The training method of the speech synthesis model provided by the embodiment of the application first acquires user sample data, an initial speech synthesis model and corresponding pre-training data, the user sample data comprising a plurality of user voices and the text corresponding to each voice. It divides the user sample data into first sample data and second sample data, with more user voices in the first; trains the initial speech synthesis model with the first sample data and the pre-training data, collecting a plurality of first speech synthesis models during training; obtains the loss function values of these models and selects those within the preset range as second speech synthesis models; obtains the inference accuracy of each second speech synthesis model on the user data to be inferred and determines the one with the highest accuracy as the target speech synthesis model; and finally performs fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model. In this way, the speech synthesis model can be trained from only small-scale user sample data, high-quality synthesized speech corresponding to the text to be synthesized can be obtained through it, the speech synthesis process is fast and inexpensive, the user's need for personalized speech is met, and user experience is improved.
The following describes a speech synthesis model training apparatus provided in the present application with reference to fig. 4.
Fig. 4 is a schematic structural diagram of a training apparatus for a speech synthesis model according to a fourth embodiment of the present application.
As shown in fig. 4, the present application provides a training apparatus 400 for a speech synthesis model, including: a first obtaining module 401, a dividing module 402, a first training module 403, a selecting module 404, and a second training module 405.
The first obtaining module 401 is configured to obtain user sample data, an initial speech synthesis model, and corresponding pre-training data, where the user sample data includes: a plurality of user voices and a text corresponding to each user voice;
a dividing module 402, configured to divide user sample data to obtain first sample data and second sample data, where the number of user voices in the first sample data is greater than the number of user voices in the second sample data;
a first training module 403, configured to train an initial speech synthesis model by using the first sample data and the pre-training data, and obtain a plurality of first speech synthesis models obtained through training in a training process;
a selection module 404 for selecting a target speech synthesis model from a plurality of first speech synthesis models; and
a second training module 405, configured to perform fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model. A sketch of how these modules compose follows.
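As a rough illustration of how the five modules of the training apparatus compose, here is a sketch that assumes each module is a plain callable; it mirrors steps 101-105, and all names are illustrative.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class SpeechSynthesisTrainer:
        """Illustrative composition of the five modules of the training apparatus."""
        first_obtaining_module: Callable   # -> (user_samples, initial_model, pretrain_data)
        dividing_module: Callable          # user_samples -> (first_data, second_data)
        first_training_module: Callable    # (model, first_data, pretrain_data) -> first_models
        selection_module: Callable         # first_models -> target_model
        second_training_module: Callable   # (target_model, second_data) -> trained_model

        def train(self):
            samples, model, pretrain = self.first_obtaining_module()
            first_data, second_data = self.dividing_module(samples)
            first_models = self.first_training_module(model, first_data, pretrain)
            target_model = self.selection_module(first_models)
            return self.second_training_module(target_model, second_data)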
It should be noted that the training apparatus provided in this embodiment may execute the training method of the speech synthesis model in the foregoing embodiments. The training apparatus may be an electronic device, or may be configured in an electronic device, so that a speech synthesis model capable of outputting high-quality speech synthesis results can be trained with only a small amount of user sample data.
The electronic device may be any stationary or mobile computing device capable of data processing, for example a mobile computing device such as a laptop, smartphone or wearable device, a stationary computing device such as a desktop computer, a server, or another type of computing device. The training apparatus may be an application installed in the electronic device for training the speech synthesis model, or a web page or application used by the managers or developers of that application to manage and maintain it; the present application does not limit this.
It should be noted that the foregoing description of the embodiment of the method for training a speech synthesis model is also applicable to the apparatus for training a speech synthesis model provided in the present application, and is not repeated herein.
The training apparatus of the speech synthesis model provided by the embodiment of the application first acquires user sample data, an initial speech synthesis model and corresponding pre-training data, the user sample data comprising a plurality of user voices and the text corresponding to each voice. It then divides the user sample data into first sample data and second sample data, with more user voices in the first than in the second; trains the initial speech synthesis model with the first sample data and the pre-training data, collecting a plurality of first speech synthesis models during training; selects a target speech synthesis model from them; and performs fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model. In this way, a speech synthesis model capable of outputting high-quality speech synthesis results can be trained from only small-scale user sample data, and the speech synthesis process is fast and inexpensive.
The following describes a speech synthesis model training apparatus provided in the present application with reference to fig. 5.
Fig. 5 is a schematic structural diagram of a training apparatus for a speech synthesis model according to a fifth embodiment of the present application.
As shown in fig. 5, the training apparatus 500 for a speech synthesis model may specifically include a first obtaining module 501, a dividing module 502, a first training module 503, a selecting module 504, and a second training module 505, where 501 to 505 in fig. 5 have the same functions and structures as 401 to 405 in fig. 4.
In an exemplary embodiment, as shown in fig. 5, the first training module 503 may specifically include: an updating unit 5031, a training unit 5032, and an extracting unit 5033.
Specifically, within the first training module 503:
an updating unit 5031, configured to merge the first sample data into the pre-training data to obtain updated pre-training data;
a training unit 5032, configured to train the initial speech synthesis model with the updated pre-training data;
an extracting unit 5033, configured to extract, during training of the initial speech synthesis model, a trained speech synthesis model every preset number of steps as a first speech synthesis model.
In an exemplary embodiment, as shown in fig. 5, the selecting module 504 may include: a first obtaining unit 5041, a selecting unit 5042, a second obtaining unit 5043, and a determining unit 5044.
The first obtaining unit 5041 is configured to obtain loss function values of a plurality of first speech synthesis models;
a selecting unit 5042, configured to select a second speech synthesis model with a corresponding loss function value within a preset numerical range from the plurality of first speech synthesis models;
a second obtaining unit 5043, configured to obtain the inference accuracy of the second speech synthesis model on the user data to be inferred; and
a determining unit 5044, configured to determine the second speech synthesis model with the highest inference accuracy as the target speech synthesis model.
In an exemplary embodiment, the user data to be inferred includes a plurality of user voices to be inferred and the corresponding texts; accordingly, the second obtaining unit 5043 may include an acquisition subunit, a first calculation subunit and a second calculation subunit.
The acquisition subunit is configured to, for each user voice to be inferred, input the corresponding text into the second speech synthesis model and obtain a speech inference result;
the first calculation subunit is configured to perform a dynamic time warping calculation on each user voice to be inferred and the corresponding speech inference result, obtaining the distance between them; and
the second calculation subunit is configured to sum and average the distances between the user voices to be inferred and the corresponding speech inference results, and to determine the inference accuracy of the second speech synthesis model on the user data to be inferred from the result.
In an exemplary embodiment, as shown in fig. 5, the training apparatus for a speech synthesis model may further include a second obtaining module 506 and a third obtaining module 507.
The second obtaining module 506 is configured to obtain a text to be synthesized;
the third obtaining module 507 is configured to input the text to be synthesized into the trained speech synthesis model, and obtain a synthesized speech corresponding to the text to be synthesized.
It should be noted that the foregoing description of the embodiment of the method for training a speech synthesis model is also applicable to the apparatus for training a speech synthesis model provided in the present application, and is not repeated herein.
The training apparatus of the speech synthesis model provided by the embodiment of the application first acquires user sample data, an initial speech synthesis model and corresponding pre-training data, the user sample data comprising a plurality of user voices and the text corresponding to each voice. It then divides the user sample data into first sample data and second sample data, with more user voices in the first than in the second; trains the initial speech synthesis model with the first sample data and the pre-training data, collecting a plurality of first speech synthesis models during training; selects a target speech synthesis model from them; and performs fine-tuning training on the target speech synthesis model with the second sample data to obtain the trained speech synthesis model. In this way, a speech synthesis model capable of outputting high-quality speech synthesis results can be trained from only small-scale user sample data, and the speech synthesis process is fast and inexpensive.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for a method of training a speech synthesis model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected by different buses and may be mounted on a common motherboard or in other ways as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of training a speech synthesis model provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method of training a speech synthesis model provided herein.
The memory 602, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the training method of the speech synthesis model in the embodiment of the present application (for example, the first obtaining module 401, the dividing module 402, the first training module 403, the selecting module 404, and the second training module 405 shown in fig. 4). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, namely, implements the training method of the speech synthesis model in the above method embodiment.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required by at least one function, and the storage data area may store data created according to the use of the electronic device for training the speech synthesis model, and the like. Further, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memories located remotely from the processor 601, and these remote memories may be connected over a network to the electronic device for training the speech synthesis model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the training method of the speech synthesis model may further include an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or by other means; fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for training the speech synthesis model; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of conventional physical hosts and VPS (Virtual Private Server) services, namely high management difficulty and weak service scalability. The server may also be a server of a distributed system, or a server incorporating a blockchain.
According to an embodiment of the present application, there is also provided a computer program product, including a computer program, which, when executed by a processor, implements the training method of the speech synthesis model of the embodiments of the present application.
It should be noted that artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it encompasses both hardware and software technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.
According to the technical solution of the embodiments of the application, a speech synthesis model capable of outputting high-quality speech synthesis results can be trained with only a small amount of user sample data, so that the speech synthesis process is fast and inexpensive.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the present application is not limited in this regard, as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (13)

1. A method of training a speech synthesis model, comprising:
acquiring user sample data, an initial speech synthesis model and corresponding pre-training data, wherein the user sample data comprises: a plurality of user voices and a text corresponding to each user voice;
dividing the user sample data to obtain first sample data and second sample data, wherein the number of user voices in the first sample data is larger than that in the second sample data;
training the initial speech synthesis model by using the first sample data and the pre-training data, and acquiring a plurality of first speech synthesis models obtained during the training process;
selecting a target speech synthesis model from the plurality of first speech synthesis models; and
performing fine-tuning training on the target speech synthesis model by adopting the second sample data to obtain a trained speech synthesis model.
2. The method for training a speech synthesis model according to claim 1, wherein the training of the initial speech synthesis model using the first sample data and the pre-training data, and the obtaining of a plurality of first speech synthesis models during the training process, comprise:
adding the first sample data to the pre-training data to obtain updated pre-training data;
training the initial speech synthesis model by using the updated pre-training data;
during the training of the initial speech synthesis model, extracting the speech synthesis model obtained by training as a first speech synthesis model every preset number of steps.
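As an editorial illustration of claim 2's checkpointing scheme, the following hedged PyTorch-style sketch saves a frozen copy of the model every preset number of optimization steps; model, optimizer, and loader are assumed to exist, and mean squared error stands in for whatever loss the actual acoustic model uses.

import copy
import torch

def pretrain_and_snapshot(model, optimizer, loader, preset_steps=500):
    first_models = []  # the claimed "plurality of first speech synthesis models"
    model.train()
    for step, (text_batch, speech_batch) in enumerate(loader, start=1):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(text_batch), speech_batch)
        loss.backward()
        optimizer.step()
        # Every preset_steps steps, keep a frozen copy as one candidate model.
        if step % preset_steps == 0:
            first_models.append(copy.deepcopy(model).eval())
    return first_models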
3. The method of training a speech synthesis model according to claim 1, wherein said selecting a target speech synthesis model from the plurality of first speech synthesis models comprises:
obtaining loss function values of the plurality of first speech synthesis models;
selecting, from the plurality of first speech synthesis models, second speech synthesis models whose corresponding loss function values are within a preset numerical range;
acquiring the inference accuracy of each second speech synthesis model on user data to be inferred; and
determining the second speech synthesis model corresponding to the maximum inference accuracy among the inference accuracies as the target speech synthesis model.
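The two-stage selection in claim 3 can be sketched as below; loss_of and inference_accuracy are hypothetical caller-supplied callables (an inference_accuracy sketch follows claim 4), and the loss range [loss_lo, loss_hi] is an arbitrary illustrative preset.

def select_target_model(first_models, eval_data, loss_of, inference_accuracy,
                        loss_lo=0.0, loss_hi=0.5):
    # Stage 1: "second" models are candidates whose loss function value falls
    # within the preset numerical range.
    second_models = [m for m in first_models
                     if loss_lo <= loss_of(m, eval_data) <= loss_hi]
    # Stage 2: among those, the model with the highest inference accuracy on
    # the user data to be inferred becomes the target model.
    return max(second_models, key=lambda m: inference_accuracy(m, eval_data))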
4. The method of training a speech synthesis model according to claim 3, wherein the user data to be inferred comprises: a plurality of user voices to be inferred and corresponding texts;
the acquiring of the inference accuracy of the second speech synthesis model on the user data to be inferred comprises:
inputting, for each user voice to be inferred, the corresponding text into the second speech synthesis model to obtain a speech inference result;
performing dynamic time warping calculation on each user voice to be inferred and the corresponding speech inference result to obtain the distance between the user voice to be inferred and the corresponding speech inference result; and
averaging the distances between the plurality of user voices to be inferred and the corresponding speech inference results, and determining the inference accuracy of the second speech synthesis model on the user data to be inferred according to the calculation result.
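A minimal NumPy sketch of claim 4's measure, assuming frame-level feature sequences (e.g. mel spectrogram frames); synthesize and features are hypothetical helpers for running the model and extracting features, and the final sign flip is one simple way to turn "smaller average DTW distance" into "higher accuracy".

import numpy as np

def dtw_distance(a, b):
    # Plain O(len(a) * len(b)) dynamic time warping over frame vectors.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return float(cost[n, m])

def inference_accuracy(model, eval_pairs, synthesize, features):
    # Average DTW distance between each real user voice and the synthesized
    # version of its text, negated so larger values mean better accuracy.
    distances = [dtw_distance(features(speech),
                              features(synthesize(model, text)))
                 for speech, text in eval_pairs]
    return -float(np.mean(distances))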
5. The method for training a speech synthesis model according to claim 1, wherein after performing fine tuning training on the target speech synthesis model using the second sample data to obtain a trained speech synthesis model, the method further comprises:
acquiring a text to be synthesized; and
inputting the text to be synthesized into the trained speech synthesis model, and acquiring the synthesized speech corresponding to the text to be synthesized.
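Once fine-tuning is complete, inference per claim 5 is a single forward pass. A usage sketch, assuming a hypothetical load_trained_model helper and a separate vocoder (neither is specified by the claims):

model = load_trained_model("speech_synthesis_model.pt")  # hypothetical helper
mel = model("The text to be synthesized.")               # text -> acoustic features
audio = vocoder(mel)                                     # features -> waveform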
6. An apparatus for training a speech synthesis model, comprising:
a first obtaining module, configured to obtain user sample data, an initial speech synthesis model, and corresponding pre-training data, where the user sample data includes: a plurality of user voices and a text corresponding to each user voice;
the dividing module is used for dividing the user sample data to obtain first sample data and second sample data, wherein the number of user voices in the first sample data is larger than the number of user voices in the second sample data;
the first training module is used for training the initial speech synthesis model by adopting the first sample data and the pre-training data and acquiring a plurality of first speech synthesis models obtained during the training process;
a selection module for selecting a target speech synthesis model from the plurality of first speech synthesis models; and
the second training module is used for performing fine-tuning training on the target speech synthesis model by adopting the second sample data to obtain a trained speech synthesis model.
7. The training apparatus of a speech synthesis model according to claim 6, wherein the first training module comprises:
an updating unit, configured to add the first sample data to the pre-training data to obtain updated pre-training data;
a training unit, configured to train the initial speech synthesis model by using the updated pre-training data; and
an extracting unit, configured to extract, during the training of the initial speech synthesis model, the trained speech synthesis model as a first speech synthesis model every preset number of steps.
8. The apparatus for training a speech synthesis model according to claim 6, wherein the selection module comprises:
a first obtaining unit configured to obtain loss function values of the plurality of first speech synthesis models;
the selection unit is used for selecting a second speech synthesis model of which the corresponding loss function value is within a preset numerical range from the plurality of first speech synthesis models;
the second acquisition unit is used for acquiring the inference accuracy of the second speech synthesis model on the user data to be inferred; and
a determining unit, configured to determine a second speech synthesis model corresponding to a maximum inference accuracy among the inference accuracies as the target speech synthesis model.
9. The apparatus for training a speech synthesis model according to claim 8, wherein the user data to be inferred comprises: a plurality of user voices to be inferred and corresponding texts;
the second acquisition unit includes:
the obtaining subunit is configured to, for each user voice to be inferred, input the corresponding text into the second speech synthesis model to obtain a speech inference result;
the first calculating subunit is configured to perform dynamic time warping calculation on the user voice to be inferred and the corresponding speech inference result to obtain the distance between the user voice to be inferred and the corresponding speech inference result; and
the second calculating subunit is configured to average the distances between the plurality of user voices to be inferred and the corresponding speech inference results, and to determine the inference accuracy of the second speech synthesis model on the user data to be inferred according to the calculation result.
10. The apparatus for training a speech synthesis model according to claim 6, further comprising:
the second acquisition module is used for acquiring a text to be synthesized; and
the third acquisition module is used for inputting the text to be synthesized into the trained speech synthesis model and acquiring the synthesized speech corresponding to the text to be synthesized.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program, wherein the computer program realizes the steps of the method of any one of claims 1-5 when executed by a processor.
CN202011364603.1A 2020-11-27 2020-11-27 Method, device and equipment for training speech synthesis model and storage medium Active CN112365876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011364603.1A CN112365876B (en) 2020-11-27 2020-11-27 Method, device and equipment for training speech synthesis model and storage medium


Publications (2)

Publication Number Publication Date
CN112365876A true CN112365876A (en) 2021-02-12
CN112365876B CN112365876B (en) 2022-04-12

Family

ID=74535520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011364603.1A Active CN112365876B (en) 2020-11-27 2020-11-27 Method, device and equipment for training speech synthesis model and storage medium

Country Status (1)

Country Link
CN (1) CN112365876B (en)



Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
CN105206264A (en) * 2015-09-22 2015-12-30 百度在线网络技术(北京)有限公司 Speech synthesis method and device
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN108010030A (en) * 2018-01-24 2018-05-08 福州大学 A kind of Aerial Images insulator real-time detection method based on deep learning
CN109345529A (en) * 2018-09-30 2019-02-15 福州大学 Based on the secondary target detection network wire clamp of modified, grading ring fault recognition method
CN109389322A (en) * 2018-10-30 2019-02-26 福州大学 The disconnected broken lot recognition methods of grounded-line based on target detection and long memory models in short-term
US20200175354A1 (en) * 2018-12-03 2020-06-04 Deep Learn, Inc. Time and accuracy estimate-based selection of machine-learning predictive models
CN109961777A (en) * 2019-02-16 2019-07-02 天津大学 A kind of voice interactive method based on intelligent robot
CN110133639A (en) * 2019-04-08 2019-08-16 长安大学 A kind of transmission rod detection of construction quality method
CN110211562A (en) * 2019-06-05 2019-09-06 深圳前海达闼云端智能科技有限公司 A kind of method of speech synthesis, electronic equipment and readable storage medium storing program for executing
CN110379407A (en) * 2019-07-22 2019-10-25 出门问问(苏州)信息科技有限公司 Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment
CN110569762A (en) * 2019-08-27 2019-12-13 许昌许继软件技术有限公司 pin falling detection method and device based on multistage neural network
CN110765964A (en) * 2019-10-30 2020-02-07 常熟理工学院 Method for detecting abnormal behaviors in elevator car based on computer vision
CN111009233A (en) * 2019-11-20 2020-04-14 泰康保险集团股份有限公司 Voice processing method and device, electronic equipment and storage medium
CN110889450A (en) * 2019-11-27 2020-03-17 腾讯科技(深圳)有限公司 Method and device for super-parameter tuning and model building
CN111046679A (en) * 2020-03-13 2020-04-21 腾讯科技(深圳)有限公司 Quality information acquisition method and device of translation model and computer equipment
CN111540345A (en) * 2020-05-09 2020-08-14 北京大牛儿科技发展有限公司 Weakly supervised speech recognition model training method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEI Ming et al.: "Minimum generation error training method for speech synthesis models based on perceptually weighted line spectral pair distance", Pattern Recognition and Artificial Intelligence *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053352A (en) * 2021-03-09 2021-06-29 深圳软银思创科技有限公司 Voice synthesis method, device, equipment and storage medium based on big data platform
CN112800748A (en) * 2021-03-30 2021-05-14 平安科技(深圳)有限公司 Phoneme prediction method, device and equipment suitable for polyphone and storage medium
CN113345410A (en) * 2021-05-11 2021-09-03 科大讯飞股份有限公司 Training method of general speech and target speech synthesis model and related device
CN113299269A (en) * 2021-05-20 2021-08-24 平安科技(深圳)有限公司 Training method and device of voice synthesis system, computer equipment and storage medium
CN113299269B (en) * 2021-05-20 2023-12-29 平安科技(深圳)有限公司 Training method and device for voice synthesis system, computer equipment and storage medium
CN113689844A (en) * 2021-07-22 2021-11-23 北京百度网讯科技有限公司 Method, device, equipment and storage medium for determining speech synthesis model
CN113689844B (en) * 2021-07-22 2022-05-27 北京百度网讯科技有限公司 Method, device, equipment and storage medium for determining speech synthesis model
CN113838450A (en) * 2021-08-11 2021-12-24 北京百度网讯科技有限公司 Audio synthesis and corresponding model training method, device, equipment and storage medium
WO2023231596A1 (en) * 2022-06-01 2023-12-07 腾讯科技(深圳)有限公司 Voice conversion model training method and apparatus, and voice conversion method and apparatus

Also Published As

Publication number Publication date
CN112365876B (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN112365876B (en) Method, device and equipment for training speech synthesis model and storage medium
CN107481717B (en) Acoustic model training method and system
CN111639710A (en) Image recognition model training method, device, equipment and storage medium
CN111539514A (en) Method and apparatus for generating structure of neural network
CN110795569B (en) Method, device and equipment for generating vector representation of knowledge graph
CN112037760A (en) Training method and device of voice spectrum generation model and electronic equipment
US11182447B2 (en) Customized display of emotionally filtered social media content
CN112530437A (en) Semantic recognition method, device, equipment and storage medium
CN112614478B (en) Audio training data processing method, device, equipment and storage medium
CN111862987B (en) Speech recognition method and device
CN112599141B (en) Neural network vocoder training method and device, electronic equipment and storage medium
CN111477251A (en) Model evaluation method and device and electronic equipment
KR20220116395A (en) Method and apparatus for determining pre-training model, electronic device and storage medium
CN112365875A (en) Voice synthesis method, device, vocoder and electronic equipment
JP7267379B2 (en) Image processing method, pre-trained model training method, device and electronic equipment
US10650803B2 (en) Mapping between speech signal and transcript
US11450111B2 (en) Deterministic learning video scene detection
CN111523467A (en) Face tracking method and device
CN112650844A (en) Tracking method and device of conversation state, electronic equipment and storage medium
CN111832291A (en) Entity recognition model generation method and device, electronic equipment and storage medium
CN112686381A (en) Neural network model, method, electronic device, and readable medium
CN112085103B (en) Data enhancement method, device, equipment and storage medium based on historical behaviors
CN113269213B (en) Training set acquisition method and device and electronic equipment
CN112581933A (en) Speech synthesis model acquisition method and device, electronic equipment and storage medium
CN111767988A (en) Neural network fusion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant