CN113160791A - Voice synthesis method and device, electronic equipment and storage medium - Google Patents

Voice synthesis method and device, electronic equipment and storage medium

Info

Publication number
CN113160791A
CN113160791A (application CN202110496900.XA)
Authority
CN
China
Prior art keywords
voice
speech
user
model
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110496900.XA
Other languages
Chinese (zh)
Inventor
刘树勇
吴俊仪
蔡玉玉
张政臣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN202110496900.XA
Publication of CN113160791A
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 — Architecture of speech synthesisers

Abstract

The invention provides a voice synthesis method and device, electronic equipment and a storage medium, wherein the method comprises: acquiring a service request of a user, the service request comprising a target text of the speech to be synthesized; determining, according to the service request and pre-stored routing information, the speech synthesis server in a speech synthesis service cluster that stores the voice model corresponding to the user; forwarding the service request to the speech synthesis server so that the speech synthesis server synthesizes the target text into speech; and returning the speech to the user. The method stores the user's dedicated voice model in a designated speech synthesis server in advance; when the user submits a speech synthesis service request, the corresponding speech synthesis server is located directly through the routing information and serves the user, so that speech synthesis can be provided for an individual user at low production cost and with a simple invocation process.

Description

Voice synthesis method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
In recent years, artificial intelligence technology has developed rapidly, and technological progress has redefined the process of human-computer interaction: based on existing speech recognition and speech synthesis technology, the means of interaction has shifted from general mechanical instruction input to voice control. During human-computer interaction, speech synthesis technology converts the generated response text into speech for playback, and the same text can be synthesized into different audio with different types of timbre models, which adds considerable interest to speech synthesis.
At present, the timbres of existing speech synthesis systems are based on deep learning technology: supervised training on a large amount of pre-recorded audio from the same speaker imitates the speaker's intonation, phonemes, articulation characteristics and other elements, achieving the effect of replicating the timbre. Some audiobooks are produced by speech synthesis, where a chosen timbre converts the text of the book into speech for playback, replacing the traditional way of reading and easing the visual burden on the user.
Although existing speech synthesis methods are mature, limitations of the prior art mean that it is costly for an ordinary user to create a dedicated timbre on their own, the invocation process is cumbersome, and the server cannot provide a speech synthesis service for an individual user.
Disclosure of Invention
The invention provides a voice synthesis method and a voice synthesis device, which are used to overcome the defects in the prior art that creating a dedicated timbre is costly for an ordinary user and the invocation process is cumbersome, and to achieve the provision of a speech synthesis service for an individual user.
The invention provides a speech synthesis method, which comprises the following steps:
acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized;
determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information;
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
According to the voice synthesis method provided by the invention, the service request comprises a user name and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
determining a speech synthesis server storing a speech model corresponding to the user in a speech synthesis service cluster according to the service request and pre-stored routing information, comprising:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
According to a speech synthesis method provided by the present invention, the method further comprises:
acquiring audio data of a user, and uploading the audio data to a third-party file cluster;
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
According to a speech synthesis method provided by the present invention, the training of the speech model corresponding to the user according to the audio data and storing the speech model in a speech synthesis server, and generating corresponding routing information includes:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
According to a speech synthesis method provided by the present invention, the selecting a speech synthesis server in a speech synthesis service cluster comprises:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
The present invention also provides a speech synthesis apparatus comprising:
an acquisition service request module, configured to acquire a service request of a user, wherein the service request comprises a target text of the speech to be synthesized;
the determining server module is used for determining a voice synthesis server for storing the voice model corresponding to the user in a voice synthesis service cluster according to the service request and the pre-stored routing information;
a speech synthesis module to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
According to a speech synthesis apparatus provided by the present invention, the service request includes a user name and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module is further to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
According to a speech synthesis apparatus provided by the present invention, the apparatus further comprises:
the data acquisition module is used for acquiring audio data of a user and uploading the audio data to a third-party file cluster;
a model training and storage module for:
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
According to the speech synthesis apparatus provided by the present invention, the model training and storing module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
According to the speech synthesis apparatus provided by the present invention, the model training and storing module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the speech synthesis method as described in any one of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the speech synthesis method as described in any one of the above.
With the voice synthesis method and the voice synthesis device provided by the invention, the user's dedicated voice model is stored in a designated speech synthesis server in advance, and routing information is generated on the basis of the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, so that speech synthesis can be offered to an individual user at low production cost and with a simple invocation process.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a speech synthesis method provided by the present invention;
FIG. 2 is a schematic flow chart of a speech synthesis server for determining a speech model corresponding to a stored user according to the present invention;
FIG. 3 is a schematic flow chart of the present invention for training a specific speech model based on audio data of a user;
FIG. 4 is a schematic flow chart illustrating a process of storing a speech model corresponding to a trained user according to audio data in a speech synthesis server and generating corresponding routing information according to the present invention;
FIG. 5 is a schematic flow chart of selecting a speech synthesis server in a speech synthesis service cluster according to the present invention;
FIG. 6 is a schematic structural diagram of a speech synthesis apparatus provided in the present invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a speech synthesis method provided by the present invention, as shown in fig. 1, the method includes:
step 110, a service request of a user is obtained, wherein the service request comprises a target text of the voice to be synthesized.
The user registers and logs in to the server through client software to obtain the relevant system permissions. If the user's voice model has already been stored in the system, the user only needs to upload the target text of the speech to be synthesized after logging in, or select a common text pre-stored in the system for speech synthesis.
Step 120, determining a speech synthesis server storing the speech model corresponding to the user in a speech synthesis service cluster according to the service request and the pre-stored routing information.
The server stores, in the routing gateway, the routing information of the speech synthesis server that holds the specific user's voice model; when the routing gateway receives the user's service request, it can look up the routing information of the speech synthesis server corresponding to the current user in the routing table.
The invention uses dedicated-gateway routing in place of the database query technique commonly used in the prior art, so that the user request is quickly directed to the designated server.
Step 130, forwarding the service request to the speech synthesis server, so that the speech synthesis server synthesizes the target text into speech.
Step 140, returning the speech to the user.
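By way of illustration only, the following Python sketch shows how a routing gateway implementing steps 110 to 140 might handle a request: it resolves the speech synthesis server from the voice model ID carried in the request, forwards the target text to that server, and returns the synthesized audio. The names ServiceRequest, forward_request and handle_service_request, and the use of in-memory dictionaries for the routing table and server addresses, are assumptions of this sketch and are not structures defined by the invention.

    from dataclasses import dataclass

    # Hypothetical request type; the patent does not define concrete data structures.
    @dataclass
    class ServiceRequest:
        user_name: str
        voice_model_id: str
        target_text: str

    def forward_request(server_address: str, request: ServiceRequest) -> bytes:
        """Placeholder for the actual RPC/HTTP call to the chosen synthesis server."""
        raise NotImplementedError("transport is deployment-specific")

    def handle_service_request(request: ServiceRequest,
                               routing_table: dict,
                               server_addresses: dict) -> bytes:
        """Gateway-side handling of one synthesis request (steps 110-140, illustrative only)."""
        # Step 120: resolve the synthesis server that stores this user's voice model.
        server_id = routing_table[request.voice_model_id]   # voice model ID -> server ID
        address = server_addresses[server_id]                # server ID -> network address
        # Step 130: forward the request so the selected server synthesizes the target text.
        audio = forward_request(address, request)
        # Step 140: return the synthesized speech to the user.
        return audio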
With the voice synthesis method provided by the invention, the user's dedicated voice model is stored in a designated speech synthesis server in advance, and routing information is generated on the basis of the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, so that speech synthesis can be offered to an individual user at low production cost and with a simple invocation process.
Fig. 2 is a schematic flowchart of a process for determining a speech synthesis server storing a speech model corresponding to a user according to an embodiment of the present invention, as shown in fig. 2, including:
the service request includes a username and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
step 210, searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID.
Step 220, determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the voice synthesis server ID.
The server stores, in the routing gateway, the routing information of the speech synthesis server that holds the specific user's voice model; when the routing gateway receives the user's service request, it can look up the routing information of the speech synthesis server corresponding to the current user in the routing table.
It should be noted that a user may have multiple voice models. When the user submits a speech synthesis service request, the server needs to know which specific voice model the user has selected; since the voice model ID is the unique identifier of a voice model, the routing information should be composed from the voice model ID so that the storage location of the model selected by the current user, that is, the server providing the speech synthesis service, can be determined quickly.
The invention uses a routing table to store the mapping between voice models and speech synthesis servers, so that a user's service request can be quickly directed to the designated server, improving service efficiency.
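A minimal sketch of the lookup described above, assuming the routing table is held in memory at the gateway and keyed by the voice model ID, so that a user who owns several voice models is still routed to the server holding the model selected for the current request. The table contents and the function name are illustrative only.

    # Illustrative in-memory routing table held by the gateway: voice model ID -> server ID.
    routing_table = {
        "model-alice-01": "tts-server-3",
        "model-alice-02": "tts-server-1",   # the same user may own several voice models
        "model-bob-01": "tts-server-2",
    }

    def find_synthesis_server(voice_model_id: str) -> str:
        """Steps 210-220: look up the server ID recorded for the requested voice model."""
        server_id = routing_table.get(voice_model_id)
        if server_id is None:
            raise KeyError(f"no routing entry for voice model {voice_model_id}")
        return server_id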
FIG. 3 is a flowchart illustrating a process of training a specific speech model based on audio data of a user according to an embodiment of the present invention, as shown in FIG. 3, including:
and step 310, acquiring the audio data of the user, and uploading the audio data to a third-party file cluster.
The user registers and logs in to the server through client software to obtain the relevant system permissions. If the user's voice model has already been stored in the system, the user only needs to upload the target text of the speech to be synthesized after logging in, or select a common text pre-stored in the system for speech synthesis. If the system does not yet store a voice model for the current user, the user needs to record audio through the client software after logging in; the collected audio data are uploaded to the server's audio management service, which uploads them to a third-party file cluster such as OSS for storage and backup.
In addition, the invention prescribes the script used when the user records the audio data, so that the user does not need to spend a long time recording training corpus while the distinctive characteristics of the user's voice can still be extracted. This improves the similarity of the customized timbre, and because the amount of training corpus is small, training time is greatly shortened, raising the efficiency of the whole speech synthesis training process.
Step 320, downloading the audio data from the third party file cluster.
The audio management service issues an asynchronous command notifying the deep machine learning service to train a voice model for the user; upon receiving the command, the deep machine learning service downloads the audio data from the third-party file cluster in preparation for model training.
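One way this asynchronous hand-off could look is sketched below in Python, using the standard queue module as a stand-in for whatever message middleware a real deployment would use; download_from_file_cluster is a placeholder for the actual OSS download call, and all names are assumptions of this sketch.

    import queue
    import threading

    # Stand-in for the message channel between the audio management service and the
    # deep machine learning (training) service; a real deployment would use middleware.
    training_tasks = queue.Queue()

    def notify_training_service(user_id: str, audio_keys: list) -> None:
        """Audio-management side: issue the asynchronous training command and return immediately."""
        training_tasks.put({"user_id": user_id, "audio_keys": audio_keys})

    def download_from_file_cluster(key: str) -> bytes:
        """Placeholder for fetching one audio object from the third-party file cluster (e.g. OSS)."""
        raise NotImplementedError

    def training_worker() -> None:
        """Training-service side: receive the command and fetch the audio to prepare for training."""
        while True:
            task = training_tasks.get()
            audio_files = [download_from_file_cluster(k) for k in task["audio_keys"]]
            # ... voice model training would start here (see the sketch after step 410) ...
            training_tasks.task_done()

    threading.Thread(target=training_worker, daemon=True).start()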
Step 330, training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
The trained voice model captures the distinctive characteristics of the user's timbre. The model is stored in the designated speech synthesis server and, when the user's speech synthesis service request is received, converts the target text into speech carrying the user's timbre characteristics.
Fig. 4 is a schematic flowchart illustrating a process of training a speech model corresponding to a user according to audio data, storing the speech model in a speech synthesis server, and generating corresponding routing information, according to an embodiment of the present invention, as shown in fig. 4, including:
and step 410, training a voice model corresponding to the user according to the audio data based on a neural network model.
Based on a neural network model, the deep machine learning service extracts the user's distinctive characteristics from the small amount of audio recorded by the user and combines them with a voice model pre-trained on general data to obtain a model dedicated to that specific user.
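The patent does not specify a network architecture or training procedure, so the following is only a schematic of the adaptation idea described above: extract speaker-specific features from the small recording set and adapt a model pre-trained on general data. Every function here (load_pretrained_tts, extract_speaker_embedding, finetune) is a hypothetical placeholder, not the invention's implementation.

    def load_pretrained_tts():
        """Placeholder: a synthesis model trained offline on general-purpose data."""
        raise NotImplementedError

    def extract_speaker_embedding(audio_files):
        """Placeholder: derive the user's timbre features from the short recordings."""
        raise NotImplementedError

    def finetune(base_model, speaker_embedding, audio_files, transcripts):
        """Placeholder: adapt the base model to the user's voice."""
        raise NotImplementedError

    def build_user_voice_model(audio_files, transcripts):
        """Schematic of step 410: adapt a pre-trained model to one user's small recording set."""
        base_model = load_pretrained_tts()
        speaker_embedding = extract_speaker_embedding(audio_files)
        # A small scripted corpus suffices because only the speaker-specific part is adapted.
        return finetune(base_model, speaker_embedding, audio_files, transcripts)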
Step 420, uploading the voice model to the third-party file cluster.
The deep machine learning service uploads the trained dedicated voice model to the third-party file cluster.
Step 430, a speech synthesis server is selected in the speech synthesis service cluster.
Step 440, downloading the speech model from the third-party file cluster, and storing the speech model in the speech synthesis server.
The deep machine learning service sends an asynchronous message notifying the speech synthesis management service that training of the voice model is complete. The speech synthesis management service then downloads the current voice model from the third-party file cluster, selects a speech synthesis server from the speech synthesis service cluster, and notifies that server to load the voice model.
The specific speech synthesis server can be selected according to attributes such as the number of voice models already stored on each server and the types of users the models belong to, or the selection criteria can be set according to the goals of the current application scenario, so as to improve the efficiency of the speech synthesis service.
And step 450, synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
The invention designates a server for each user, making it possible to provide a speech synthesis service for an individual user.
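Putting steps 420 to 450 together, the following is a hedged sketch of what the speech synthesis management service might do when it receives the training-complete message: fetch the model from the third-party file cluster, pick a server (the least-loaded rule is defined in the sketch after Fig. 5 below), tell that server to load the model, and record the routing entry. All helper names are assumptions of this sketch, not an API defined by the invention.

    def download_model(key: str) -> bytes:
        """Placeholder for downloading the trained model from the third-party file cluster."""
        raise NotImplementedError

    def push_model_to_server(address: str, voice_model_id: str, model_blob: bytes) -> None:
        """Placeholder for the call that tells a synthesis server to load a voice model."""
        raise NotImplementedError

    def on_model_trained(voice_model_id: str,
                         routing_table: dict,
                         model_counts: dict,
                         server_addresses: dict) -> None:
        """Illustrative handler for the asynchronous 'training finished' notification (steps 420-450)."""
        # Fetch the newly trained model from the third-party file cluster.
        model_blob = download_model(f"models/{voice_model_id}")
        # Step 430: choose a synthesis server; the least-loaded rule is sketched after Fig. 5 below.
        server_id = select_least_loaded_server(model_counts)
        model_counts[server_id] += 1
        # Step 440: ask the chosen server to load the model into memory.
        push_model_to_server(server_addresses[server_id], voice_model_id, model_blob)
        # Step 450: record the mapping so the gateway can route future requests to this server.
        routing_table[voice_model_id] = server_id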
Fig. 5 is a flowchart illustrating a process for selecting a speech synthesis server in a speech synthesis service cluster according to an embodiment of the present invention, as shown in fig. 5, including:
step 510, counting the number of speech models of each speech synthesis server in the speech synthesis service cluster.
Step 520, selecting the speech synthesis server with the least number of speech models to download the user speech models.
Selecting the speech synthesis server holding the fewest models to store the newly trained voice model balances the load across the speech synthesis service cluster, improves the processing efficiency of the speech synthesis service, and improves the user experience.
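A minimal sketch of this least-loaded selection, assuming the per-server model counts have already been gathered from the cluster; the server names and counts below are illustrative only.

    def select_least_loaded_server(model_counts: dict) -> str:
        """Steps 510-520: pick the synthesis server currently holding the fewest voice models."""
        return min(model_counts, key=model_counts.get)

    # Illustrative counts gathered from the cluster; "tts-server-2" would be chosen here.
    model_counts = {"tts-server-1": 12, "tts-server-2": 7, "tts-server-3": 9}
    assert select_least_loaded_server(model_counts) == "tts-server-2"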
A speech synthesis process in an application scenario in the embodiment of the present invention is described below.
The application scenario is the customization of a parent-child audiobook: the user only needs to record several sentences of audio, and the trained timbre can then be added to the audiobook's selectable timbres.
Step 1, a user registers and logs in a server through client software, a tone library is set in the system, and the tone library comprises common tones for customizing parent-child talking books.
And 2, selecting a preset tone by the user, or recording the audio to create a tone set of the user.
And 3, uploading the books and the text fragments by the user, or selecting the common text in the system text library as the target text of the voice to be synthesized.
And 4, selecting a specific tone from the tone collection by the user, and starting to customize the audio book.
And 5, after the audio book is customized, the user can browse and play the audio book in the audio book collection.
And 6, the user can create an alternative tone color set for the specific audio book, namely, the same text corresponds to different tone colors for the user to select to play.
With the voice synthesis method provided by the invention, the user's dedicated voice model is stored in a designated speech synthesis server in advance, and routing information is generated on the basis of the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, so that speech synthesis can be offered to an individual user at low production cost and with a simple invocation process. Meanwhile, because the script of the user's recorded audio is prescribed, the distinctive characteristics of the user's voice can be extracted from a small amount of audio data, which improves the similarity of the customized timbre, shortens the time spent on audio recording and corpus training, and improves the overall efficiency of the speech synthesis service. In addition, because synthesis with a specific timbre is convenient, the user can choose more timbres for synthesis beyond the preset ones, improving the user experience. Moreover, each request is completed by the designated server holding the corresponding voice model, without involving the other servers, which improves service efficiency overall.
The following describes a speech synthesis apparatus provided by the present invention, and the speech synthesis apparatus described below and the speech synthesis method described above may be referred to correspondingly.
Fig. 6 is a schematic structural diagram of a speech synthesis apparatus provided in the present invention, as shown in fig. 6, the apparatus includes:
an obtaining service request module 610, configured to obtain a service request of a user, where the service request includes a target text of a speech to be synthesized.
The user registers and logs in to the server through client software to obtain the relevant system permissions. If the user's voice model has already been stored in the system, the user only needs to upload the target text of the speech to be synthesized after logging in, or select a common text pre-stored in the system for speech synthesis.
And the determining server module 620 is configured to determine, according to the service request and the pre-stored routing information, a speech synthesis server storing the speech model corresponding to the user in a speech synthesis service cluster.
The server stores, in the routing gateway, the routing information of the speech synthesis server that holds the specific user's voice model; when the routing gateway receives the user's service request, it can look up the routing information of the speech synthesis server corresponding to the current user in the routing table.
The invention uses dedicated-gateway routing in place of the database query technique commonly used in the prior art, so that the user request is quickly directed to the designated server.
A speech synthesis module 630 configured to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
The voice synthesis device provided by the invention stores the user's dedicated voice model in a designated speech synthesis server in advance and generates routing information based on the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, achieving the provision of a speech synthesis service for an individual user at low production cost and with a simple invocation process.
According to one embodiment of the invention, the service request includes a username and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module 620 is further configured to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
The server stores, in the routing gateway, the routing information of the speech synthesis server that holds the specific user's voice model; when the routing gateway receives the user's service request, it can look up the routing information of the speech synthesis server corresponding to the current user in the routing table.
It should be noted that a user may have multiple voice models. When the user submits a speech synthesis service request, the server needs to know which specific voice model the user has selected; since the voice model ID is the unique identifier of a voice model, the routing information should be composed from the voice model ID so that the storage location of the model selected by the current user, that is, the server providing the speech synthesis service, can be determined quickly.
The invention uses a routing table to store the mapping between voice models and speech synthesis servers, so that a user's service request can be quickly directed to the designated server, improving service efficiency.
According to an embodiment of the invention, the apparatus further comprises:
and the data acquisition module is used for acquiring the audio data of the user and uploading the audio data to the third-party file cluster.
If the system does not yet store a voice model for the current user, the user needs to record audio through the client software after logging in; the collected audio data are uploaded to the server's audio management service, which uploads them to a third-party file cluster such as OSS for storage and backup.
In addition, the invention prescribes the script used when the user records the audio data, so that the user does not need to spend a long time recording training corpus while the distinctive characteristics of the user's voice can still be extracted. This improves the similarity of the customized timbre, and because the amount of training corpus is small, training time is greatly shortened, raising the efficiency of the whole speech synthesis training process.
A model training and storage module for:
downloading the audio data from the third party file cluster;
and storing the voice model corresponding to the user to a voice synthesis server according to the audio data, and generating corresponding routing information.
Specifically, the audio management service starts an asynchronous command, informs the deep machine learning service to train a voice model of the user, and the deep machine learning service receives the command and downloads audio data from the third-party file cluster to prepare for model training.
The special characteristics of the user tone can be extracted from the trained voice model, the voice model is stored in the appointed voice synthesis server, and the target text is converted into the voice with the user tone characteristics when the voice synthesis service request of the user is received.
According to an embodiment of the present invention, the model training and storage module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
Specifically, based on a neural network model, the deep machine learning service extracts the user's distinctive characteristics from the small amount of audio recorded by the user and combines them with a voice model pre-trained on general data to obtain a model dedicated to that specific user. The deep machine learning service then uploads the trained dedicated voice model to the third-party file cluster.
The deep machine learning service sends an asynchronous message notifying the speech synthesis management service that training of the voice model is complete. The speech synthesis management service then downloads the current voice model from the third-party file cluster, selects a speech synthesis server from the speech synthesis service cluster, and notifies that server to load the voice model.
The specific speech synthesis server can be selected according to attributes such as the number of voice models already stored on each server and the types of users the models belong to, or the selection criteria can be set according to the goals of the current application scenario, so as to improve the efficiency of the speech synthesis service.
The invention designates a server for each user, making it possible to provide a speech synthesis service for an individual user.
According to an embodiment of the present invention, the model training and storage module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
Selecting the speech synthesis server holding the fewest models to store the newly trained voice model balances the load across the speech synthesis service cluster, improves the processing efficiency of the speech synthesis service, and improves the user experience.
The voice synthesis device provided by the invention stores the user's dedicated voice model in a designated speech synthesis server in advance and generates routing information based on the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, achieving the provision of a speech synthesis service for an individual user at low production cost and with a simple invocation process. Meanwhile, because the script of the user's recorded audio is prescribed, the distinctive characteristics of the user's voice can be extracted from a small amount of audio data, which improves the similarity of the customized timbre, shortens the time spent on audio recording and corpus training, and improves the overall efficiency of the speech synthesis service. In addition, because synthesis with a specific timbre is convenient, the user can choose more timbres for synthesis beyond the preset ones, improving the user experience. Moreover, each request is completed by the designated server holding the corresponding voice model, without involving the other servers, which improves service efficiency overall.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a speech synthesis method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; returning the speech to the user.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a speech synthesis method provided by the above methods, the method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; returning the speech to the user.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform a speech synthesis method provided by the above methods, the method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; returning the speech to the user.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of speech synthesis, comprising:
acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized;
determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information;
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
2. The speech synthesis method of claim 1, wherein the service request includes a username and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
determining a speech synthesis server storing a speech model corresponding to the user in a speech synthesis service cluster according to the service request and pre-stored routing information, comprising:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
3. The speech synthesis method of claim 2, wherein the method further comprises:
acquiring audio data of a user, and uploading the audio data to a third-party file cluster;
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
4. The method of claim 3, wherein the training of the voice model corresponding to the user according to the audio data, storing the voice model in a speech synthesis server, and generating corresponding routing information comprises:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
5. The method of claim 4, wherein selecting a speech synthesis server in the speech synthesis service cluster comprises:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
6. A speech synthesis apparatus, comprising:
an acquisition service request module, configured to acquire a service request of a user, wherein the service request comprises a target text of the speech to be synthesized;
the determining server module is used for determining a voice synthesis server for storing the voice model corresponding to the user in a voice synthesis service cluster according to the service request and the pre-stored routing information;
a speech synthesis module to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
7. The speech synthesis apparatus of claim 6, wherein the service request comprises a username and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module is further to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
8. The speech synthesis apparatus of claim 7, wherein the apparatus further comprises:
the data acquisition module is used for acquiring audio data of a user and uploading the audio data to a third-party file cluster;
a model training and storage module for:
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
9. The speech synthesis apparatus of claim 8, wherein the model training and storage module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
10. The speech synthesis apparatus of claim 8, wherein the model training and storage module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech synthesis method according to any of claims 1 to 5 are implemented when the program is executed by the processor.
12. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech synthesis method according to any one of claims 1 to 5.
CN202110496900.XA 2021-05-07 2021-05-07 Voice synthesis method and device, electronic equipment and storage medium Pending CN113160791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110496900.XA CN113160791A (en) 2021-05-07 2021-05-07 Voice synthesis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110496900.XA CN113160791A (en) 2021-05-07 2021-05-07 Voice synthesis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113160791A true CN113160791A (en) 2021-07-23

Family

ID=76873690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110496900.XA Pending CN113160791A (en) 2021-05-07 2021-05-07 Voice synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113160791A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170092258A1 (en) * 2015-09-29 2017-03-30 Yandex Europe Ag Method and system for text-to-speech synthesis
CN110060656A (en) * 2019-05-05 2019-07-26 标贝(深圳)科技有限公司 Model management and phoneme synthesizing method, device and system and storage medium
CN110751940A (en) * 2019-09-16 2020-02-04 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for generating voice packet
CN112185362A (en) * 2020-09-24 2021-01-05 苏州思必驰信息科技有限公司 Voice processing method and device for user personalized service
CN112270920A (en) * 2020-10-28 2021-01-26 北京百度网讯科技有限公司 Voice synthesis method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176
Applicant after: Jingdong Technology Holding Co.,Ltd.
Address before: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176
Applicant before: Jingdong Digital Technology Holding Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20210723