CN113160791A - Voice synthesis method and device, electronic equipment and storage medium - Google Patents

Voice synthesis method and device, electronic equipment and storage medium

Info

Publication number
CN113160791A
CN113160791A (application CN202110496900.XA)
Authority
CN
China
Prior art keywords
voice
speech
user
model
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110496900.XA
Other languages
Chinese (zh)
Inventor
刘树勇
吴俊仪
蔡玉玉
张政臣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN202110496900.XA
Publication of CN113160791A
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 — Architecture of speech synthesisers

Abstract

The invention provides a voice synthesis method and device, electronic equipment and a storage medium, wherein the method comprises: acquiring a service request of a user, the service request comprising a target text of the speech to be synthesized; determining, according to the service request and pre-stored routing information, the speech synthesis server in a speech synthesis service cluster that stores the voice model corresponding to the user; forwarding the service request to the speech synthesis server so that the speech synthesis server synthesizes the target text into speech; and returning the speech to the user. The method stores the user's dedicated voice model in a designated speech synthesis server in advance; when the user submits a speech synthesis service request, the corresponding speech synthesis server is located directly through the routing information and serves the user, so that speech synthesis can be provided for an individual user at low production cost and with a simple invocation process.

Description

Voice synthesis method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
In recent years, artificial intelligence technology has developed rapidly, and technological progress has redefined the process of human-computer interaction: based on existing speech recognition and speech synthesis technology, the means of interaction has shifted from general mechanical instruction input to voice control. During human-computer interaction, speech synthesis technology converts the generated response text into speech for playback, and the same text can be synthesized into different audio with different types of timbre models, which adds considerable interest to speech synthesis.
At present, the timbres of existing speech synthesis systems are based on deep learning technology: supervised training on a large amount of pre-recorded audio from the same speaker imitates the speaker's intonation, phonemes, articulation characteristics and other elements, achieving the effect of replicating the timbre. Some audiobooks are produced by speech synthesis, where a chosen timbre converts the text of the book into speech for playback, replacing the traditional way of reading and easing the visual burden on the user.
Although existing speech synthesis methods are mature, limitations of the prior art mean that it is costly for an ordinary user to create a dedicated timbre on their own, the invocation process is cumbersome, and the server cannot provide a speech synthesis service for an individual user.
Disclosure of Invention
The invention provides a voice synthesis method and a voice synthesis device, which are used to overcome the defects in the prior art that creating a dedicated timbre is costly for an ordinary user and the invocation process is cumbersome, and to achieve the provision of a speech synthesis service for an individual user.
The invention provides a speech synthesis method, which comprises the following steps:
acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized;
determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information;
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
According to the voice synthesis method provided by the invention, the service request comprises a user name and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
determining a speech synthesis server storing a speech model corresponding to the user in a speech synthesis service cluster according to the service request and pre-stored routing information, comprising:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
According to a speech synthesis method provided by the present invention, the method further comprises:
acquiring audio data of a user, and uploading the audio data to a third-party file cluster;
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
According to a speech synthesis method provided by the present invention, the training of the speech model corresponding to the user according to the audio data and storing the speech model in a speech synthesis server, and generating corresponding routing information includes:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
According to a speech synthesis method provided by the present invention, the selecting a speech synthesis server in a speech synthesis service cluster comprises:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
The present invention also provides a speech synthesis apparatus comprising:
an acquisition service request module, configured to acquire a service request of a user, wherein the service request comprises a target text of the speech to be synthesized;
the determining server module is used for determining a voice synthesis server for storing the voice model corresponding to the user in a voice synthesis service cluster according to the service request and the pre-stored routing information;
a speech synthesis module to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
According to a speech synthesis apparatus provided by the present invention, the service request includes a user name and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module is further to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
According to a speech synthesis apparatus provided by the present invention, the apparatus further comprises:
the data acquisition module is used for acquiring audio data of a user and uploading the audio data to a third-party file cluster;
a model training and storage module for:
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
According to the speech synthesis apparatus provided by the present invention, the model training and storing module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
According to the speech synthesis apparatus provided by the present invention, the model training and storing module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the speech synthesis method as described in any one of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the speech synthesis method as described in any one of the above.
With the voice synthesis method and the voice synthesis device provided by the invention, the user's dedicated voice model is stored in a designated speech synthesis server in advance, and routing information is generated on the basis of the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, so that speech synthesis can be offered to an individual user at low production cost and with a simple invocation process.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a speech synthesis method provided by the present invention;
FIG. 2 is a schematic flow chart of a speech synthesis server for determining a speech model corresponding to a stored user according to the present invention;
FIG. 3 is a schematic flow chart of the present invention for training a specific speech model based on audio data of a user;
FIG. 4 is a schematic flow chart illustrating a process of storing a speech model corresponding to a trained user according to audio data in a speech synthesis server and generating corresponding routing information according to the present invention;
FIG. 5 is a schematic flow chart of selecting a speech synthesis server in a speech synthesis service cluster according to the present invention;
FIG. 6 is a schematic structural diagram of a speech synthesis apparatus provided in the present invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a speech synthesis method provided by the present invention, as shown in fig. 1, the method includes:
step 110, a service request of a user is obtained, wherein the service request comprises a target text of the voice to be synthesized.
The user registers and logs in to the server through client software to obtain the relevant system permissions. If the user's voice model has already been stored in the system, the user only needs to upload the target text of the speech to be synthesized after logging in, or select a common text pre-stored in the system for speech synthesis.
Step 120, determining a speech synthesis server storing the speech model corresponding to the user in a speech synthesis service cluster according to the service request and the pre-stored routing information.
The server stores, in the routing gateway, the routing information of the speech synthesis server that holds the specific user's voice model; when the routing gateway receives the user's service request, it can look up the routing information of the speech synthesis server corresponding to the current user in the routing table.
The invention uses dedicated-gateway routing in place of the database query technique commonly used in the prior art, so that the user request is quickly directed to the designated server.
Step 130, forwarding the service request to the speech synthesis server, so that the speech synthesis server synthesizes the target text into speech.
Step 140, returning the speech to the user.
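By way of illustration only, the following Python sketch shows how a routing gateway implementing steps 110 to 140 might handle a request: it resolves the speech synthesis server from the voice model ID carried in the request, forwards the target text to that server, and returns the synthesized audio. The names ServiceRequest, forward_request and handle_service_request, and the use of in-memory dictionaries for the routing table and server addresses, are assumptions of this sketch and are not structures defined by the invention.

    from dataclasses import dataclass

    # Hypothetical request type; the patent does not define concrete data structures.
    @dataclass
    class ServiceRequest:
        user_name: str
        voice_model_id: str
        target_text: str

    def forward_request(server_address: str, request: ServiceRequest) -> bytes:
        """Placeholder for the actual RPC/HTTP call to the chosen synthesis server."""
        raise NotImplementedError("transport is deployment-specific")

    def handle_service_request(request: ServiceRequest,
                               routing_table: dict,
                               server_addresses: dict) -> bytes:
        """Gateway-side handling of one synthesis request (steps 110-140, illustrative only)."""
        # Step 120: resolve the synthesis server that stores this user's voice model.
        server_id = routing_table[request.voice_model_id]   # voice model ID -> server ID
        address = server_addresses[server_id]                # server ID -> network address
        # Step 130: forward the request so the selected server synthesizes the target text.
        audio = forward_request(address, request)
        # Step 140: return the synthesized speech to the user.
        return audio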
With the voice synthesis method provided by the invention, the user's dedicated voice model is stored in a designated speech synthesis server in advance, and routing information is generated on the basis of the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, so that speech synthesis can be offered to an individual user at low production cost and with a simple invocation process.
Fig. 2 is a schematic flowchart of a process for determining a speech synthesis server storing a speech model corresponding to a user according to an embodiment of the present invention, as shown in fig. 2, including:
the service request includes a username and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
step 210, searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID.
Step 220, determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the voice synthesis server ID.
The server stores, in the routing gateway, the routing information of the speech synthesis server that holds the specific user's voice model; when the routing gateway receives the user's service request, it can look up the routing information of the speech synthesis server corresponding to the current user in the routing table.
It should be noted that a user may have multiple voice models. When the user submits a speech synthesis service request, the server needs to know which specific voice model the user has selected; since the voice model ID is the unique identifier of a voice model, the routing information should be composed from the voice model ID so that the storage location of the model selected by the current user, that is, the server providing the speech synthesis service, can be determined quickly.
The invention uses a routing table to store the mapping between voice models and speech synthesis servers, so that a user's service request can be quickly directed to the designated server, improving service efficiency.
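A minimal sketch of the lookup described above, assuming the routing table is held in memory at the gateway and keyed by the voice model ID, so that a user who owns several voice models is still routed to the server holding the model selected for the current request. The table contents and the function name are illustrative only.

    # Illustrative in-memory routing table held by the gateway: voice model ID -> server ID.
    routing_table = {
        "model-alice-01": "tts-server-3",
        "model-alice-02": "tts-server-1",   # the same user may own several voice models
        "model-bob-01": "tts-server-2",
    }

    def find_synthesis_server(voice_model_id: str) -> str:
        """Steps 210-220: look up the server ID recorded for the requested voice model."""
        server_id = routing_table.get(voice_model_id)
        if server_id is None:
            raise KeyError(f"no routing entry for voice model {voice_model_id}")
        return server_id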
FIG. 3 is a flowchart illustrating a process of training a specific speech model based on audio data of a user according to an embodiment of the present invention, as shown in FIG. 3, including:
and step 310, acquiring the audio data of the user, and uploading the audio data to a third-party file cluster.
The user registers and logs in to the server through client software to obtain the relevant system permissions. If the user's voice model has already been stored in the system, the user only needs to upload the target text of the speech to be synthesized after logging in, or select a common text pre-stored in the system for speech synthesis. If the system does not yet store a voice model for the current user, the user needs to record audio through the client software after logging in; the collected audio data are uploaded to the server's audio management service, which uploads them to a third-party file cluster such as OSS for storage and backup.
In addition, the invention prescribes the script used when the user records the audio data, so that the user does not need to spend a long time recording training corpus while the distinctive characteristics of the user's voice can still be extracted. This improves the similarity of the customized timbre, and because the amount of training corpus is small, training time is greatly shortened, raising the efficiency of the whole speech synthesis training process.
Step 320, downloading the audio data from the third party file cluster.
The audio management service issues an asynchronous command notifying the deep machine learning service to train a voice model for the user; upon receiving the command, the deep machine learning service downloads the audio data from the third-party file cluster in preparation for model training.
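One way this asynchronous hand-off could look is sketched below in Python, using the standard queue module as a stand-in for whatever message middleware a real deployment would use; download_from_file_cluster is a placeholder for the actual OSS download call, and all names are assumptions of this sketch.

    import queue
    import threading

    # Stand-in for the message channel between the audio management service and the
    # deep machine learning (training) service; a real deployment would use middleware.
    training_tasks = queue.Queue()

    def notify_training_service(user_id: str, audio_keys: list) -> None:
        """Audio-management side: issue the asynchronous training command and return immediately."""
        training_tasks.put({"user_id": user_id, "audio_keys": audio_keys})

    def download_from_file_cluster(key: str) -> bytes:
        """Placeholder for fetching one audio object from the third-party file cluster (e.g. OSS)."""
        raise NotImplementedError

    def training_worker() -> None:
        """Training-service side: receive the command and fetch the audio to prepare for training."""
        while True:
            task = training_tasks.get()
            audio_files = [download_from_file_cluster(k) for k in task["audio_keys"]]
            # ... voice model training would start here (see the sketch after step 410) ...
            training_tasks.task_done()

    threading.Thread(target=training_worker, daemon=True).start()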
Step 330, training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
The trained voice model captures the distinctive characteristics of the user's timbre. The model is stored in the designated speech synthesis server and, when the user's speech synthesis service request is received, converts the target text into speech carrying the user's timbre characteristics.
Fig. 4 is a schematic flowchart illustrating a process of training a speech model corresponding to a user according to audio data, storing the speech model in a speech synthesis server, and generating corresponding routing information, according to an embodiment of the present invention, as shown in fig. 4, including:
and step 410, training a voice model corresponding to the user according to the audio data based on a neural network model.
Based on a neural network model, the deep machine learning service extracts the user's distinctive characteristics from the small amount of audio recorded by the user and combines them with a voice model pre-trained on general data to obtain a model dedicated to that specific user.
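The patent does not specify a network architecture or training procedure, so the following is only a schematic of the adaptation idea described above: extract speaker-specific features from the small recording set and adapt a model pre-trained on general data. Every function here (load_pretrained_tts, extract_speaker_embedding, finetune) is a hypothetical placeholder, not the invention's implementation.

    def load_pretrained_tts():
        """Placeholder: a synthesis model trained offline on general-purpose data."""
        raise NotImplementedError

    def extract_speaker_embedding(audio_files):
        """Placeholder: derive the user's timbre features from the short recordings."""
        raise NotImplementedError

    def finetune(base_model, speaker_embedding, audio_files, transcripts):
        """Placeholder: adapt the base model to the user's voice."""
        raise NotImplementedError

    def build_user_voice_model(audio_files, transcripts):
        """Schematic of step 410: adapt a pre-trained model to one user's small recording set."""
        base_model = load_pretrained_tts()
        speaker_embedding = extract_speaker_embedding(audio_files)
        # A small scripted corpus suffices because only the speaker-specific part is adapted.
        return finetune(base_model, speaker_embedding, audio_files, transcripts)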
Step 420, uploading the voice model to the third-party file cluster.
The deep machine learning service uploads the trained dedicated voice model to the third-party file cluster.
Step 430, a speech synthesis server is selected in the speech synthesis service cluster.
Step 440, downloading the speech model from the third-party file cluster, and storing the speech model in the speech synthesis server.
The deep machine learning service sends an asynchronous message notifying the speech synthesis management service that training of the voice model is complete. The speech synthesis management service then downloads the current voice model from the third-party file cluster, selects a speech synthesis server from the speech synthesis service cluster, and notifies that server to load the voice model.
The specific speech synthesis server can be selected according to attributes such as the number of voice models already stored on each server and the types of users the models belong to, or the selection criteria can be set according to the goals of the current application scenario, so as to improve the efficiency of the speech synthesis service.
And step 450, synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
The invention designates a server for each user, making it possible to provide a speech synthesis service for an individual user.
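Putting steps 420 to 450 together, the following is a hedged sketch of what the speech synthesis management service might do when it receives the training-complete message: fetch the model from the third-party file cluster, pick a server (the least-loaded rule is defined in the sketch after Fig. 5 below), tell that server to load the model, and record the routing entry. All helper names are assumptions of this sketch, not an API defined by the invention.

    def download_model(key: str) -> bytes:
        """Placeholder for downloading the trained model from the third-party file cluster."""
        raise NotImplementedError

    def push_model_to_server(address: str, voice_model_id: str, model_blob: bytes) -> None:
        """Placeholder for the call that tells a synthesis server to load a voice model."""
        raise NotImplementedError

    def on_model_trained(voice_model_id: str,
                         routing_table: dict,
                         model_counts: dict,
                         server_addresses: dict) -> None:
        """Illustrative handler for the asynchronous 'training finished' notification (steps 420-450)."""
        # Fetch the newly trained model from the third-party file cluster.
        model_blob = download_model(f"models/{voice_model_id}")
        # Step 430: choose a synthesis server; the least-loaded rule is sketched after Fig. 5 below.
        server_id = select_least_loaded_server(model_counts)
        model_counts[server_id] += 1
        # Step 440: ask the chosen server to load the model into memory.
        push_model_to_server(server_addresses[server_id], voice_model_id, model_blob)
        # Step 450: record the mapping so the gateway can route future requests to this server.
        routing_table[voice_model_id] = server_id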
Fig. 5 is a flowchart illustrating a process for selecting a speech synthesis server in a speech synthesis service cluster according to an embodiment of the present invention, as shown in fig. 5, including:
step 510, counting the number of speech models of each speech synthesis server in the speech synthesis service cluster.
Step 520, selecting the speech synthesis server with the least number of speech models to download the user speech models.
Selecting the speech synthesis server holding the fewest models to store the newly trained voice model balances the load across the speech synthesis service cluster, improves the processing efficiency of the speech synthesis service, and improves the user experience.
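A minimal sketch of this least-loaded selection, assuming the per-server model counts have already been gathered from the cluster; the server names and counts below are illustrative only.

    def select_least_loaded_server(model_counts: dict) -> str:
        """Steps 510-520: pick the synthesis server currently holding the fewest voice models."""
        return min(model_counts, key=model_counts.get)

    # Illustrative counts gathered from the cluster; "tts-server-2" would be chosen here.
    model_counts = {"tts-server-1": 12, "tts-server-2": 7, "tts-server-3": 9}
    assert select_least_loaded_server(model_counts) == "tts-server-2"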
A speech synthesis process in an application scenario in the embodiment of the present invention is described below.
The application scenario is the customization of a parent-child audiobook: the user only needs to record several sentences of audio, and the trained timbre can then be added to the audiobook's selectable timbres.
Step 1, a user registers and logs in a server through client software, a tone library is set in the system, and the tone library comprises common tones for customizing parent-child talking books.
And 2, selecting a preset tone by the user, or recording the audio to create a tone set of the user.
And 3, uploading the books and the text fragments by the user, or selecting the common text in the system text library as the target text of the voice to be synthesized.
And 4, selecting a specific tone from the tone collection by the user, and starting to customize the audio book.
And 5, after the audio book is customized, the user can browse and play the audio book in the audio book collection.
And 6, the user can create an alternative tone color set for the specific audio book, namely, the same text corresponds to different tone colors for the user to select to play.
With the voice synthesis method provided by the invention, the user's dedicated voice model is stored in a designated speech synthesis server in advance, and routing information is generated on the basis of the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, so that speech synthesis can be offered to an individual user at low production cost and with a simple invocation process. Meanwhile, because the script of the user's recorded audio is prescribed, the distinctive characteristics of the user's voice can be extracted from a small amount of audio data, which improves the similarity of the customized timbre, shortens the time spent on audio recording and corpus training, and improves the overall efficiency of the speech synthesis service. In addition, because synthesis with a specific timbre is convenient, the user can choose more timbres for synthesis beyond the preset ones, improving the user experience. Moreover, each request is completed by the designated server holding the corresponding voice model, without involving the other servers, which improves service efficiency overall.
The following describes a speech synthesis apparatus provided by the present invention, and the speech synthesis apparatus described below and the speech synthesis method described above may be referred to correspondingly.
Fig. 6 is a schematic structural diagram of a speech synthesis apparatus provided in the present invention, as shown in fig. 6, the apparatus includes:
an obtaining service request module 610, configured to obtain a service request of a user, where the service request includes a target text of a speech to be synthesized.
The user registers and logs in to the server through client software to obtain the relevant system permissions. If the user's voice model has already been stored in the system, the user only needs to upload the target text of the speech to be synthesized after logging in, or select a common text pre-stored in the system for speech synthesis.
And the determining server module 620 is configured to determine, according to the service request and the pre-stored routing information, a speech synthesis server storing the speech model corresponding to the user in a speech synthesis service cluster.
The server stores, in the routing gateway, the routing information of the speech synthesis server that holds the specific user's voice model; when the routing gateway receives the user's service request, it can look up the routing information of the speech synthesis server corresponding to the current user in the routing table.
The invention uses dedicated-gateway routing in place of the database query technique commonly used in the prior art, so that the user request is quickly directed to the designated server.
A speech synthesis module 630 configured to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
The voice synthesis device provided by the invention stores the user's dedicated voice model in a designated speech synthesis server in advance and generates routing information based on the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, achieving the provision of a speech synthesis service for an individual user at low production cost and with a simple invocation process.
According to one embodiment of the invention, the service request includes a username and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module 620 is further configured to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
The server stores, in the routing gateway, the routing information of the speech synthesis server that holds the specific user's voice model; when the routing gateway receives the user's service request, it can look up the routing information of the speech synthesis server corresponding to the current user in the routing table.
It should be noted that a user may have multiple voice models. When the user submits a speech synthesis service request, the server needs to know which specific voice model the user has selected; since the voice model ID is the unique identifier of a voice model, the routing information should be composed from the voice model ID so that the storage location of the model selected by the current user, that is, the server providing the speech synthesis service, can be determined quickly.
The invention uses a routing table to store the mapping between voice models and speech synthesis servers, so that a user's service request can be quickly directed to the designated server, improving service efficiency.
According to an embodiment of the invention, the apparatus further comprises:
and the data acquisition module is used for acquiring the audio data of the user and uploading the audio data to the third-party file cluster.
If the system does not yet store a voice model for the current user, the user needs to record audio through the client software after logging in; the collected audio data are uploaded to the server's audio management service, which uploads them to a third-party file cluster such as OSS for storage and backup.
In addition, the invention prescribes the script used when the user records the audio data, so that the user does not need to spend a long time recording training corpus while the distinctive characteristics of the user's voice can still be extracted. This improves the similarity of the customized timbre, and because the amount of training corpus is small, training time is greatly shortened, raising the efficiency of the whole speech synthesis training process.
A model training and storage module for:
downloading the audio data from the third party file cluster;
and storing the voice model corresponding to the user to a voice synthesis server according to the audio data, and generating corresponding routing information.
Specifically, the audio management service starts an asynchronous command, informs the deep machine learning service to train a voice model of the user, and the deep machine learning service receives the command and downloads audio data from the third-party file cluster to prepare for model training.
The special characteristics of the user tone can be extracted from the trained voice model, the voice model is stored in the appointed voice synthesis server, and the target text is converted into the voice with the user tone characteristics when the voice synthesis service request of the user is received.
According to an embodiment of the present invention, the model training and storage module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
Specifically, based on a neural network model, the deep machine learning service extracts the user's distinctive characteristics from the small amount of audio recorded by the user and combines them with a voice model pre-trained on general data to obtain a model dedicated to that specific user. The deep machine learning service then uploads the trained dedicated voice model to the third-party file cluster.
The deep machine learning service sends an asynchronous message notifying the speech synthesis management service that training of the voice model is complete. The speech synthesis management service then downloads the current voice model from the third-party file cluster, selects a speech synthesis server from the speech synthesis service cluster, and notifies that server to load the voice model.
The specific speech synthesis server can be selected according to attributes such as the number of voice models already stored on each server and the types of users the models belong to, or the selection criteria can be set according to the goals of the current application scenario, so as to improve the efficiency of the speech synthesis service.
The invention designates a server for each user, making it possible to provide a speech synthesis service for an individual user.
According to an embodiment of the present invention, the model training and storage module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
Selecting the speech synthesis server holding the fewest models to store the newly trained voice model balances the load across the speech synthesis service cluster, improves the processing efficiency of the speech synthesis service, and improves the user experience.
The voice synthesis device provided by the invention stores the user's dedicated voice model in a designated speech synthesis server in advance and generates routing information based on the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information and provides the service, achieving the provision of a speech synthesis service for an individual user at low production cost and with a simple invocation process. Meanwhile, because the script of the user's recorded audio is prescribed, the distinctive characteristics of the user's voice can be extracted from a small amount of audio data, which improves the similarity of the customized timbre, shortens the time spent on audio recording and corpus training, and improves the overall efficiency of the speech synthesis service. In addition, because synthesis with a specific timbre is convenient, the user can choose more timbres for synthesis beyond the preset ones, improving the user experience. Moreover, each request is completed by the designated server holding the corresponding voice model, without involving the other servers, which improves service efficiency overall.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a speech synthesis method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; returning the speech to the user.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a speech synthesis method provided by the above methods, the method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; returning the speech to the user.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform a speech synthesis method provided by the above methods, the method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; returning the speech to the user.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of speech synthesis, comprising:
acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized;
determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information;
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
2. The speech synthesis method of claim 1, wherein the service request includes a username and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
determining a speech synthesis server storing a speech model corresponding to the user in a speech synthesis service cluster according to the service request and pre-stored routing information, comprising:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
3. The speech synthesis method of claim 2, wherein the method further comprises:
acquiring audio data of a user, and uploading the audio data to a third-party file cluster;
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
4. The method of claim 3, wherein the training of the voice model corresponding to the user according to the audio data, storing the voice model in a speech synthesis server, and generating corresponding routing information comprises:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
5. The method of claim 4, wherein selecting a speech synthesis server in the speech synthesis service cluster comprises:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
6. A speech synthesis apparatus, comprising:
an acquisition service request module, configured to acquire a service request of a user, wherein the service request comprises a target text of the speech to be synthesized;
the determining server module is used for determining a voice synthesis server for storing the voice model corresponding to the user in a voice synthesis service cluster according to the service request and the pre-stored routing information;
a speech synthesis module to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
7. The speech synthesis apparatus of claim 6, wherein the service request comprises a username and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module is further to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
8. The speech synthesis apparatus of claim 7, wherein the apparatus further comprises:
the data acquisition module is used for acquiring audio data of a user and uploading the audio data to a third-party file cluster;
a model training and storage module for:
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
9. The speech synthesis apparatus of claim 8, wherein the model training and storage module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
10. The speech synthesis apparatus of claim 8, wherein the model training and storage module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech synthesis method according to any of claims 1 to 5 are implemented when the program is executed by the processor.
12. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech synthesis method according to any one of claims 1 to 5.
CN202110496900.XA 2021-05-07 2021-05-07 Voice synthesis method and device, electronic equipment and storage medium Pending CN113160791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110496900.XA CN113160791A (en) 2021-05-07 2021-05-07 Voice synthesis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110496900.XA CN113160791A (en) 2021-05-07 2021-05-07 Voice synthesis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113160791A true CN113160791A (en) 2021-07-23

Family

ID=76873690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110496900.XA Pending CN113160791A (en) 2021-05-07 2021-05-07 Voice synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113160791A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170092258A1 (en) * 2015-09-29 2017-03-30 Yandex Europe Ag Method and system for text-to-speech synthesis
CN110060656A (en) * 2019-05-05 2019-07-26 标贝(深圳)科技有限公司 Model management and phoneme synthesizing method, device and system and storage medium
CN110751940A (en) * 2019-09-16 2020-02-04 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for generating voice packet
CN112185362A (en) * 2020-09-24 2021-01-05 苏州思必驰信息科技有限公司 Voice processing method and device for user personalized service
CN112270920A (en) * 2020-10-28 2021-01-26 北京百度网讯科技有限公司 Voice synthesis method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176
Applicant after: Jingdong Technology Holding Co.,Ltd.
Address before: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176
Applicant before: Jingdong Digital Technology Holding Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20210723