CN113160791A - Voice synthesis method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN113160791A, CN202110496900.XA
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- user
- model
- server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
Abstract
The invention provides a speech synthesis method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a service request of a user, wherein the service request comprises a target text of speech to be synthesized; determining, according to the service request and pre-stored routing information, a speech synthesis server in a speech synthesis service cluster that stores a voice model corresponding to the user; forwarding the service request to the speech synthesis server so that the speech synthesis server synthesizes the target text into speech; and returning the speech to the user. The user-specific voice model is stored in a designated speech synthesis server in advance; when the user submits a speech synthesis service request, the corresponding speech synthesis server is found directly through the routing information. Speech synthesis service is thereby provided for an individual user, at low production cost and with a simple calling process.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
In recent years, artificial intelligence technology has developed rapidly, and human-computer interaction has been redefined along with it: the means of interaction has shifted from conventional mechanical instruction input to voice control based on existing speech recognition and speech synthesis technology. In human-computer interaction, speech synthesis technology converts the generated response text into speech for playback. The same text can be synthesized into different audio by using different timbre models, which makes speech synthesis more engaging.
At present, the timbres of existing speech synthesis systems are based on deep learning: a large amount of pre-recorded audio from the same speaker is used for supervised training, and the speaker's tone, phonemes, vocal characteristics, and other elements are imitated to replicate the timbre. Some audiobooks are produced by speech synthesis: a preset timbre is selected and the text of the book is converted into speech for playback, which replaces the traditional way of reading and reduces the visual burden on the user.
Although existing speech synthesis methods are mature, the limitations of the prior art mean that it is costly for an ordinary user to create a dedicated timbre, the calling process is cumbersome, and the server cannot provide speech synthesis service for an individual user.
Disclosure of Invention
The invention provides a speech synthesis method and a speech synthesis apparatus to overcome the defects of the prior art, namely that it is costly for an ordinary user to create a dedicated timbre and that the calling process is cumbersome, and to provide speech synthesis service for an individual user.
The invention provides a speech synthesis method, which comprises the following steps:
acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized;
determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information;
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
According to the voice synthesis method provided by the invention, the service request comprises a user name and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
determining a speech synthesis server storing a speech model corresponding to the user in a speech synthesis service cluster according to the service request and pre-stored routing information, comprising:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
According to a speech synthesis method provided by the present invention, the method further comprises:
acquiring audio data of a user, and uploading the audio data to a third-party file cluster;
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
According to a speech synthesis method provided by the present invention, the training of the speech model corresponding to the user according to the audio data and storing the speech model in a speech synthesis server, and generating corresponding routing information includes:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
According to a speech synthesis method provided by the present invention, the selecting a speech synthesis server in a speech synthesis service cluster comprises:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
The present invention also provides a speech synthesis apparatus comprising:
an obtaining service request module, configured to obtain a service request of a user, wherein the service request comprises a target text of speech to be synthesized;
the determining server module is used for determining a voice synthesis server for storing the voice model corresponding to the user in a voice synthesis service cluster according to the service request and the pre-stored routing information;
a speech synthesis module to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
According to a speech synthesis apparatus provided by the present invention, the service request includes a user name and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module is further to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
According to a speech synthesis apparatus provided by the present invention, the apparatus further comprises:
the data acquisition module is used for acquiring audio data of a user and uploading the audio data to a third-party file cluster;
a model training and storage module for:
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
According to the speech synthesis apparatus provided by the present invention, the model training and storing module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
According to the speech synthesis apparatus provided by the present invention, the model training and storing module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the speech synthesis method as described in any one of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the speech synthesis method as described in any one of the above.
In the speech synthesis method and apparatus provided by the invention, the user-specific voice model is stored in a designated speech synthesis server in advance, and routing information is generated based on the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information. Speech synthesis service is thereby provided for an individual user, with lower production cost and a simple calling process.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a speech synthesis method provided by the present invention;
FIG. 2 is a schematic flow chart of determining the speech synthesis server that stores the voice model corresponding to the user, provided by the present invention;
FIG. 3 is a schematic flow chart of training a specific voice model based on audio data of a user, provided by the present invention;
FIG. 4 is a schematic flow chart of training the voice model corresponding to the user according to audio data, storing the voice model in a speech synthesis server, and generating corresponding routing information, provided by the present invention;
FIG. 5 is a schematic flow chart of selecting a speech synthesis server in a speech synthesis service cluster according to the present invention;
FIG. 6 is a schematic structural diagram of a speech synthesis apparatus provided in the present invention;
FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a speech synthesis method provided by the present invention, as shown in fig. 1, the method includes:
The user registers with and logs in to the server through client software to obtain the relevant permissions of the system. If the user's voice model has already been stored in the system, then after logging in the user only needs to upload the target text of the speech to be synthesized, or select a common text pre-stored in the system for speech synthesis.
The server stores, in a routing gateway, the routing information of the speech synthesis server that stores the voice model of a specific user. When the routing gateway receives the user's service request, it looks up in the routing table the routing information of the speech synthesis server corresponding to the current user.
The invention uses the routing technology of a dedicated gateway in place of the database query technology commonly used in the prior art, so that the user request is quickly directed to the designated server.
In the speech synthesis method provided by the invention, the user-specific voice model is stored in a designated speech synthesis server in advance, and routing information is generated based on the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information. Speech synthesis service is thereby provided for an individual user, with lower production cost and a simple calling process.
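The flow of FIG. 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `ServiceRequest`, `RoutingGateway`, and `fake_server` are hypothetical, the routing table is a plain in-memory dictionary, and each speech synthesis server is replaced by a callable that returns placeholder bytes instead of real audio.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ServiceRequest:
    """Hypothetical request shape: user name, voice model ID, target text."""
    username: str
    voice_model_id: str
    target_text: str


# Hypothetical stand-in for a speech synthesis server:
# takes (voice model ID, text) and returns audio bytes.
Synthesizer = Callable[[str, str], bytes]


class RoutingGateway:
    def __init__(self, routes: Dict[str, str], servers: Dict[str, Synthesizer]):
        self.routes = routes      # voice model ID -> speech synthesis server ID
        self.servers = servers    # server ID -> callable standing in for the server

    def handle(self, request: ServiceRequest) -> bytes:
        # Determine the server that stores the requested voice model.
        server_id = self.routes.get(request.voice_model_id)
        if server_id is None:
            raise LookupError(f"no route for voice model {request.voice_model_id!r}")
        # Forward the request so that the server synthesizes the target text.
        synthesize = self.servers[server_id]
        # Return the synthesized speech to the caller.
        return synthesize(request.voice_model_id, request.target_text)


def fake_server(name: str) -> Synthesizer:
    # Assumption for illustration: placeholder bytes instead of real audio.
    return lambda model_id, text: f"{name}|{model_id}|{text}".encode("utf-8")


gateway = RoutingGateway(
    routes={"model-a": "tts-1", "model-b": "tts-2"},
    servers={"tts-1": fake_server("tts-1"), "tts-2": fake_server("tts-2")},
)
audio = gateway.handle(ServiceRequest("alice", "model-b", "Once upon a time"))
```

In this sketch the gateway never queries a database at request time; it only consults the in-memory route map, which mirrors the design choice described above.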
Fig. 2 is a schematic flowchart of a process for determining a speech synthesis server storing a speech model corresponding to a user according to an embodiment of the present invention, as shown in fig. 2, including:
the service request includes a username and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
The server stores, in a routing gateway, the routing information of the speech synthesis server that stores the voice model of a specific user. When the routing gateway receives the user's service request, it looks up in the routing table the routing information of the speech synthesis server corresponding to the current user.
It should be noted that a user may have multiple voice models. When the user makes a speech synthesis service request, the server needs to know which voice model the user has selected. Because the voice model ID uniquely identifies the voice model, the routing information should be built from the voice model ID, which helps quickly determine the storage location of the voice model selected by the current user, that is, the server that will perform the speech synthesis service.
The invention uses a routing table to store the mapping between voice models and speech synthesis servers, so that the user's service request can be quickly directed to the designated server, which improves service efficiency.
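A routing table keyed by voice model ID might look like the following sketch. The class name `RoutingTable` and its methods are hypothetical, and the ownership check (that the requested model belongs to the named user) is an assumption added for illustration; the source only states that the request carries a user name and a voice model ID.

```python
from typing import Dict, List


class RoutingTable:
    """In-memory mapping from voice model ID to speech synthesis server ID."""

    def __init__(self) -> None:
        self._server_by_model: Dict[str, str] = {}
        self._models_by_user: Dict[str, List[str]] = {}

    def add_route(self, username: str, model_id: str, server_id: str) -> None:
        # The voice model ID is the unique key, so a duplicate is rejected.
        if model_id in self._server_by_model:
            raise ValueError(f"voice model {model_id!r} already has a route")
        self._server_by_model[model_id] = server_id
        self._models_by_user.setdefault(username, []).append(model_id)

    def resolve(self, username: str, model_id: str) -> str:
        # Assumed check: the model must belong to the requesting user.
        if model_id not in self._models_by_user.get(username, []):
            raise KeyError(f"user {username!r} has no voice model {model_id!r}")
        return self._server_by_model[model_id]


table = RoutingTable()
# One user with two voice models stored on different servers.
table.add_route("alice", "alice-story", "tts-1")
table.add_route("alice", "alice-news", "tts-2")
resolved = [table.resolve("alice", m) for m in ("alice-story", "alice-news")]
```

Keying on the model ID rather than the user name is what lets a single user hold several models on different servers.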
FIG. 3 is a flowchart illustrating a process of training a specific speech model based on audio data of a user according to an embodiment of the present invention, as shown in FIG. 3, including:
Step 310, acquiring the audio data of the user, and uploading the audio data to a third-party file cluster.
The user registers with and logs in to the server through client software to obtain the relevant permissions of the system. If the user's voice model has already been stored in the system, then after logging in the user only needs to upload the target text of the speech to be synthesized, or select a common text pre-stored in the system for speech synthesis. If the current user's voice model is not stored in the system, the user needs to record audio through the client software after logging in. After the audio data is acquired, it is uploaded to an audio management service of the server, and the audio management service uploads it to a third-party file cluster such as OSS for storage and backup.
In addition, the invention specifies the script that the user reads when recording the audio data, so that the user does not need to record a training corpus for a long time and the distinctive characteristics of the user's voice can still be extracted, which improves the similarity of the customized timbre. Because the training corpus is small, training time is greatly shortened, which improves the efficiency of speech synthesis training as a whole.
The audio management service issues an asynchronous command to notify the deep machine learning service to train the user's voice model. The deep machine learning service receives the command and downloads the audio data from the third-party file cluster to prepare for model training.
The trained voice model captures the distinctive characteristics of the user's timbre. The voice model is stored in the designated speech synthesis server, and when a speech synthesis service request from the user is received, the target text is converted into speech with the user's timbre characteristics.
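The hand-off between the audio management service and the deep machine learning service can be sketched as follows. Everything here is an assumption for illustration: a dictionary stands in for the third-party file cluster, a `queue.Queue` stands in for the asynchronous command channel, the function names are hypothetical, and model training itself is not shown.

```python
import queue
from typing import Dict

file_cluster: Dict[str, bytes] = {}          # stands in for the third-party file cluster
training_queue: queue.Queue = queue.Queue()  # stands in for the asynchronous command channel


def audio_management_upload(username: str, audio: bytes) -> str:
    """Store the user's audio in the file cluster and queue a training command."""
    key = f"audio/{username}.wav"
    file_cluster[key] = audio
    training_queue.put({"username": username, "audio_key": key})
    return key


def training_worker_step() -> dict:
    """Take one command from the queue and download the audio it refers to."""
    command = training_queue.get_nowait()
    audio = file_cluster[command["audio_key"]]
    # Placeholder: a real service would train the voice model on `audio` here.
    return {"username": command["username"], "audio_bytes": len(audio)}


audio_management_upload("alice", b"\x00\x01\x02\x03")
job = training_worker_step()
```

Because the upload only enqueues a command, the audio management service does not wait for training to finish, which is the point of the asynchronous notification.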
Fig. 4 is a schematic flowchart illustrating a process of training a speech model corresponding to a user according to audio data, storing the speech model in a speech synthesis server, and generating corresponding routing information, according to an embodiment of the present invention, as shown in fig. 4, including:
Step 410, training a voice model corresponding to the user according to the audio data based on a neural network model.
Based on a neural network model, the deep machine learning service extracts the distinctive characteristics of the user from the small amount of audio data recorded by the user, and combines them with a voice model trained in advance on general data to obtain a dedicated model for the specific user.
The deep machine learning service uploads the trained dedicated voice model to the third-party file cluster.
The deep machine learning service sends an asynchronous message to notify the speech synthesis management service that training of the voice model is complete. The speech synthesis management service downloads the current voice model from the third-party file cluster, selects a speech synthesis server in the speech synthesis service cluster, and notifies that speech synthesis server to load the voice model.
The speech synthesis server can be selected according to attributes such as the number of voice models stored on each server and the type of user to which the models belong, or the selection conditions can be set according to the goal of the current application scenario, so as to improve the efficiency of the speech synthesis service.
Step 450, combining the voice model ID and the speech synthesis server ID into the routing information and storing the routing information.
The invention designates servers for different users, thereby providing speech synthesis service for an individual user.
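The deployment steps of FIG. 4 after training (upload the model, select a server, store the model on it, record the route) can be sketched as one function. The function name `deploy_voice_model`, the dictionary-based file cluster and server stores, and the pluggable `choose_server` policy are all hypothetical; the selection policy used in the example is a trivial placeholder rather than the policy of the invention.

```python
from typing import Callable, Dict

FileCluster = Dict[str, bytes]
ServerModels = Dict[str, Dict[str, bytes]]  # server ID -> {voice model ID: model bytes}


def deploy_voice_model(
    model_id: str,
    model_bytes: bytes,
    file_cluster: FileCluster,
    servers: ServerModels,
    routes: Dict[str, str],
    choose_server: Callable[[ServerModels], str],
) -> str:
    # Upload the trained voice model to the file cluster.
    key = f"models/{model_id}.bin"
    file_cluster[key] = model_bytes
    # Select a speech synthesis server in the cluster.
    server_id = choose_server(servers)
    # Download the model from the file cluster and store it on that server.
    servers[server_id][model_id] = file_cluster[key]
    # Combine the voice model ID and server ID into routing information and store it.
    routes[model_id] = server_id
    return server_id


file_cluster: FileCluster = {}
servers: ServerModels = {"tts-1": {}, "tts-2": {}}
routes: Dict[str, str] = {}
chosen = deploy_voice_model(
    "alice-story", b"weights", file_cluster, servers, routes,
    choose_server=lambda s: sorted(s)[0],  # placeholder policy: first server ID
)
```

Passing the selection policy as a parameter reflects the statement above that selection conditions may be set per application scenario.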
Fig. 5 is a flowchart illustrating a process for selecting a speech synthesis server in a speech synthesis service cluster according to an embodiment of the present invention, as shown in fig. 5, including:
The speech synthesis server with the fewest models is selected to store the newly trained voice model, which balances the load of the speech synthesis service cluster, improves the processing efficiency of the speech synthesis service, and improves the user experience.
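Selecting the server with the fewest voice models reduces to a minimum over per-server counts. The sketch below assumes the cluster state is available as a dictionary from server ID to the set of model IDs it holds; the function name and the tie-breaking rule (lowest server ID) are assumptions, since the source does not say how ties are resolved.

```python
from typing import Dict, Set


def select_least_loaded(models_by_server: Dict[str, Set[str]]) -> str:
    """Return the ID of the server that currently holds the fewest voice models."""
    if not models_by_server:
        raise ValueError("the speech synthesis service cluster is empty")
    # Count models per server; ties are broken by server ID (an assumption).
    return min(models_by_server, key=lambda sid: (len(models_by_server[sid]), sid))


cluster = {"tts-1": {"m1", "m2"}, "tts-2": {"m3"}, "tts-3": {"m4"}}
target = select_least_loaded(cluster)       # tts-2 and tts-3 tie; tts-2 wins on ID
cluster[target].add("m5")                   # store the newly trained model there
next_target = select_least_loaded(cluster)  # tts-3 is now the least loaded
```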
A speech synthesis process in an application scenario in the embodiment of the present invention is described below.
The application scenario is customizing a parent-child audiobook: the user only needs to record a few sentences of audio, and the trained timbre can then be added to the candidate timbres of the audiobook.
Step 1, the user registers with and logs in to the server through client software. A timbre library is provided in the system, and it includes common timbres for customizing parent-child audiobooks.
Step 2, the user selects a preset timbre, or records audio to create the user's own timbre set.
Step 3, the user uploads a book or text fragment, or selects a common text from the system text library, as the target text of the speech to be synthesized.
Step 4, the user selects a specific timbre from the timbre set and starts customizing the audiobook.
Step 5, after the audiobook has been customized, the user can browse and play it in the audiobook collection.
Step 6, the user can create a set of alternative timbres for a specific audiobook, that is, the same text corresponds to different timbres from which the user can choose for playback.
In the speech synthesis method provided by the invention, the user-specific voice model is stored in a designated speech synthesis server in advance, and routing information is generated based on the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information. Speech synthesis service is thereby provided for an individual user, with lower production cost and a simple calling process. Meanwhile, because the script read by the user when recording audio data is specified, the distinctive characteristics of the user's voice can be extracted from a small amount of audio data, which improves the similarity of the customized timbre, shortens the time needed for audio recording and corpus training, and improves the overall efficiency of the speech synthesis service. In addition, because speech synthesis with a specific timbre is convenient, the user can choose further timbres for synthesis beyond the preset ones, which improves the user experience. Moreover, because each voice model is stored on the server selected for it, the speech synthesis service is completed by a server well placed to handle it, which improves service efficiency as a whole.
The following describes a speech synthesis apparatus provided by the present invention, and the speech synthesis apparatus described below and the speech synthesis method described above may be referred to correspondingly.
Fig. 6 is a schematic structural diagram of a speech synthesis apparatus provided in the present invention, as shown in fig. 6, the apparatus includes:
an obtaining service request module 610, configured to obtain a service request of a user, where the service request includes a target text of a speech to be synthesized.
The user registers with and logs in to the server through client software to obtain the relevant permissions of the system. If the user's voice model has already been stored in the system, then after logging in the user only needs to upload the target text of the speech to be synthesized, or select a common text pre-stored in the system for speech synthesis.
A determining server module 620, configured to determine, according to the service request and the pre-stored routing information, a speech synthesis server in a speech synthesis service cluster that stores the voice model corresponding to the user.
The server stores, in a routing gateway, the routing information of the speech synthesis server that stores the voice model of a specific user. When the routing gateway receives the user's service request, it looks up in the routing table the routing information of the speech synthesis server corresponding to the current user.
The invention uses the routing technology of a dedicated gateway in place of the database query technology commonly used in the prior art, so that the user request is quickly directed to the designated server.
A speech synthesis module 630 configured to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
The speech synthesis apparatus provided by the invention stores the user-specific voice model in a designated speech synthesis server in advance and generates routing information based on the voice model and the speech synthesis server. When the user submits a speech synthesis service request, the speech synthesis server corresponding to the user is found directly through the routing information. Speech synthesis service is thereby provided for an individual user, with lower production cost and a simple calling process.
According to one embodiment of the invention, the service request includes a username and a voice model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module 620 is further configured to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
The server stores, in a routing gateway, the routing information of the speech synthesis server that stores the voice model of a specific user. When the routing gateway receives the user's service request, it looks up in the routing table the routing information of the speech synthesis server corresponding to the current user.
It should be noted that a user may have multiple voice models. When the user makes a speech synthesis service request, the server needs to know which voice model the user has selected. Because the voice model ID uniquely identifies the voice model, the routing information should be built from the voice model ID, which helps quickly determine the storage location of the voice model selected by the current user, that is, the server that will perform the speech synthesis service.
The invention uses a routing table to store the mapping between voice models and speech synthesis servers, so that the user's service request can be quickly directed to the designated server, which improves service efficiency.
According to an embodiment of the invention, the apparatus further comprises:
a data acquisition module, configured to acquire the audio data of the user and upload the audio data to the third-party file cluster.
If the current user's voice model is not stored in the system, the user needs to record audio through the client software after logging in. After the audio data is acquired, it is uploaded to an audio management service of the server, and the audio management service uploads it to a third-party file cluster such as OSS for storage and backup.
In addition, the invention specifies the script that the user reads when recording the audio data, so that the user does not need to record a training corpus for a long time and the distinctive characteristics of the user's voice can still be extracted, which improves the similarity of the customized timbre. Because the training corpus is small, training time is greatly shortened, which improves the efficiency of speech synthesis training as a whole.
A model training and storage module for:
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model in a speech synthesis server, and generating corresponding routing information.
Specifically, the audio management service starts an asynchronous command, informs the deep machine learning service to train a voice model of the user, and the deep machine learning service receives the command and downloads audio data from the third-party file cluster to prepare for model training.
The trained voice model captures the distinctive characteristics of the user's timbre. The voice model is stored in the designated speech synthesis server, and when a speech synthesis service request from the user is received, the target text is converted into speech with the user's timbre characteristics.
According to an embodiment of the present invention, the model training and storage module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and combining the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
Specifically, the deep machine learning service, based on a neural network model, extracts the characteristic features of the user from the small amount of audio data recorded by the user and combines them with a voice model pre-trained on general data to obtain a model specific to that user. The deep machine learning service then uploads the trained user-specific voice model to the third-party file cluster.
The deep machine learning service sends an asynchronous message notifying the voice synthesis management service that training of the voice model is complete. The voice synthesis management service downloads the current voice model from the third-party file cluster, selects a voice synthesis server from the voice synthesis service cluster, and notifies that server to load the voice model.
The voice synthesis server may be selected according to attributes such as the number of voice models stored on each server and the type of user to which the models belong, or the selection criteria may be set according to the goal of the current application scenario, so as to improve the efficiency of the voice synthesis service.
The invention assigns a designated server to each user and thereby provides a voice synthesis service for individual users.
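The sequence of training, uploading, server selection, loading, and routing-information generation described in this embodiment might be sketched as below. This is an illustrative sketch only: `SynthesisServer`, `train_voice_model`, and `deploy_user_model` are hypothetical names, the training step is a placeholder that returns opaque bytes, the file cluster and routing store are plain dictionaries, and the selection rule is passed in as a function because the description leaves the criteria open.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SynthesisServer:
    """Hypothetical voice synthesis server holding loaded models by ID."""
    server_id: str
    models: Dict[str, bytes] = field(default_factory=dict)

    def load_model(self, model_id: str, blob: bytes) -> None:
        self.models[model_id] = blob


def train_voice_model(audio: bytes) -> bytes:
    """Placeholder for neural-network training; returns opaque model bytes."""
    return b"model:" + audio[:8]


def deploy_user_model(
    user: str,
    audio: bytes,
    file_cluster: Dict[str, bytes],
    servers: List[SynthesisServer],
    routing: Dict[str, str],
    select: Callable[[List[SynthesisServer]], SynthesisServer],
) -> str:
    model_id = f"{user}-voice"           # ID format is an assumption
    key = f"models/{model_id}"
    file_cluster[key] = train_voice_model(audio)   # train, then upload
    server = select(servers)                        # choose a target server
    server.load_model(model_id, file_cluster[key])  # download and load
    routing[model_id] = server.server_id            # model ID -> server ID
    return model_id


file_cluster: Dict[str, bytes] = {}
servers = [SynthesisServer("tts-1"), SynthesisServer("tts-2")]
routing: Dict[str, str] = {}
model_id = deploy_user_model(
    "alice", b"recorded-audio", file_cluster, servers, routing,
    select=lambda candidates: candidates[0],
)
```

Keeping the selection rule as a parameter mirrors the statement that the criteria can be set per application scenario.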
According to an embodiment of the present invention, the model training and storage module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the fewest speech models to download the user's speech model.
The speech synthesis server with the fewest models is selected to store the newly trained speech model, which balances the load across the speech synthesis service cluster, improves the processing efficiency of the speech synthesis service, and improves the user experience.
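A minimal sketch of this selection rule follows, assuming the per-server model counts have already been collected into a mapping from server ID to count. The function name `select_least_loaded` and the tie-break by server ID are assumptions made for the example; the source does not say how ties are resolved.

```python
from typing import Mapping


def select_least_loaded(model_counts: Mapping[str, int]) -> str:
    """Return the ID of the server that currently stores the fewest models."""
    if not model_counts:
        raise ValueError("the speech synthesis service cluster has no servers")
    # Iterating over sorted IDs makes ties resolve to the smallest server ID.
    return min(sorted(model_counts), key=lambda sid: model_counts[sid])


counts = {"tts-1": 12, "tts-2": 7, "tts-3": 7}
target = select_least_loaded(counts)  # "tts-2": fewest models, smallest ID on a tie
```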
The voice synthesis apparatus provided by the invention stores the user's specific voice model on a designated voice synthesis server in advance and generates routing information from the voice model and the voice synthesis server. When the user submits a voice synthesis service request, the routing information is used to locate the voice synthesis server corresponding to the user directly, and that server serves the request. This provides a voice synthesis service for individual users at low cost and with a simple calling process. In addition, because the audio data that the user records is prescribed, the characteristic features of the user's voice can be extracted from a small amount of audio data, which improves the similarity of the customized timbre, shortens the time needed for audio recording and model training, and improves the overall efficiency of the voice synthesis service. Furthermore, because synthesizing speech in a specific timbre is convenient, the user can choose from more timbres than the preset ones, which improves the user experience. Moreover, a speech synthesis request that does not use a user-specific voice model is completed by the server with the highest synthesis efficiency, which improves overall service efficiency.
Fig. 7 illustrates a physical structure diagram of an electronic device. As shown in Fig. 7, the electronic device may include a processor 710, a communications interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communications interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform a speech synthesis method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; and returning the speech to the user.
In addition, the logic instructions in the memory 730 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
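The four steps of the method (acquire the request, determine the server from the routing information, forward, return) might be sketched as follows. The class and function names are hypothetical, the routing information is modeled as a dictionary from voice model ID to voice synthesis server ID, and `synthesize` returns placeholder bytes rather than real audio.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class ServiceRequest:
    username: str
    model_id: str
    text: str  # target text of the voice to be synthesized


class SynthesisServer:
    """Hypothetical voice synthesis server; synthesis itself is stubbed out."""

    def __init__(self, server_id: str, model_ids: Set[str]) -> None:
        self.server_id = server_id
        self.model_ids = model_ids

    def synthesize(self, request: ServiceRequest) -> bytes:
        if request.model_id not in self.model_ids:
            raise KeyError(f"model {request.model_id} not loaded on {self.server_id}")
        # Stand-in for audio output: tag the text with the model that voiced it.
        return f"[{request.model_id}] {request.text}".encode("utf-8")


def handle_request(request: ServiceRequest, routing: Dict[str, str],
                   cluster: Dict[str, SynthesisServer]) -> bytes:
    """Look up the server for the request's model ID and forward the request."""
    server_id = routing.get(request.model_id)
    if server_id is None:
        raise LookupError(f"no routing information for model {request.model_id}")
    return cluster[server_id].synthesize(request)


routing = {"alice-voice": "tts-2"}
cluster = {
    "tts-1": SynthesisServer("tts-1", set()),
    "tts-2": SynthesisServer("tts-2", {"alice-voice"}),
}
speech = handle_request(ServiceRequest("alice", "alice-voice", "hello"), routing, cluster)
```

Because the lookup is a single dictionary access, the gateway does not need to query every server to find the one holding the user's model.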
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a speech synthesis method provided by the above methods, the method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; returning the speech to the user.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the speech synthesis method provided above, the method comprising: acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized; determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information; forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech; and returning the speech to the user.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (12)
1. A method of speech synthesis, comprising:
acquiring a service request of a user, wherein the service request comprises a target text of a voice to be synthesized;
determining a voice synthesis server for storing a voice model corresponding to the user in a voice synthesis service cluster according to the service request and pre-stored routing information;
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
2. The speech synthesis method of claim 1, wherein the service request includes a username and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
determining a speech synthesis server storing a speech model corresponding to the user in a speech synthesis service cluster according to the service request and pre-stored routing information, comprising:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
3. The speech synthesis method of claim 2, wherein the method further comprises:
acquiring audio data of a user, and uploading the audio data to a third-party file cluster;
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
4. The method of claim 3, wherein training the corresponding speech model of the user according to the audio data to be stored in a speech synthesis server and generating corresponding routing information comprises:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
5. The method of claim 4, wherein selecting a speech synthesis server in the speech synthesis service cluster comprises:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
6. A speech synthesis apparatus, comprising:
a service request acquisition module, configured to acquire a service request of a user, wherein the service request comprises a target text of a voice to be synthesized;
the determining server module is used for determining a voice synthesis server for storing the voice model corresponding to the user in a voice synthesis service cluster according to the service request and the pre-stored routing information;
a speech synthesis module to:
forwarding the service request to the speech synthesis server to cause the speech synthesis server to synthesize the target text into speech;
returning the speech to the user.
7. The speech synthesis apparatus of claim 6, wherein the service request comprises a username and a speech model ID; the routing information comprises a voice model ID and a voice synthesis server ID;
the determination server module is further to:
searching the voice model ID in the pre-stored routing information, and acquiring a voice synthesis server ID corresponding to the voice model ID;
and determining a voice synthesis server corresponding to the user in a voice synthesis service cluster according to the ID of the voice synthesis server.
8. The speech synthesis apparatus of claim 7, wherein the apparatus further comprises:
the data acquisition module is used for acquiring audio data of a user and uploading the audio data to a third-party file cluster;
a model training and storage module for:
downloading the audio data from the third party file cluster;
and training the voice model corresponding to the user according to the audio data, storing the voice model into a voice synthesis server, and generating corresponding routing information.
9. The speech synthesis apparatus of claim 8, wherein the model training and storage module is further configured to:
training a voice model corresponding to the user according to the audio data based on a neural network model;
uploading the speech model to the third-party file cluster;
selecting a speech synthesis server in a speech synthesis service cluster;
downloading the voice model from the third-party file cluster and storing the voice model to the voice synthesis server;
and synthesizing the voice model ID and the voice synthesis server ID into the routing information and storing the routing information.
10. The speech synthesis apparatus of claim 8, wherein the model training and storage module is further configured to:
counting the number of voice models of each voice synthesis server in the voice synthesis service cluster;
and selecting the speech synthesis server with the least number of speech models to download the user speech models.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech synthesis method according to any of claims 1 to 5 are implemented when the program is executed by the processor.
12. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech synthesis method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110496900.XA CN113160791A (en) | 2021-05-07 | 2021-05-07 | Voice synthesis method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110496900.XA CN113160791A (en) | 2021-05-07 | 2021-05-07 | Voice synthesis method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113160791A (en) | 2021-07-23 |
Family
ID=76873690
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110496900.XA Pending CN113160791A (en) | 2021-05-07 | 2021-05-07 | Voice synthesis method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113160791A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170092258A1 (en) * | 2015-09-29 | 2017-03-30 | Yandex Europe Ag | Method and system for text-to-speech synthesis |
CN110060656A (en) * | 2019-05-05 | 2019-07-26 | 标贝(深圳)科技有限公司 | Model management and phoneme synthesizing method, device and system and storage medium |
CN110751940A (en) * | 2019-09-16 | 2020-02-04 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and computer storage medium for generating voice packet |
CN112185362A (en) * | 2020-09-24 | 2021-01-05 | 苏州思必驰信息科技有限公司 | Voice processing method and device for user personalized service |
CN112270920A (en) * | 2020-10-28 | 2021-01-26 | 北京百度网讯科技有限公司 | Voice synthesis method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6799574B2 (en) | Method and device for determining satisfaction with voice dialogue | |
CN108334540B (en) | Media information display method and device, storage medium and electronic device | |
CN110136691B (en) | Speech synthesis model training method and device, electronic equipment and storage medium | |
JP6786751B2 (en) | Voice connection synthesis processing methods and equipment, computer equipment and computer programs | |
US8086457B2 (en) | System and method for client voice building | |
CN111048064B (en) | Voice cloning method and device based on single speaker voice synthesis data set | |
WO2017059694A1 (en) | Speech imitation method and device | |
CN110176237A (en) | A kind of audio recognition method and device | |
CN102089804A (en) | Voice synthesis model generation device, voice synthesis model generation system, communication terminal device and method for generating voice synthesis model | |
CN107341102A (en) | A kind of test case file generation method and device | |
CN112614478B (en) | Audio training data processing method, device, equipment and storage medium | |
CN107733722A (en) | Method and apparatus for configuring voice service | |
CN109739605A (en) | The method and apparatus for generating information | |
CN111667557A (en) | Animation production method and device, storage medium and terminal | |
CN111968678B (en) | Audio data processing method, device, equipment and readable storage medium | |
CN110148393B (en) | Music generation method, device and system and data processing method | |
CN114373444B (en) | Method, system and equipment for synthesizing voice based on montage | |
CN109710747B (en) | Information processing method and device and electronic equipment | |
CN110600004A (en) | Voice synthesis playing method and device and storage medium | |
CN113327576A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN110797001A (en) | Method and device for generating voice audio of electronic book and readable storage medium | |
CN116737883A (en) | Man-machine interaction method, device, equipment and storage medium | |
CN113160791A (en) | Voice synthesis method and device, electronic equipment and storage medium | |
CN115690277A (en) | Video generation method, system, device, electronic equipment and computer storage medium | |
KR101015975B1 (en) | Method and system for generating RIA based character movie clip |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176. Applicant after: Jingdong Technology Holding Co.,Ltd. Address before: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176. Applicant before: Jingdong Digital Technology Holding Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210723 |