CN113241057A - Interactive method, apparatus, system and medium for speech synthesis model training - Google Patents

Interactive method, apparatus, system and medium for speech synthesis model training

Info

Publication number
CN113241057A
Authority
CN
China
Prior art keywords
training
user
text
speech synthesis
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110452288.6A
Other languages
Chinese (zh)
Other versions
CN113241057B (en)
Inventor
胡帅君
边会康
李世龙
李秀林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Databaker Beijng Technology Co ltd
Original Assignee
Databaker Beijng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Databaker Beijng Technology Co ltd filed Critical Databaker Beijng Technology Co ltd
Priority to CN202110452288.6A priority Critical patent/CN113241057B/en
Publication of CN113241057A publication Critical patent/CN113241057A/en
Application granted granted Critical
Publication of CN113241057B publication Critical patent/CN113241057B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 - Architecture of speech synthesisers
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01H - MEASUREMENT OF MECHANICAL VIBRATIONS OR ULTRASONIC, SONIC OR INFRASONIC WAVES
    • G01H 17/00 - Measuring mechanical vibrations or ultrasonic, sonic or infrasonic waves, not provided for in the preceding groups
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/12 - Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides an interactive method, apparatus, system and storage medium for implementing personalized speech synthesis model training. The method comprises: acquiring a user training text from a duplication service server; outputting the user training text; collecting the voice of a target user to obtain a user recording file; in the case that the text information contained in the user training text matches the text information expressed by the user recording file, uploading the user recording file to a model training server, directly or via the duplication service server, so that a personalized speech synthesis model dedicated to the target user is trained on the model training server based on the user recording file; receiving training result information of the personalized speech synthesis model from the model training server, directly or via the duplication service server; and outputting, based on the training result information, feedback information on whether training of the personalized speech synthesis model is completed. The client (or a target application on the client) is thereby enabled to support voice duplication.

Description

Interactive method, apparatus, system and medium for speech synthesis model training
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to an interactive method, an interactive device, an interactive system, and a storage medium for implementing personalized speech synthesis model training.
Background
Speech synthesis is a technology that converts text information into voice information. It can provide speech synthesis services for a wide range of users and target applications. As the technology has developed, personalized speech synthesis has matured, and users can clone their own dedicated voice according to personal preference. For example, in application scenarios such as a children's story machine or map navigation announcements, it is desirable to use the user's own customized voice for voice broadcast. To implement such a technique, it is generally necessary to collect the user's voice and train a speech synthesis model based on a provided text and the collected voice, so as to obtain a personalized speech synthesis model dedicated to that user.
What is currently lacking is a module that facilitates interaction between a client installed with a target application and a cloud server (e.g., a duplication service server and/or a model training server) to support training of personalized speech synthesis models.
Disclosure of Invention
To at least partially solve the problems in the prior art, an interactive method, apparatus, system, and storage medium for implementing personalized speech synthesis model training are provided.
According to one aspect of the invention, an interactive method for implementing personalized speech synthesis model training is provided, comprising client operations, the client operations including: acquiring a user training text from a duplication service server; outputting the user training text; collecting the voice of a target user to obtain a user recording file; in the case that the text information contained in the user training text matches the text information expressed by the user recording file, uploading the user recording file to a model training server, directly or via the duplication service server, so that a personalized speech synthesis model dedicated to the target user is trained on the model training server based on the user recording file; receiving training result information of the personalized speech synthesis model from the model training server, directly or via the duplication service server; and outputting, based on the training result information, feedback information on whether training of the personalized speech synthesis model is completed.
According to another aspect of the present invention, an interactive apparatus for implementing personalized speech synthesis model training is also provided, comprising a client-side operation module, the client-side operation module including: an obtaining submodule for obtaining a user training text from the duplication service server; a first output submodule for outputting the user training text; a collection submodule for collecting the voice of a target user to obtain a user recording file; an uploading submodule for uploading the user recording file to the model training server, directly or via the duplication service server, in the case that the text information contained in the user training text matches the text information expressed by the user recording file, so that a personalized speech synthesis model dedicated to the target user is trained on the model training server based on the user recording file; a receiving submodule for receiving training result information of the personalized speech synthesis model from the model training server, directly or via the duplication service server; and a second output submodule for outputting, based on the training result information, feedback information on whether training of the personalized speech synthesis model is completed.
According to another aspect of the present invention, an interactive system for implementing personalized speech synthesis model training is also provided, comprising a processor and a memory, wherein the memory stores computer program instructions that, when executed by the processor, perform the above interactive method for implementing personalized speech synthesis model training.
According to another aspect of the present invention, there is also provided a storage medium having stored thereon program instructions for executing the above-mentioned interactive method for implementing personalized speech synthesis model training when executed.
According to one aspect of the present invention, an interactive method for implementing personalized speech synthesis model training is provided, comprising server-side operations, the server-side operations including: sending the user training text to a duplication client; receiving a user recording file from the duplication client; in the case that the text information contained in the user training text matches the text information expressed by the user recording file, uploading the user recording file to a model training server, so that a personalized speech synthesis model dedicated to the target user is trained on the model training server based on the user recording file; receiving training result information of the personalized speech synthesis model from the model training server; and sending the training result information to the duplication client.
According to another aspect of the present invention, an interactive apparatus for implementing personalized speech synthesis model training is also provided, comprising a server-side operation module, the server-side operation module including: a first sending submodule for sending the user training text to the duplication client; a first receiving submodule for receiving the user recording file from the duplication client; an uploading submodule for uploading the user recording file to the model training server, in the case that the text information contained in the user training text matches the text information expressed by the user recording file, so that a personalized speech synthesis model dedicated to the target user is trained on the model training server based on the user recording file; a second receiving submodule for receiving training result information of the personalized speech synthesis model from the model training server; and a second sending submodule for sending the training result information to the duplication client.
According to another aspect of the present invention, an interactive system for implementing personalized speech synthesis model training is also provided, comprising a processor and a memory, wherein the memory stores computer program instructions that, when executed by the processor, perform the above interactive method for implementing personalized speech synthesis model training.
According to another aspect of the present invention, there is also provided a storage medium having stored thereon program instructions for executing the above-mentioned interactive method for implementing personalized speech synthesis model training when executed.
According to the interactive method, apparatus, system and storage medium for implementing personalized speech synthesis model training, the user training text can be automatically acquired from the server, the user recording file can be automatically uploaded, and training of the personalized speech synthesis model can be carried out by the server. This interaction scheme enables the client (or a target application on the client) to support voice duplication. In addition, the scheme helps to support automated cloud training of personalized speech synthesis models.
This summary introduces a selection of concepts in simplified form that are described in further detail in the detailed description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The advantages and features of the present invention are described in detail below with reference to the accompanying drawings.
Drawings
The following drawings of the invention are included to provide a further understanding of the invention. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings:
FIG. 1 shows a schematic diagram of a voice duplication procedure according to an embodiment of the invention;
FIG. 2 shows a schematic flow diagram of an interactive method for implementing personalized speech synthesis model training, according to one embodiment of the present invention;
FIGS. 3a-3b illustrate exemplary operations of a duplication SDK integrated in a target application on a duplication client;
FIG. 4 shows a schematic flow diagram of an interactive method for implementing personalized speech synthesis model training, according to one embodiment of the present invention;
FIG. 5 shows a schematic flow diagram of a method of training a personalized speech synthesis model according to one embodiment of the invention;
FIG. 6 illustrates a flow diagram of automated training of speech synthesis models and speech synthesis according to one embodiment of the present invention;
FIG. 7 shows a schematic flow diagram of a speech synthesis method according to one embodiment of the invention;
FIG. 8 shows a schematic block diagram of an interactive apparatus for implementing personalized speech synthesis model training according to one embodiment of the present invention;
FIG. 9 shows a schematic block diagram of an interactive system for implementing personalized speech synthesis model training, according to one embodiment of the present invention;
FIG. 10 shows a schematic block diagram of an interactive apparatus for implementing personalized speech synthesis model training according to one embodiment of the present invention;
FIG. 11 shows a schematic block diagram of an interactive system for implementing personalized speech synthesis model training, according to one embodiment of the present invention;
FIG. 12 shows a schematic block diagram of a training apparatus for personalized speech synthesis models according to one embodiment of the present invention;
FIG. 13 shows a schematic block diagram of a training system for personalized speech synthesis models, according to one embodiment of the present invention;
FIG. 14 shows a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present invention; and
FIG. 15 shows a schematic block diagram of a speech synthesis system according to one embodiment of the present invention.
Detailed Description
In the following description, numerous details are provided to provide a thorough understanding of the present invention. One skilled in the art, however, will understand that the following description merely illustrates a preferred embodiment of the invention and that the invention may be practiced without one or more of these details. In other instances, well known features have not been described in detail so as not to obscure the invention.
In order to at least partially solve the above technical problem, embodiments of the present invention provide an interactive method, apparatus, system and storage medium for implementing personalized speech synthesis model training. The method makes it convenient to realize the interaction between a client installed with a target application and a cloud server, and ultimately to train the required personalized speech synthesis model for a target user. The interaction technology for implementing personalized speech synthesis model training according to the embodiments of the present invention can be applied to any field that adopts speech synthesis technology.
It will be understood by those skilled in the art that voice duplication refers to a technique whereby, for any user, a personalized speech synthesis model specific to that user can be generated. Then, when the user's dedicated voice needs to be synthesized, the text to be synthesized can be input into that personalized speech synthesis model for processing, and the synthesized voice is ultimately consistent with the input text in content and consistent with the user's voice in timbre, prosody, and the like.
The voice duplication technology described herein is implemented based on cloud services and involves interaction between multiple clients and servers. Fig. 1 shows a schematic diagram of a voice duplication procedure according to an embodiment of the invention. As shown in fig. 1, the whole voice duplication process can be roughly divided into a voice acquisition stage, an audio detection and model training stage, and a speech synthesis stage. Optionally, the sound collection may be performed mainly at the client, the audio detection and model training may be performed mainly in the cloud (e.g., at the duplication service server and/or the model training server), and the speech synthesis may be performed mainly at the same or another cloud server (e.g., the synthesis service server). It should be understood that the division of the voice duplication stages and the locations (client or server) at which the stages are implemented are only examples and do not limit the present invention; voice duplication is not necessarily limited to the flow shown in fig. 1 or the implementation locations described, which can be understood in conjunction with the detailed description below.
It will be appreciated that, before speech synthesis, a personalized speech synthesis model of the user for whom it is obtained (referred to herein as the target user) may first be trained. Model training involves a client and a cloud server. The client may comprise, for example, a duplication client. The duplication client described herein may be a terminal device on which a target application is installed. The target application may be any application that requires a voice duplication function, such as a map navigation application, a book reader application, and so on. The target user may be any user of the target application. The terminal device may be a personal computer, a smart phone, a tablet computer, or some kind of server device, among others. Technical developers of voice duplication can package the voice duplication function into a software module in the form of a Software Development Kit (SDK) or a web programming language module (such as the JavaScript language, JS for short). The target application can embed the voice duplication function into its own application by downloading the SDK from an address provided by the voice service provider (who provides the voice duplication technology) or by accessing a browser address provided by the voice service provider to load the JS module. The target application may provide an interface and/or user interface for interacting with a user, to receive various instructions input by the user and perform corresponding operations. When a user uses the target application, the embedded voice duplication technology may be used through the interface and/or user interface provided by the target application. The duplication SDK may be installed in a client (e.g., the user's mobile phone) along with the target application, and the target application may communicate with the cloud servers not directly but through the duplication SDK.
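Purely as an illustration of how a target application might embed such a duplication SDK, a minimal TypeScript sketch follows. The `DuplicationSdk` class, its method names, the configuration fields, and the URL are assumptions introduced here for illustration and are not defined by this disclosure.

```typescript
// Hypothetical integration sketch; class, method and field names are illustrative only.
interface DuplicationConfig {
  clientId: string;      // issued by the voice service provider
  clientSecret: string;  // issued by the voice service provider
  serverBaseUrl: string; // address of the duplication service server (assumed)
}

class DuplicationSdk {
  constructor(private config: DuplicationConfig) {}

  // Authenticate against the authorization server and cache the duplication token.
  async init(): Promise<void> {
    // ... obtain and store the duplication token (see the token-exchange sketch below)
  }

  // Request a user training text from the duplication service server.
  async fetchTrainingText(): Promise<string[]> {
    // ... GET request carrying the duplication token
    return [];
  }
}

// The target application only talks to the SDK, never to the servers directly.
const sdk = new DuplicationSdk({
  clientId: "my-app-id",
  clientSecret: "my-app-secret",
  serverBaseUrl: "https://duplication.example.com",
});
void sdk.init();
```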
The following describes a scheme for implementing training of a personalized speech synthesis model by a client installed with the target application and a server interacting with the client.
In accordance with one aspect of the present invention, an interactive method for implementing personalized speech synthesis model training is disclosed. FIG. 2 shows a schematic flow diagram of an interactive method 200 for implementing personalized speech synthesis model training, according to one embodiment of the present invention. The interaction method 200 is applied to a client (hereinafter referred to as the duplication client) on which the above-described target application is installed. As shown in FIG. 2, the interactive method 200 for implementing personalized speech synthesis model training includes client operations comprising steps S210-S260.
In step S210, a user training text is obtained from the duplication service server.
The duplication client, or a target application on the duplication client (in the description herein, operations of the duplication client may be understood as operations of the target application on the duplication client), may interact with the duplication service server to obtain the user training text from it. The user training text may be any text. Optionally, the same user training text may be allocated to different target applications, or different user training texts may be allocated to different target applications. That is, the user training text obtained by the duplication client from the duplication service server may be a user training text associated with the target application that integrates the voice duplication function. For example, for a map navigation target application, the duplication service server may assign a user training text containing more city and road names. As another example, for a reading target application, the duplication service server may assign text from a certain novel as the user training text.
Alternatively, the user training text may be independent of the user, i.e., the received user training text is the same for different users of the same application. Of course, the user training text may also be user-specific, i.e., different user training texts may be assigned to different users.
Optionally, when the duplication client requests the user training text from the duplication service server, the duplication service server may return multiple sets of training texts to the duplication client, and the target application may select one set from the multiple sets to provide to the target user.
In step S220, the user training text is output. The output user training text may be available for viewing by the target user.
The user training text may be output via an output device of the duplication client. Illustratively, the output device may include, but is not limited to, a display screen and/or a speaker. The user training text can be displayed on the display screen in the form of text, images or video, can be output by the speaker in the form of audio, or can be output in any combination of text, image, video and audio forms. Outputting the user training text in audio form may be convenient for users with poor eyesight or who otherwise find it inconvenient to view the screen.
In step S230, the voice of the target user is collected to obtain a user recording file.
While the user training text is output for the target user to view, the voice of the target user can be collected to obtain the user recording file. Alternatively, a predetermined time after the user training text is output may be used as the voice collection period, and the voice of the user is collected during that period to obtain the user recording file. Alternatively, the duplication client may receive a recording start instruction and a recording completion instruction input by the user, and use the voice collected from the time the recording start instruction is received to the time the recording completion instruction is received as the required user recording file.
The target application may invoke a microphone of the duplication client, through which the speech of the target user is collected. The manner in which the microphone is invoked and sound is picked up is understood by those skilled in the art and is not described in detail herein.
In step S240, in the case that the text information contained in the user training text matches the text information expressed by the user recording file, the user recording file is uploaded to the model training server, directly or via the duplication service server, so as to train a personalized speech synthesis model dedicated to the target user on the model training server based on the user recording file.
Before the user recording file is uploaded to the model training server, directly or via the duplication service server, or while it is being uploaded, a speech-text matching operation may be performed by the duplication client or the duplication service server. The speech-text matching operation includes: judging whether the text information contained in the user training text matches the text information expressed by the user recording file. Preferably, the speech-text matching operation is performed by the duplication service server.
Exemplary implementations of the speech-to-text matching operation will be described in detail below, and are not described in detail herein.
In the case that the text information contained in the user training text matches the text information expressed by the user recording file, the complete user recording file can be uploaded to the model training server, and the model training server may then carry out training of the speech synthesis model.
In one example, while uploading the user recording file to the model training server, directly or via the duplication service server, the duplication client may also upload the user training text to the model training server together with it. The model training server may then perform training of the speech synthesis model based on the user recording file and the user training text. In another example, the model training server may perform speech recognition on the user recording file, identify corresponding text, prosody, user tags and other information from it, and perform training of the speech synthesis model based on the user recording file and the speech recognition results. In this case, uploading the user training text to the model training server may not be required.
In step S250, training result information of the personalized speech synthesis model is received from the model training server directly or via the duplication service server.
In step S260, feedback information on whether training of the personalized speech synthesis model is completed is output based on the training result information.
The feedback information may be output through an output device of the duplication client. Similar to the user training text, the feedback information may be displayed on the display screen in the form of characters, images or videos, may be output by a speaker in the form of audio, and may be output in any combination of the forms of characters, images, videos, audios, and the like.
According to the interactive method for implementing personalized speech synthesis model training described above, the user training text can be automatically obtained from the server, the user recording file can be automatically uploaded, and training of the personalized speech synthesis model can be carried out by the server. This interaction scheme enables the client (or a target application on the client) to support voice duplication. In addition, the scheme helps to support automated cloud training of personalized speech synthesis models.
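To make the sequence of steps S210 to S260 easier to follow, a minimal TypeScript sketch of the client-side flow is given below. The helper methods passed in (`fetchTrainingText`, `recordUserVoice`, `uploadRecording`, `pollTrainingResult`) are hypothetical placeholders for whatever the SDK actually provides and are not part of this disclosure.

```typescript
// Illustrative orchestration of steps S210-S260; all helper methods are assumed.
async function runDuplicationFlow(sdk: {
  fetchTrainingText(): Promise<string>;             // S210: get the user training text
  recordUserVoice(text: string): Promise<Blob>;     // S230: collect the target user's speech
  uploadRecording(audio: Blob): Promise<void>;      // S240: upload, directly or via the service server
  pollTrainingResult(): Promise<{ done: boolean }>; // S250: training result information
}): Promise<void> {
  const trainingText = await sdk.fetchTrainingText();
  console.log(trainingText);                        // S220: output the text for the target user
  const recording = await sdk.recordUserVoice(trainingText);
  await sdk.uploadRecording(recording);             // only once text/speech matching has succeeded
  const result = await sdk.pollTrainingResult();
  console.log(result.done
    ? "Personalized speech synthesis model training finished."
    : "Model training is still in progress.");      // S260: feedback to the target user
}
```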
One or more of the model training server, the duplication service server, and the authorization server need to interact with the duplication client, and more specifically with the duplication SDK (or the loaded JS module) integrated into the target application on the duplication client. For convenience of description, the interaction flow between the duplication client and the servers is described below mainly using the duplication SDK as an example.
Referring to figs. 3a-3b, exemplary operations of the duplication SDK are shown. Below the dotted line is the Demo, i.e., the information perceived or the operations performed by the target application. Above the dotted line are the operations actually performed by the duplication SDK in the background. The operations performed by the duplication SDK mainly include noise detection and recording. The noise detection operation is optional.
First, the target application may interact with the target user and receive a duplication start instruction input by the target user. After receiving the duplication start instruction, the target application may first initialize the duplication SDK, which initializes in the background. After initialization is completed, the duplication SDK may upload the client identifier (ClientId) and the client secret (ClientSecret) of the target application to the authorization server for authentication and, if authentication passes, may receive a duplication token returned by the authorization server.
After the developer of the target application purchases the right to use the duplication SDK, the voice service provider can send the ClientId and ClientSecret to the target application by e-mail, text message or other means. The ClientId and ClientSecret are used to verify the identity of the target application and to determine whether it has the right to use the voice duplication service. Different target applications can have different rights, and the authorization server can store the rights of each target application and verify them when the target application applies to use the voice duplication service. Alternatively, the authorization server may employ an authorization protocol such as the OAuth 2.0 Client Credentials grant to expose authorization capabilities and authorize at the application level.
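For illustration, a minimal sketch of such a client-credentials token exchange is shown below; the token endpoint URL, parameter names and response shape are assumptions, since the disclosure only names the protocol family (OAuth 2.0 Client Credentials).

```typescript
// Minimal sketch of an OAuth 2.0 client-credentials exchange; endpoint and field names are assumed.
async function fetchDuplicationToken(
  clientId: string,
  clientSecret: string,
  tokenUrl = "https://auth.example.com/oauth2/token" // hypothetical authorization server endpoint
): Promise<string> {
  const body = new URLSearchParams({
    grant_type: "client_credentials",
    client_id: clientId,
    client_secret: clientSecret,
  });
  const res = await fetch(tokenUrl, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body,
  });
  if (!res.ok) {
    throw new Error(`authorization failed: ${res.status}`); // identity or rights check did not pass
  }
  const json = (await res.json()) as { access_token: string };
  return json.access_token; // used as the duplication token on later requests
}
```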
After the duplication SDK obtains the duplication token, the token can be used as a valid credential for interacting with the servers, for example, to request the user training text from the duplication service server, or to apply for model training resources from the duplication service server or the model training server.
The duplication SDK may first obtain the user training text (shown as the recording text in figs. 3a-3b) from the duplication service server using the duplication token. In addition, the duplication SDK can perform noise detection asynchronously with the acquisition of the user training text.
After the noise detection passes and the user training text has been obtained, the duplication SDK can start formal recording. The user training text may be output at the duplication client at this time. Meanwhile, the target user may speak according to the user training text, and the duplication SDK collects the target user's voice through a microphone to obtain the user recording file (shown as the sound file in figs. 3a-3b).
The duplication SDK uploads the collected user recording file to the model training server, directly or via the duplication service server, and the model training server can perform model training based on the user recording file to obtain the required personalized speech synthesis model. The model training server can persistently store the trained personalized speech synthesis model, so that the corresponding personalized speech synthesis model can be retrieved and used for synthesis whenever speech needs to be synthesized for the target user.
As can be seen from the above description, the whole voice duplication process involves many interactions between the client and the servers. The interaction method 200 mainly covers the duplication client-side operations. Some specific implementations of the interaction method 200 are described below.
According to an embodiment of the invention, the user training text comprises at least one text segment, and the user recording file comprises at least one qualified speech segment file corresponding to the at least one text segment,
where collecting the voice of the target user to obtain the user recording file comprises:
step a: determining the first text segment in the user training text as the current text segment;
step b: collecting the voice of the target user to obtain a temporary speech segment file corresponding to the current text segment;
step c: in the case that the text information contained in the current text segment matches the text information expressed by the temporary speech segment file, determining that the temporary speech segment file is the qualified speech segment file corresponding to the current text segment and judging whether the current text segment is the last text segment; if the current text segment is not the last text segment, determining the next text segment as the new current text segment and returning to step b; if the current text segment is the last text segment, determining that the user recording file has been completely collected;
step d: in the case that the text information contained in the current text segment does not match the text information expressed by the temporary speech segment file, if a first preset condition is met, discarding the temporary speech segment file and returning to step b; if the first preset condition is not met, outputting a recording failure prompt.
For any current text segment, if the text information expressed by the collected temporary speech segment file does not match the text information contained in the current text segment, the temporary speech segment file is discarded until a matching temporary speech segment file is found, and the matching temporary speech segment file is determined to be the required qualified speech segment file. All qualified speech segment files corresponding to all text segments together form the user recording file.
Optionally, in step d, in the case that the text information contained in the current text segment does not match the text information expressed by the temporary speech segment file, if the first preset condition is met, a re-recording prompt may additionally be output for the target user to view, so as to prompt the target user to record again. If the first preset condition is not met, a recording failure prompt is output for the target user to view.
As used herein, "at least one" is equivalent to "one or more". Illustratively, the duplication client may upload the user recording file directly to the model training server. In one embodiment, the user recording file may be uploaded to the model training server all at once. In another embodiment, the user recording file may be uploaded to the model training server in several separate batches. In the latter embodiment, the user recording file may be checked while it is being uploaded. For example, the user recording file may be divided into multiple segments, and after each segment is collected and uploaded, the model training server may check whether that segment of the recording (i.e., the speech segment file) matches the corresponding training text (i.e., the text segment), and the next segment is collected and uploaded only if they match. For example, the user training text can be divided into 12 segments of 50 characters each, 600 characters in total; each time, the duplication client displays the current 50 characters, the user reads them aloud, the duplication client collects and uploads the user's voice, and the next 50 characters are displayed only after the voice passes the check.
As another example, the duplication client may first upload the user recording file to the duplication service server, which then uploads it to the model training server. In one embodiment, the duplication client may upload the user recording file to the duplication service server all at once. In another embodiment, the duplication client may upload the user recording file to the duplication service server in several separate batches. In the latter embodiment, the user recording file may likewise be divided into multiple segments; after each segment is collected, it is uploaded to the duplication service server, which checks whether that segment of the recording (i.e., the speech segment file) matches the corresponding training text (i.e., the text segment), and the next segment is collected and uploaded only if they match. Finally, after the entire user recording file passes the checks, the duplication service server can upload it to the model training server.
It will be appreciated that, in the case where the user training text comprises only one text segment and the user recording file comprises only one qualified speech segment file, the text segment is the user training text itself and the qualified speech segment file is the user recording file itself. That is, this case corresponds to the above-described embodiment of uploading the user recording file all at once.
In step b, the voice of the target user may be collected within the current voice collection period to obtain the temporary speech segment file corresponding to the current text segment. For example, each time the current text segment is output, or step b is executed again for the current text segment, a subsequent predetermined period of time may be determined as the current voice collection period, and the user's voice is collected during that period to obtain the temporary speech segment file. Optionally, after outputting the current text segment, or when performing step b again for the current text segment, the duplication client may receive a recording start instruction and a recording completion instruction input by the user, determine the period from the time the recording start instruction is received to the time the recording completion instruction is received as the current voice collection period, and collect the user's voice within that period to obtain the temporary speech segment file.
This scheme of segmented collection, checking and uploading makes it possible to detect and correct the user's recording errors in time, and helps reduce the user's workload, resulting in a better user experience. A sketch of the loop is given below.
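A minimal sketch of the segmented collect-check-upload loop of steps a through d follows; the retry limit and the helper functions `recordSegment`, `checkMatch` and `uploadSegment` are illustrative assumptions rather than part of this disclosure.

```typescript
// Sketch of steps a-d: record each text segment, verify it, retry up to a limit, then move on.
async function collectRecording(
  textSegments: string[],                                      // user training text split into segments
  recordSegment: (text: string) => Promise<Blob>,              // collect a temporary speech segment file
  checkMatch: (text: string, audio: Blob) => Promise<boolean>, // speech-text matching (client or server side)
  uploadSegment: (audio: Blob) => Promise<void>,               // upload a qualified speech segment file
  maxAttempts = 3                                              // first count threshold (assumed value)
): Promise<boolean> {
  for (const segment of textSegments) {                        // step a / step c: advance segment by segment
    let attempts = 0;
    let matched = false;
    while (!matched) {
      attempts += 1;
      const audio = await recordSegment(segment);              // step b
      matched = await checkMatch(segment, audio);              // step c / step d
      if (matched) {
        await uploadSegment(audio);                            // qualified speech segment file
      } else if (attempts >= maxAttempts) {
        console.log("Recording failed, please start over.");   // recording failure prompt
        return false;
      }
      // otherwise the temporary file is discarded and the segment is re-recorded
    }
  }
  return true;                                                 // the whole user recording file is collected
}
```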
According to an embodiment of the present invention, the first preset condition includes: the number of times step b has been performed for the current text segment does not exceed a first count threshold.
The first count threshold may be any suitable value, which may be set as desired. For example, the first count threshold may be 3, 5, 10, or the like. For a given text segment, if the user's recording fails to match several times in a row (i.e., when the first count threshold is reached), recording and checking for that text segment may be stopped, the user may be told that the recording has failed, and the user may be required to start recording again from the beginning, return to the initial page, or perform other operations. The scheme of deciding whether to stop re-recording based on the number of times step b has been performed is merely an example and does not limit the present invention; the first preset condition may be set to other suitable conditions. For example, the first preset condition may alternatively include: the total recording duration for the current text segment does not exceed a first duration threshold.
According to an embodiment of the present invention, before collecting the voice of the target user to obtain the temporary speech segment file corresponding to the current text segment, step b further includes: outputting the current text segment.
For example, the duplication client may output the user training text all at once, or may output it several times. If the duplication client outputs the user training text all at once, the user's voice may be collected and uploaded all at once, or collected and uploaded in segments. If the duplication client outputs the user training text several times, one text segment can be output at a time, the user's voice is collected for the current text segment, the speech-text matching operation is performed on the collected current speech segment file, and the next text segment is output only after the match passes.
According to an embodiment of the present invention, before step c, collecting the voice of the target user to obtain the user recording file further includes: uploading the temporary speech segment file to the duplication service server, so that speech recognition is performed on the temporary speech segment file at the duplication service server, and the duplication service server judges, based on the speech recognition result, whether the text information contained in the current text segment matches the text information expressed by the temporary speech segment file; and receiving, from the duplication service server, a matching result indicating whether the text information contained in the current text segment matches the text information expressed by the temporary speech segment file.
As described above, the user recording file may be collected and uploaded to the duplication service server in segments, with the speech-text matching operation performed at the duplication service server; the next segment is collected and uploaded only after the current segment passes the matching check. Finally, all speech segment files are uploaded to the model training server by the duplication service server. Figure 3b shows such an embodiment.
According to an embodiment of the invention, receiving, from the duplication service server, the matching result indicating whether the text information contained in the current text segment matches the text information expressed by the temporary speech segment file includes: obtaining the matching result via a first callback function.
The target application (or the duplication SDK) may create a callback function and initiate a request to the party that will invoke the callback (e.g., the duplication service server), while the target application (or the duplication SDK) carries on with other tasks. Once the data required by the target application (or the duplication SDK), such as the matching result, is ready, the caller passes the data into the callback function; the target application (or the duplication SDK) is notified immediately and can obtain the required data through the callback function. If the data needed by the target application (or the duplication SDK) does not exist, the caller may instead place an empty result in the callback function, and the target application (or the duplication SDK) can check whether the callback function has received the required data.
Returning the required data through a callback function allows the target application (or the duplication SDK) to carry out other tasks in parallel instead of waiting for the result. This can improve the processing efficiency of the target application (or the duplication SDK).
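Purely as an illustration, the callback mechanism described above might look like the following sketch; the function names and the shape of the match result are assumptions.

```typescript
// Illustrative first callback: the caller fills in the match result (or null) when it is ready,
// so the SDK does not have to block while waiting.
type MatchCallback = (result: { matched: boolean } | null) => void;

function requestMatchResult(
  segmentId: string,
  askServer: (segmentId: string, cb: MatchCallback) => void // hypothetical request to the service server
): void {
  askServer(segmentId, (result) => {
    if (result === null) {
      console.log("No match result available yet.");         // empty result placed by the caller
    } else {
      console.log(result.matched ? "Segment matched." : "Segment did not match.");
    }
  });
  // The SDK can keep doing other work here instead of waiting for the result.
}
```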
According to an embodiment of the present invention, before step c, collecting the voice of the target user to obtain the user recording file further includes: uploading the temporary speech segment file to the duplication service server so that speech recognition is performed on the temporary speech segment file at the duplication service server; receiving the speech recognition result returned by the duplication service server; and judging, based on the speech recognition result, whether the text information contained in the current text segment matches the text information expressed by the temporary speech segment file.
According to an embodiment of the invention, receiving the speech recognition result returned by the duplication service server includes: obtaining the speech recognition result via a sixth callback function.
According to an embodiment of the present invention, before step c, collecting the voice of the target user to obtain the user recording file further includes: performing speech recognition on the temporary speech segment file; and judging, based on the speech recognition result, whether the text information contained in the current text segment matches the text information expressed by the temporary speech segment file.
According to an embodiment of the present invention, before step c, collecting the voice of the target user to obtain the user recording file further includes: outputting, in real time, the text information expressed by the temporary speech segment file.
The text information expressed by the temporary speech segment file is output in real time for the target user to view. Once the text information expressed by the temporary speech segment file has been determined through speech recognition, the recognition result can be output in real time. The target user can thus see at a glance whether the pronunciation is correct, correct pronunciation errors in time, and produce the correct speech as soon as possible.
Where speech recognition on the temporary speech segment file is performed by the duplication service server, the duplication client may receive the speech recognition result (i.e., the text information expressed by the temporary speech segment file) from the duplication service server and output it. Where speech recognition on the temporary speech segment file is performed by the duplication client, the speech recognition result can be output directly.
According to an embodiment of the invention, before collecting the voice of the target user to obtain the user recording file, the client operations further comprise a noise detection operation, the noise detection operation comprising: step e: collecting ambient sound to obtain ambient sound data; and step f: performing noise detection on the ambient sound data; wherein the step of collecting the voice of the target user to obtain the user recording file is performed on condition that the noise detection passes. Alternatively, the step of outputting the user training text may be performed on condition that the noise detection passes.
As shown in fig. 3a, noise detection may be performed by the duplication SDK prior to formal recording. In addition, microphone permission detection may be performed before noise detection is started and while it is running, and may be performed frequently.
According to an embodiment of the present invention, step e comprises: collecting a decibel value of the ambient sound once every preset time interval to obtain a first preset number of decibel values, the ambient sound data comprising the first preset number of decibel values; and step f comprises: judging whether, among the first preset number of decibel values, a second preset number of decibel values exceed a first decibel threshold; if so, determining that the noise detection fails, and otherwise determining that the noise detection passes.
In step e, the decibel value is sampled a first preset number of times to obtain the first preset number of decibel values. The first preset number and the second preset number may be set as needed and may be any suitable size, which is not limited herein. Similarly, the first decibel threshold may be set as desired and may be any suitable value, which is not limited herein. Illustratively, the first decibel threshold may be 60 decibels, 70 decibels, 80 decibels, and so forth.
An example follows. In this example, the first preset number is 12, the second preset number is 3, and the first decibel threshold is 70 decibels. For instance, detection may last 3 seconds in total, with a decibel value collected every 250 milliseconds, giving 12 decibel values. If any 3 or more of the 12 decibel values (whether adjacent or not) exceed 70 decibels, the noise detection is considered to have failed, in which case a prompt may be output to inform the user that noise detection failed. Conversely, if fewer than 3 of the 12 decibel values exceed 70 decibels (i.e., at least 10 decibel values do not exceed 70 decibels), the noise detection is considered to have passed, and a prompt message may likewise be output to inform the user that noise detection passed.
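As a worked illustration of this example (12 samples, one every 250 ms, a 70-decibel threshold, failure on 3 or more loud samples), a sketch follows; the `readDecibel` sampler is an assumed helper, and the default values simply mirror the example above.

```typescript
// Sketch of steps e and f with the example values: 12 samples, 250 ms apart, fail if 3+ exceed 70 dB.
async function detectNoise(
  readDecibel: () => Promise<number>, // assumed helper returning the current ambient level in dB
  sampleCount = 12,                   // first preset number
  intervalMs = 250,                   // preset time interval
  failCount = 3,                      // second preset number
  dbThreshold = 70                    // first decibel threshold
): Promise<{ passed: boolean; samples: number[] }> {
  const samples: number[] = [];
  for (let i = 0; i < sampleCount; i++) {
    samples.push(await readDecibel());
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  const loud = samples.filter((db) => db > dbThreshold);
  return { passed: loud.length < failCount, samples }; // 3 or more loud samples means detection fails
}
```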
According to an embodiment of the present invention, after step f, the noise detection operation further includes: in the case that the noise detection fails, averaging all decibel values among the first preset number of decibel values that exceed the first decibel threshold to obtain a first average value, and outputting the first average value; and/or, in the case that the noise detection passes, averaging all decibel values in a first decibel set to obtain a second average value, and outputting the second average value, where the first decibel set comprises at least some of the decibel values in a second decibel set, and the second decibel set comprises all decibel values among the first preset number of decibel values that do not exceed the first decibel threshold.
If the noise detection fails, a failure value may optionally be output. Alternatively, the failure value may be the average of all decibel values that exceed the first decibel threshold. For example, assuming that 4 of the 12 collected decibel values exceed 70 decibels, those 4 values may be averaged and the average output. This helps the user to know the current noise level and thus to choose a more suitable location and/or time for voice duplication.
Similarly, if the noise detection passes, a pass value may optionally be output. Alternatively, the pass value may be the average of all decibel values among the first preset number of decibel values that do not exceed the first decibel threshold. Optionally, the pass value may be the average of all decibel values among the first preset number of decibel values that do not exceed the first decibel threshold and are higher than the second decibel threshold. For example, if 10 of the 12 collected decibel values do not exceed 70 decibels, those 10 values may be averaged and the average output. This helps the user to know the current noise level.
According to an embodiment of the present invention, before averaging all decibel values in the first decibel set to obtain the second average value, the noise detection operation further includes: in the case that the noise detection passes, finding the decibel values in the second decibel set that are lower than a second decibel threshold and discarding them; and determining that the first decibel set comprises all decibel values retained in the second decibel set.
The second decibel threshold can be set as needed and can be any suitable value, which is not limited herein. Illustratively, the second decibel threshold may be 10 decibels, 20 decibels, 30 decibels, and so forth. For example, assuming that 10 of the 12 collected decibel values do not exceed 70 decibels and one of those 10 values is lower than 20 decibels, the lower value may be discarded, the remaining 9 decibel values averaged, and the average output. An excessively low decibel value may be a faulty noise measurement, i.e., invalid data; discarding it reduces the error of the noise detection.
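Continuing the same example, the failure and pass average values described above might be computed as follows; the 20-decibel lower cutoff stands in for the second decibel threshold and is an assumed value.

```typescript
// Sketch of the feedback averages: failure value over loud samples, pass value over the kept quiet samples.
function noiseFeedback(
  samples: number[],
  passed: boolean,
  dbThreshold = 70, // first decibel threshold
  lowCutoff = 20    // second decibel threshold (assumed value); lower readings are treated as invalid
): number {
  const average = (xs: number[]) =>
    xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
  if (!passed) {
    // first average value: mean of the samples that exceeded the threshold
    return average(samples.filter((db) => db > dbThreshold));
  }
  // second average value: mean of the samples at or below the threshold,
  // after discarding implausibly low readings
  return average(samples.filter((db) => db <= dbThreshold && db >= lowCutoff));
}
```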
According to an embodiment of the present invention, during step e, the noise detection operation further includes: returning each collected decibel value via a second callback function; and outputting, in real time, the decibel value obtained through the second callback function.
The decibel value obtained through the second callback function is output in real time for the target user to view, so that the user can learn the noise condition of the current environment in more detail.
Referring to fig. 3a, during noise detection, each collected decibel value may be returned via a callback; the advantages of callbacks are described above and are not repeated here.
According to an embodiment of the present invention, before the noise detection operation, the client operation further includes: noise configuration information is received from the duplication service server, and the noise configuration information comprises one or more items of a first preset number, a second preset number and a first decibel threshold value.
In an embodiment employing the second decibel threshold described above, the noise configuration information may include one or more of the first preset number, the second preset number, the first decibel threshold, and the second decibel threshold.
The information of the first preset number, the second preset number and the first decibel threshold value can be set by the repeated carving service server as configuration information, and the repeated carving SDK can acquire the information from the repeated carving service server and perform subsequent noise detection processing based on the acquired information.
According to the embodiment of the invention, the step of obtaining the user training text from the repeated carving service server comprises the following steps: acquiring at least one preset training text corresponding to a target application from a repeated carving service server, wherein the target application is an application for realizing client operation; receiving noise configuration information from the duplication service server includes: receiving at least one group of preset configuration information corresponding to at least one preset training text one by one from a repeated carving service server; prior to the noise detection operation, the client operation further comprises: selecting a specific preset training text from at least one preset training text as a user training text; and selecting preset configuration information corresponding to the training text of the user from at least one group of preset configuration information as noise configuration information.
As described above, multiple sets of training text may be assigned to each target application, with the target application itself deciding which set of text to assign to a given user. For example, some users may be assigned longer user training texts and other users shorter ones. Distributing a plurality of sets of training texts to the target application gives it more options, so that it can provide different training texts on different occasions or to different users according to its own requirements.
The noise configuration information may be uniformly set, that is, the same noise configuration information is set for all the preset training texts. The noise configuration information may also be set individually, i.e. each preset training text may have its own proprietary noise configuration information, and different preset training texts may have different noise configuration information.
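As one possible illustration of the per-text configuration described above, the hypothetical sketch below pairs each preset training text with its own noise configuration; the identifiers, field names, and values are examples only and are not specified by the present invention.

```python
# Hypothetical preset data delivered by the re-engraving service server.
preset_texts = {
    "text_001": "Short sample sentence for recording.",
    "text_002": "A longer passage intended for users who record extended material.",
}
preset_noise_config = {
    "text_001": {"first_preset_number": 12, "second_preset_number": 2, "first_db_threshold": 70},
    "text_002": {"first_preset_number": 20, "second_preset_number": 3, "first_db_threshold": 65},
}

def choose_training_text(text_id: str):
    """Return the selected user training text and its matching noise configuration."""
    return preset_texts[text_id], preset_noise_config[text_id]

user_text, noise_cfg = choose_training_text("text_001")
```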
According to the embodiment of the invention, the step of obtaining the user training text from the repeated carving service server comprises the following steps: acquiring at least one preset training text corresponding to a target application from the repeated carving service server, wherein the target application is an application for realizing the client operation; prior to the noise detection operation, the client operation further comprises: selecting a specific preset training text from the at least one preset training text as the user training text; and uploading the user training text or its identification information to the repeated carving service server, so that the repeated carving service server selects, from at least one group of preset configuration information corresponding one-to-one to the at least one preset training text, the preset configuration information corresponding to the user training text as the noise configuration information. The duplication client may then receive the selected noise configuration information from the duplication service server without receiving all of the at least one group of preset configuration information.
The re-engraving client can upload either the selected user training text itself or the identification information of the user training text to the re-engraving service server, as long as the re-engraving service server can determine which training text was finally selected by the re-engraving client. The identification information of the user training text may be any information that can be used to identify it; for example, each preset training text may have its own unique number, which then serves as its identification information.
According to an embodiment of the present invention, after step f, the noise detection operation further includes: and under the condition that the noise detection is not passed, if the second preset condition is met, returning to the step e, and if the second preset condition is not met, outputting noise detection failure information.
According to an embodiment of the present invention, the second preset condition includes: the number of times the noise detection fails is less than the second number threshold.
The second number threshold may be any suitable value, which may be set as desired. For example, the second number threshold may be 3, 5, 10, etc. If noise detection fails too many times and exceeds this threshold, noise detection can be stopped and noise detection failure information is output to prompt the target user to adjust the sound collection place and/or time and perform voice re-engraving in a better environment. Determining whether to stop repeating the noise detection based on the number of failed detections is merely an example and not a limitation of the present invention, and the second preset condition may be set to other suitable conditions. For example, the second preset condition may further include: the total duration of the noise detection does not exceed a second time threshold.
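The retry behaviour around steps e and f might look like the following sketch; the retry budget, callback names, and pause interval are assumptions, not values prescribed by the present invention.

```python
import time

SECOND_NUMBER_THRESHOLD = 3          # assumed maximum number of failed detections

def run_noise_detection(detect_once, on_failure, pause_seconds=1.0):
    """Repeat one round of noise detection until it passes or the retry budget is spent.

    detect_once: callable returning the average decibel value on success, or None on failure.
    on_failure:  callable invoked with a message when detection is finally abandoned.
    """
    failures = 0
    while True:
        average_db = detect_once()
        if average_db is not None:
            return average_db                            # noise detection passed
        failures += 1
        if failures >= SECOND_NUMBER_THRESHOLD:          # second preset condition no longer met
            on_failure("Noise detection failed; please record in a quieter environment.")
            return None
        time.sleep(pause_seconds)                        # brief pause before returning to step e
```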
According to the embodiment of the invention, the noise detection operation and the step of acquiring the user training text from the copy service server are asynchronously executed through an asynchronous thread.
As shown in fig. 3a, the noise detection operation and the step of obtaining the user training text may be performed asynchronously. For example, after initialization of the SDK is completed, the re-engraving SDK completes authentication to obtain the re-engraving token and may then obtain the user training text, while the noise detection operation runs in the background. Executing the two operations asynchronously greatly saves time and improves the efficiency of voice re-engraving.
According to the embodiment of the present invention, before step e and during the execution of step e, the client operation further includes: detecting whether a target application has microphone permission, wherein the target application is an application for realizing client operation; wherein step e is performed if the target application has microphone permission.
According to the embodiment of the present invention, before acquiring the voice of the target user to obtain the user audio file and during the execution process of acquiring the voice of the target user to obtain the user audio file, the client operation further includes: detecting whether a target application has microphone permission, wherein the target application is an application for realizing client operation; the step of acquiring the voice of the target user to obtain the user recording file is executed under the condition that the target application has the microphone authority.
Referring to fig. 3a, the re-engraving SDK may perform the microphone permission detection step before noise detection, before formal recording, or during the execution of these steps. The re-engraving SDK only detects whether the microphone permission exists; if not, it can inform the target application that microphone permission detection has failed, and the target application then asks the user to grant the microphone permission.
According to the embodiment of the present invention, after detecting whether the target application has the microphone right, the client operation further includes: and if the target application does not have the microphone permission, returning notification information about the failure of the microphone permission to the target application so as to output permission application information by the target application, wherein the permission application information is used for applying the microphone permission to the target user.
The implementation of applying for microphone authorization will be understood by those skilled in the art and will not be described in detail here.
According to the embodiment of the present invention, before obtaining the user training text from the duplication service server, the client operation further includes: uploading an authorization request to an authorization server to authenticate on the authorization server, wherein the authorization request comprises a client identifier and a client password of a target application, and the target application is an application for realizing client operation; receiving a re-carving token sent by an authorization server, wherein the re-carving token is generated by the authorization server based on a client identifier and a client password under the condition that the target application passes authentication, and the re-carving token is used for representing the legality of the target application; wherein the step of obtaining the user training text from the copy service server is performed in the event that the copy token is received.
Referring to fig. 3a, the re-engraving SDK may first upload the ClientID and ClientSecret of the target application to the authorization server. The authorization server judges, based on the ClientID and ClientSecret, whether the target application has the re-engraving right. If so, the re-engraving token may be returned to the re-engraving SDK. The number of re-engraving operations allowed per target application may be limited, for example, 100 times for application A and only 50 times for application B. Once the number of re-engraving operations reaches the threshold, either the re-engraving token is no longer issued, or the token is still issued normally but the re-engraving service server subsequently verifies the count, so that an application that has reached its re-engraving count threshold cannot obtain a model training resource.
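A minimal sketch of this authentication exchange follows, assuming a hypothetical HTTP endpoint and response format; the patent only specifies that the ClientID and ClientSecret are uploaded and that a re-engraving token is returned on success.

```python
from typing import Optional
import requests

AUTH_SERVER_URL = "https://auth.example.com/token"      # hypothetical endpoint

def fetch_reengraving_token(client_id: str, client_secret: str) -> Optional[str]:
    """Exchange the target application's credentials for a re-engraving token."""
    resp = requests.post(AUTH_SERVER_URL,
                         json={"client_id": client_id, "client_secret": client_secret})
    if resp.status_code != 200:
        return None                                      # authentication failed, no token issued
    return resp.json().get("reengraving_token")          # assumed response field name
```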
According to the embodiment of the present invention, before obtaining the user training text from the duplication service server, the client operation further includes: uploading an authorization request to an authorization server for authentication on the authorization server, wherein the authorization request comprises a user identifier and a user password of a target user; receiving a re-carving token sent by an authorization server, wherein the re-carving token is generated by the authorization server based on a user identifier and a user password under the condition that the target user passes the authentication, and the re-carving token is used for representing the legality of the target user; wherein the step of obtaining the user training text from the copy service server is performed in the event that the copy token is received.
Under the condition that the JS language module is adopted to realize the client operation, authentication can be carried out based on a user identifier (UserID) and a user password (UserSecret), and a re-engraving token is issued if the authentication is passed. The UserID and UserSecret may be entered into the copy client by the target user via a browser page.
According to the embodiment of the present invention, before obtaining the user training text from the duplication service server, the client operation further includes: uploading the text acquisition request and the re-engraving token to a re-engraving service server together so that the re-engraving service server or an authorization server verifies whether the re-engraving token is valid or not, wherein the text acquisition request is used for applying the re-engraving service server for acquiring a user training text; receiving a text feedback result aiming at the text acquisition request from the repeated carving service server, wherein the text feedback result comprises a user training text under the condition that the repeated carving token is effective, and the text feedback result comprises request failure information under the condition that the repeated carving token is ineffective; the step of obtaining the user training text from the copy service server comprises the following steps: and under the condition that the text feedback result comprises the user training text, acquiring the user training text from the text feedback result.
As shown in fig. 3a, the re-engraving SDK may present the re-engraving token when requesting the user training text from the re-engraving service server. Optionally, the re-engraving service server may upload the received token to the authorization server for validity verification. For example, the re-engraving token has a validity period, and the authorization server may compare the token's age against this validity period to determine whether the token has expired. If the verification passes, the user training text can be returned to the re-engraving SDK; if it fails, request failure information can be sent to the re-engraving SDK. The request failure information here indicates that the text acquisition request failed and may be, for example, a null result. Optionally, the re-engraving service server may also verify the validity of the token by itself. For example, the authorization server may send the encoding information corresponding to the token to the re-engraving service server, which then verifies the token's validity based on that encoding information.
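The text acquisition request could be sketched as follows, again with a hypothetical endpoint and payload; a valid token yields the user training text while an invalid one yields request failure information (modelled here as a null result).

```python
from typing import Optional
import requests

REENGRAVING_SERVER_URL = "https://reclone.example.com"   # hypothetical service address

def fetch_user_training_text(token: str) -> Optional[str]:
    """Upload the text acquisition request together with the re-engraving token."""
    resp = requests.post(f"{REENGRAVING_SERVER_URL}/training-text", json={"token": token})
    body = resp.json()
    if body.get("text") is None:                         # request failure information
        print("Text acquisition failed.")                # output text acquisition failure info
        return None
    return body["text"]                                  # the user training text
```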
According to the embodiment of the invention, the step of receiving the text feedback result for the text acquisition request from the repeated carving service server comprises the following steps: calling back the text feedback result through a third callback function.
The advantages of the callback function have been described above and will not be described in further detail here.
According to the embodiment of the present invention, after receiving the text feedback result for the text acquisition request from the duplication service server, the client operation further includes: and outputting text acquisition failure information under the condition that the text feedback result comprises request failure information.
The output text acquisition failure information may be for viewing by the target user. If the user fails to acquire the training text, corresponding prompt information can be output so that the user can obtain timely feedback.
According to the embodiment of the invention, before uploading the user recording file to the model training server directly or via the copy service server, the client operation further comprises: uploading the model training request and the re-carving token to a re-carving service server together so as to verify whether the re-carving token is valid or not by the re-carving service server or an authorization server, wherein the model training request is used for applying for distributing model training resources to the re-carving service server; receiving a resource feedback result aiming at the model training request from the repeated carving service server, wherein the resource feedback result comprises request passing information or request failure information; and the step of uploading the user audio file to the model training server directly or through the copy service server is executed under the condition that the resource feedback result comprises the request passing information.
Referring to fig. 3b, the re-engraving SDK may apply for model training resources before formally starting the recording. The copy service server verifies whether the copy token is valid, either by itself or through the authorization server. Optionally, in the case that the copy token is valid, it may further be determined whether the current target application or current target user is qualified to use a new model training resource, and a corresponding resource feedback result is returned. If the current target application or current target user is qualified, the resource feedback result includes request pass information indicating that the model training request passes; conversely, if it is not qualified, the resource feedback result includes request failure information indicating that the model training request failed. Of course, in the case where the copy token is invalid, the resource feedback result also includes request failure information.
Alternatively, the duplication service server may generate a model training identifier (model SID) in case the current target application or current target user is qualified to use the new model training resource. Illustratively, the model SID may include a model ID and a timestamp. The model ID may be, for example, a 32-bit identifier and the timestamp may be a 13-digit code. The model ID is the model identifier of the personalized speech synthesis model. The model SID can be regarded as the identifier of the model training task and serves as a unique identifier of each model training task.
When the copy service server receives the model training request, it may obtain the authority information of the current target application or current target user from the authorization server. The authority information may include re-engraving count information and the like. Illustratively, the re-engraving count information may include a re-engraving count threshold and the number of re-engraving operations already performed. After acquiring this information, the copy service server can calculate the remaining re-engraving count by subtracting the number already performed from the re-engraving count threshold. In addition, the copy service server can cache the number of model training tasks related to the target application or target user that are currently being executed (referred to as the training task number for short) and compare the remaining re-engraving count with the training task number: if the remaining count is greater than the training task number, it determines that the current target application or current target user is qualified to use a new model training resource, generates a new model SID, and assigns the generated model SID to the re-engraving SDK. Conversely, if the remaining count is less than or equal to the training task number, it may determine that the current target application or current target user is not eligible to use a new model training resource.
For example, for application C, the authorization server stores two values, 10 and 8: the former is the re-engraving count threshold and the latter is the number of re-engraving operations already performed, so application C can perform re-engraving twice more. Suppose application C has two users currently performing model training, and a third user also wants to use the voice re-engraving function of application C. Application C will send a model training request associated with the third user to the copy service server. The copy service server acquires the re-engraving count threshold and the used count from the authorization server, finds that the remaining re-engraving count is 2 and that the number of training tasks currently being executed is also 2, and therefore determines that a model training resource cannot be allocated to the third user. Each time a model training task is completed, the count stored in the authorization server can be updated promptly, i.e. incremented by 1.
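The eligibility check and model SID generation described above might be sketched as follows; the SID layout (32-character model ID plus 13-digit timestamp, joined by a separator) follows the example given earlier, but the concrete encoding is an assumption.

```python
import time
import uuid
from typing import Optional

def remaining_reengraving_count(threshold: int, used: int) -> int:
    """Remaining count = re-engraving count threshold minus the count already used."""
    return threshold - used

def allocate_training_resource(threshold: int, used: int, running_tasks: int) -> Optional[str]:
    """Return a new model training identifier (model SID) if the application or user
    is still eligible for a model training resource, otherwise None."""
    if remaining_reengraving_count(threshold, used) <= running_tasks:
        return None                                   # not eligible: request failure information
    model_id = uuid.uuid4().hex                       # 32-character model identifier
    timestamp = str(int(time.time() * 1000))          # 13-digit timestamp
    return f"{model_id}:{timestamp}"

# Application C: threshold 10, used 8, two training tasks already running -> no resource
print(allocate_training_resource(10, 8, 2))           # None
print(allocate_training_resource(10, 8, 1))           # a fresh model SID
```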
According to the embodiment of the invention, the receiving of the resource feedback result aiming at the model training request from the repeated carving service server comprises the following steps: and calling back the resource feedback result through a fourth call-back function.
The advantages of the callback function have been described above and will not be described in further detail here.
According to an embodiment of the present invention, after uploading the user's voice recording file to the model training server directly or via the duplication service server, and before receiving training result information of the personalized speech synthesis model from the model training server directly or via the duplication service server, the client operation further includes: and uploading the training opening request and the re-carving token to a re-carving service server together so as to verify whether the re-carving token is valid or not by the re-carving service server or an authorization server, wherein the training opening request is used for applying for opening model training to the re-carving service server, and the model training server starts to train the personalized speech synthesis model under the condition that the re-carving token is valid.
Referring to fig. 3b, after the recording is completed the re-engraving SDK may send a training start request to control the start of model training. This is merely an example and not a limitation of the present invention; for example, the copy service server or the model training server may automatically initiate model training after the formal recording is completed.
According to the embodiment of the invention, the receiving of the training result information of the personalized speech synthesis model from the model training server directly or via the duplication service server comprises: and calling back the training result information through a fifth calling back function.
The result information of the model training can be returned directly to the repeated carving SDK by the model training server. Preferably, the model training server sends the training result information to the repeated carving service server, and the repeated carving service server forwards it to the repeated carving SDK.
The advantages of the callback function have been described above and will not be described in further detail here.
According to an embodiment of the invention, the training result information comprises a model identifier of the personalized speech synthesis model, the model identifier being created by the copy service server or the model training server.
As shown in fig. 3b, the model ID may be returned to the replication client, which is stored by the replication client. The model ID and/or model SID may each be created by a duplication service server or a model training server.
According to an embodiment of the invention, the client operation further comprises: the user identifier of the target user is uploaded to the duplication service server to associate the user identifier with the model identifier by the duplication service server.
The user ID may be represented by a UserID or a QueryID (as shown in fig. 3 a). The user ID may be uploaded to the copy service server at any time, preferably included in the model training request or training start request, i.e. uploaded when applying for model training resources or applying for starting model training. The server associates the model ID with the user ID to record to which user each speech synthesis model belongs. This facilitates retrieval of a personalized speech synthesis model associated with the user requesting synthesis when speech synthesis is subsequently performed.
According to an embodiment of the present invention, the client operation is performed by a re-engraving software development kit integrated in the target application, and before the client operation the method further includes: initializing the re-engraving software development kit.
Referring to fig. 3a, the initialization step of the re-engraving SDK is shown.
According to an embodiment of the present invention, the client operation is performed by a web page programming language loaded into the target application, and before the client operation, the method further includes: receiving a webpage browsing instruction which is input by a target user and is related to a target application; sending a link request to a re-engraving service server based on a webpage browsing instruction; receiving response information which is returned by the repeated etching service server and aims at the link request, wherein the response information comprises a webpage programming language; rendering a web page related to the target application based on the response information; receiving a model training instruction input by a target user in a webpage; wherein the client-side operations are performed in response to receipt of the model training instructions and through the web page programming language.
As described above, the voice re-engraving function can also be implemented in the JS language, in addition to being packaged as an SDK. The copy service server can be used to store and maintain the JS language module. The target application may provide a user interaction interface implemented in a browser. When the user interacts with the target application through an interaction interface (such as some operable controls) related to starting the voice re-engraving function (that is, the target application receives a web browsing instruction input by the target user), the target application may be triggered to send a link request to the copy service server. The link request may link to a web address associated with the voice re-engraving function. In response to the link request, the copy service server returns corresponding response information to the re-engraving client. Based on the response information, a webpage may be rendered, on which various information related to the voice re-engraving function and a user interaction interface (e.g., some operable controls) for assisting the user in using the function may be displayed. The response information returned by the copy service server may include a JS language module, which may be loaded into the target application during rendering so as to perform each step of the client operation described herein. The model training instructions may include, but are not limited to, a re-engraving instruction; when the user inputs the re-engraving instruction, the target application may begin executing the JS language module in response to it. The type and input mode of the model training instructions may be defined by the technical developer of the target application.
The operation of the copy service server side is described below.
According to another aspect of the invention, an interactive method for implementing personalized speech synthesis model training is provided. FIG. 4 shows a schematic flow diagram of an interactive method 400 for implementing personalized speech synthesis model training, according to one embodiment of the present invention. The interaction method 400 is applied to a copy service server. As shown in fig. 4, the interaction method 400 includes server-side operations including steps S410-S450.
In step S410, the user training text is sent to the copy client.
In step S420, a user recording file is received from the copy client.
In step S430, in a case that the text information included in the user training text matches the text information expressed by the user recording file, the user recording file is uploaded to the model training server, so that the personalized speech synthesis model specific to the target user is trained on the model training server based on the user recording file.
In step S440, training result information of the personalized speech synthesis model is received from the model training server.
In step S450, the training result information is sent to the copy client.
The operation of the duplication service server can be understood in conjunction with the above description, and is not described in detail here.
Referring back to FIG. 1, the collected user sound files may also be audio pre-processed prior to model training on the model training server. The audio pre-processing may be performed by a copy service server or a model training server. The audio pre-processing may include, but is not limited to, for example, framing, noise reduction, filtering, volume adjustment, etc., and those skilled in the art can understand the manner of audio pre-processing, which is not described herein.
According to the embodiment of the invention, the user training text comprises at least one text segment, the user recording file comprises at least one qualified voice segment file corresponding to the at least one text segment one by one, the client is used for collecting a temporary voice segment file corresponding to a current text segment in the at least one text segment, and the receiving the user recording file from the copy client comprises: receiving a temporary voice segment file from a re-engraving client; prior to step S430, the server-side operations further include: carrying out voice recognition on the temporary voice segment file; judging whether the text information contained in the current text segment is matched with the text information expressed by the temporary voice segment file or not based on the voice recognition result; sending a matching result about whether the text information contained in the current text segment is matched with the text information expressed by the temporary voice segment file to the re-engraving client; and under the condition that the text information contained in the current text segment is matched with the text information expressed by the temporary voice segment file, determining that the temporary voice segment file is a qualified voice segment file corresponding to the current text segment.
The embodiment in which the voice-text matching operation is performed at the copy service server has been described above and is not repeated here.
According to the embodiment of the present invention, judging whether the text information contained in the current text segment matches the text information expressed by the temporary speech segment file based on the speech recognition result comprises: and if the matching rate between the text information contained in the current text segment and the text information expressed by the temporary voice segment file exceeds a specific matching threshold, determining that the text information contained in the current text segment is matched with the text information expressed by the temporary voice segment file, otherwise, determining that the text information contained in the current text segment is not matched with the text information expressed by the temporary voice segment file.
The specific matching threshold can be set to any suitable value as desired and is not limited by the present invention. For example, the specific matching threshold may be 90%, 95%, or 100%. During voice re-engraving, the voice-text matching operation determines whether the target user's pronunciation must be completely consistent with the user training text, or only mostly consistent with it, before subsequent operations may proceed. The strictness of the requirement on the user's pronunciation can thus be adjusted through the specific matching threshold: a smaller threshold imposes a looser requirement and gives a better user experience, while a larger threshold yields a more accurate user recording file, so that subsequent training produces a personalized speech synthesis model with higher synthesis precision.
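As a sketch of the matching decision, the snippet below uses a generic string-similarity ratio as a stand-in matching rate; the present invention does not prescribe how the matching rate is computed, only that it is compared against the specific matching threshold.

```python
import difflib

def text_matches(segment_text: str, recognized_text: str, match_threshold: float = 0.95) -> bool:
    """Return True if the recognized speech matches the current text segment."""
    match_rate = difflib.SequenceMatcher(None, segment_text, recognized_text).ratio()
    return match_rate >= match_threshold

# A stricter threshold demands near-perfect pronunciation; a looser one tolerates small slips.
print(text_matches("the quick brown fox", "the quick brown fox"))        # True
print(text_matches("the quick brown fox", "the quick fox", 0.95))        # False
```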
According to the embodiment of the invention, the step of sending the user training text to the repeated carving client comprises the following steps: at least one preset training text corresponding to the target application on the re-carving client is sent to the re-carving client, and the at least one preset training text has a corresponding preset matching threshold value; before judging whether the text information contained in the current text segment is matched with the text information expressed by the temporary speech segment file based on the speech recognition result, the server-side operation further comprises: receiving a user training text or identification information of the user training text from the repeated carving client, wherein the user training text is selected and obtained from at least one preset training text by the repeated carving client; and determining a preset matching threshold corresponding to the training text of the user as a specific matching threshold. The target application is the application described above that interacts with the copy server to enable client operation.
As described above, the copy service server may allocate multiple sets of training texts to the target application, and the target application sets which set of training texts is allocated to the user. In this case, before, after or simultaneously with uploading the user recording file to the rescale service server, the rescale client may also upload the user training text finally provided by the target application to the target user or the identification information of the user training text to the rescale service server, so that the rescale service server can know which set of training text the target application selects to provide to the user, and further the rescale service server can find out the correct user training text for text comparison when performing the voice text matching operation.
In addition, the same matching threshold value can be adopted for different preset training texts, and different matching threshold values can also be adopted. Different preset training texts adopt different matching thresholds, which is beneficial to formulating a more targeted matching standard according to the text content and the matching requirement so as to obtain a more targeted, more accurate and more efficient matching result. The copy service server may store a preset matching threshold corresponding to each set of preset training text. After the training text finally selected by the target application is determined, the repeated carving service server can find out a corresponding preset matching threshold value and perform voice text matching operation based on the threshold value.
According to the embodiment of the invention, the step of sending the matching result of whether the text information contained in the current text segment is matched with the text information expressed by the temporary voice segment file to the re-engraving client comprises the following steps: and calling a first callback function provided by the re-engraving client so as to send the matching result to the re-engraving client through the first callback function.
According to the embodiment of the invention, the step of sending the user training text to the repeated carving client comprises the following steps: sending at least one preset training text corresponding to the target application on the re-carving client to the re-carving client; before uploading the user audio file to the model training server, the server-side operations further comprise: and receiving the user training text or the identification information of the user training text from the repeated carving client, wherein the user training text is selected and obtained from at least one preset training text by the repeated carving client.
According to the embodiment of the present invention, before sending the user training text to the copy client, the server-side operation further includes: receiving a text acquisition request and a re-carving token uploaded by a re-carving client, wherein the text acquisition request is used for applying for acquiring a user training text, the re-carving token is generated by an authorization server based on a user identifier and a user password of a target user under the condition that the target user passes authentication, and the re-carving token is used for representing the legality of the target user; or the copy-on-copy token is generated by the authorization server based on the client identifier and the client password of the target application under the condition that the target application on the copy-on-copy client passes the authentication, and the copy-on-copy token is used for representing the legality of the target application; uploading the copy token to an authorization server so that the authorization server verifies whether the copy token is valid; receiving verification result information returned by the authorization server; and sending the user training text to the repeated carving client side under the condition that the verification result information shows that the repeated carving token is effective.
According to the embodiment of the invention, the step of sending the user training text to the repeated carving client comprises the following steps: and calling a third callback function provided by the re-engraving client so as to send the user training text to the re-engraving client through the third callback function.
According to the embodiment of the present invention, before receiving the user audio file from the copy client, the server-side operation further includes: receiving a model training request and a copy token uploaded by a copy client, wherein the model training request is used for applying for distributing model training resources, the copy token is generated by an authorization server based on a user identifier and a user password of a target user under the condition that the target user passes authentication, and the copy token is used for representing the legality of the target user; or the copy-on-copy token is generated by the authorization server based on the client identifier and the client password of the target application under the condition that the target application on the copy-on-copy client passes the authentication, and the copy-on-copy token is used for representing the legality of the target application; uploading the copy token to an authorization server so that the authorization server verifies whether the copy token is valid; receiving verification result information returned by the authorization server; under the condition that the verification result information indicates that the repeated-engraving token is valid, judging whether the target application or the target user is qualified to obtain the model training resource so as to obtain a resource feedback result, wherein the resource feedback result comprises request passing information or request failure information; sending the resource feedback result to the re-engraving client; and the step of receiving the user sound recording file from the re-engraving client is executed under the condition that the resource feedback result comprises the request passing information.
According to the embodiment of the present invention, before determining whether the target application or the target user is qualified to obtain the model training resource, the server-side operation further includes: receiving the copy right information of the target application or the target user from the authorization server; determining whether the target application or the target user is eligible for the model training resource includes: judging whether the target application or the target user is qualified to obtain the model training resource based on the repeated carving authority information.
According to the embodiment of the invention, the repeated engraving authority information comprises the repeated engraving frequency threshold value and the repeated engraving frequency of the target application or the target user, and the step of judging whether the target application or the target user is qualified to obtain the model training resource based on the repeated engraving authority information comprises the following steps: calculating the difference between the repeated engraving times threshold and the repeated engraving times to obtain the residual repeated engraving times; and comparing the residual repeated engraving times with the number of training tasks, if the residual repeated engraving times is greater than the number of training tasks, determining that the target application or the target user is qualified to obtain the model training resources, otherwise, determining that the target application or the target user is not qualified to obtain the model training resources, wherein the number of training tasks is the number of model training tasks which are currently executed and are related to the target application or the target user.
According to the embodiment of the invention, the step of sending the resource feedback result aiming at the model training request to the repeated engraving client comprises the following steps: and calling a fourth callback function provided by the re-engraving client so as to send the resource feedback result to the re-engraving client through the fourth callback function.
According to the embodiment of the present invention, before receiving the user audio file from the copy client, the server-side operation further includes: in the event that the target application or target user is eligible for a model training resource, a model training identifier is created, wherein the model training identifier includes a model identifier of the personalized speech synthesis model and a timestamp, the model training identifier being used for identifying a model training task related to training of the personalized speech synthesis model.
According to the embodiment of the present invention, before receiving the training result information of the personalized speech synthesis model from the model training server, the server-side operation further includes: receiving a training opening request and a re-carving token uploaded by a re-carving client, wherein the training opening request is used for applying for opening model training, the re-carving token is generated by an authorization server based on a user identifier and a user password of a target user under the condition that the target user passes authentication, and the re-carving token is used for representing the legality of the target user; or the copy-on-copy token is generated by the authorization server based on the client identifier and the client password of the target application under the condition that the target application on the copy-on-copy client passes the authentication, wherein the copy-on-copy token is used for representing the legality of the target application; uploading the copy token to an authorization server so that the authorization server verifies whether the copy token is valid; receiving verification result information returned by the authorization server; and under the condition that the verification result information indicates that the re-carving token is valid, sending a training starting notice to the model training server to inform the model training server to start the training of the personalized speech synthesis model.
According to the embodiment of the invention, the step of sending the training result information to the repeated engraving client comprises the following steps: and calling a fifth callback function provided by the re-engraving client so as to send the training result information to the re-engraving client through the fifth callback function.
According to the embodiment of the present invention, before sending the training result information to the duplication client, the server-side operation further includes: creating a model identifier for the personalized speech synthesis model; alternatively, receiving a model identifier from a model training server; wherein the training result information includes a model identifier.
According to the embodiment of the present invention, the server-side operation further includes: receiving a user identifier of a target user from a copy client; and associating the user identifier with the model identifier.
The operation of the model training server side is described below.
According to one aspect of the invention, a method for training a personalized speech synthesis model is provided. FIG. 5 shows a schematic flow diagram of a method 500 of training a personalized speech synthesis model according to one embodiment of the invention. As shown in fig. 5, the training method 500 of the personalized speech synthesis model includes steps S510, S520, S530, S540, and S550.
In step S510, a user audio file of the target user sent by the copy client directly or via the copy service server is received.
FIG. 6 illustrates a flow diagram for automated training of speech synthesis models and speech synthesis according to one embodiment of the present invention.
As shown in FIG. 6, the overall speech system may be divided, by way of example and not limitation, into at least three parts: a recording acquisition, automated training system (first system part); a cloud authorization service system (second system part); speech synthesis system (third system part). The first system part can be realized by the repeated carving client, the repeated carving service server and the model training server; the second system part may be implemented by the authorisation server; the third system part may be implemented by a composition service server (in conjunction with a model training server). The cloud authorization service system is optional.
Alternatively, any one or more of the duplication service server, the model training server, the authorization server, and the composition service server may be implemented using the same server. For example, the model training server and the authorization server may be the same server, the model training server and the duplication service server may be the same server, and so on. Optionally, the duplication service server, the model training server, the authorization server, and the composition service server may each be implemented by a separate server, and of course any one of them may be further divided into a plurality of sub-servers that cooperate with each other to implement the functions of that server.
As shown in fig. 6, in the first system part, the re-engraving client (specifically, the re-engraving SDK integrated in the target application or the JS language module loaded into the target application on the re-engraving client) uploads the user recording file to the re-engraving service server (corresponding to the re-engraving service in fig. 6) through the gateway portal. The user recording file is obtained by performing voice acquisition on the target user while the user training text is output by the re-engraving client. Specifically, the user training text may be preset and provided to the target user by the re-engraving client, i.e., displayed to the target user. The target user utters the corresponding speech according to the displayed user training text, and the re-engraving client records the user's voice to obtain the user recording file. Therefore, the speech content of the user recording file and the text content of the user training text should match. As described above, before uploading the user recording file to the model training server, the content of the user recording file and the content of the user training text may optionally be compared by the re-engraving client or the re-engraving service server, and if the two do not match, the user may be prompted to re-record until they do.
In step S520, a dynamically added specific compute node in the container cluster is scheduled.
Alternatively, the model training server may start to perform model training, i.e., start to perform step S520 and subsequent steps, in response to the reception of the user audio file. Alternatively, the model training server may start the model training after receiving the training start notification from the duplication service server or the duplication client, that is, start to execute step S520 and subsequent steps.
The model training server may be a cloud computing system that may be implemented using any existing or future clustering technique, including but not limited to docker clustering and the like. As shown in fig. 6, the cloud computing system may include an image repository, a container cluster, a storage system, and so forth. The image repository is used for storing training images; a training image is generated by packaging the training environment of a standard speech synthesis model, and the standard speech synthesis model is trained on the basis of sample recording files and corresponding sample texts. A sample recording file may be any pre-recorded recording of any person whose corresponding sample text is also known. Those skilled in the art can understand the meaning and usage of the sample recording file and the sample text, which are not described herein.
As shown in fig. 6, background research and development staff may train a standard speech synthesis model in advance using a sample recording file and a sample text of a sample speaker, package the entire training environment of the standard speech synthesis model into an image file (i.e., a training image) once the training is qualified, and then publish the training image to the image repository for storage. The training image may be constructed as an image file of any suitable format; by way of example and not limitation, it may be constructed as a docker image. Those skilled in the art will appreciate that the training environment may include information such as the application program used in training the speech synthesis model, the various codes executed, the configuration of the various parameters involved, and so on.
After the cloud computing system receives the training start notification from the copy service server, a dynamically added specific computing node in the container cluster can be scheduled through the cluster manager. The container cluster is capable of elastic scaling of computing power: it can automatically expand to obtain a new computing node when a new computing task arrives, and automatically release a computing node when its task is completed. The container cluster used for training can therefore scale elastically, so that the computing power of the training system can be dynamically expanded and large-scale automated training tasks can be supported.
Illustratively, each compute node may correspond to a machine, which may be an actual hardware device or a virtual machine.
In step S530, a training image is pulled from the image repository through a specific computing node, wherein the training image is generated by a training environment that packages a standard speech synthesis model that is trained based on a sample audio file and a corresponding sample text.
A large number of model training tasks requires a large number of servers. If each server installed its training environment independently, the workload would be large and automation impossible; packaging the training environment into a training image makes automatic installation of the training environment possible.
After the cluster manager schedules a new compute node, the training image may be pulled from the image repository by that compute node. Those skilled in the art understand how an image file is pulled, and the details are not described herein.
At step S540, a training image is run on a particular compute node to expand the training environment.
After the training image is pulled, it can be run automatically on the specific computing node, thereby expanding the training environment of the speech synthesis model.
Step S550: and performing speech synthesis model training on a specific computing node by using a training environment and a user recording file to obtain a personalized speech synthesis model which is exclusive to a target user.
The speech synthesis model is trained using the specific computing node. Specifically, after the training image is run on the specific computing node to expand the training environment, the text corresponding to the user recording file (to which information such as prosody and a user label may also be added) is used as the input of the speech synthesis model, and the user recording file is used as the target output of the speech synthesis model during training. The text input to the model may be the user training text or a speech recognition result obtained by recognizing the user recording file. Through training, a personalized speech synthesis model (which may also be referred to as a customized speech synthesis model) specific to the target user is obtained. Through this personalized speech synthesis model, arbitrary input text can be converted into speech consistent with the timbre and prosody of the target user.
Subsequently, the particular computing node may optionally store the trained personalized speech synthesis model in a storage system, which is a persistent storage. The speech synthesis system can read the personalized speech synthesis model from the storage system for speech synthesis when needed. After a particular computing node completes the training task for the personalized speech synthesis model of the target user, the cluster manager may release the particular computing node.
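The per-node workflow of steps S520-S550 could be orchestrated roughly as below, using docker as one possible container runtime (matching the docker example mentioned in the text); the registry address, image tag, mount paths, and the entry command inside the image are all assumptions.

```python
import subprocess

IMAGE_REPOSITORY = "registry.example.com/tts/train"   # hypothetical image repository
IMAGE_TAG = "standard-v1"                              # image packaging the standard training environment

def train_on_scheduled_node(recording_path: str, model_sid: str, storage_dir: str) -> None:
    """Sketch of one model training task on a dynamically added compute node."""
    # S520: the cluster manager has already scheduled this node for the task (not shown).
    # S530: pull the training image from the image repository.
    subprocess.run(["docker", "pull", f"{IMAGE_REPOSITORY}:{IMAGE_TAG}"], check=True)
    # S540/S550: run the image to expand the training environment, mounting the user
    # recording (read-only) and the persistent storage so the personalized model can
    # be written back after training; the in-container "train" command is assumed.
    subprocess.run([
        "docker", "run", "--rm",
        "-v", f"{recording_path}:/data/recording:ro",
        "-v", f"{storage_dir}:/models",
        f"{IMAGE_REPOSITORY}:{IMAGE_TAG}",
        "train", "--recording", "/data/recording", "--output", f"/models/{model_sid}",
    ], check=True)
    # Afterwards the cluster manager can release this node (elastic scale-in, not shown).
```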
According to the method provided by the embodiment of the invention, the training environment of the speech synthesis model is packaged into a training image, new computing nodes are dynamically added when needed, and the training image is automatically pulled by the computing nodes to carry out the training of the personalized speech synthesis model exclusive to the target user. The scheme has great application prospects and market value in the field of speech processing.
According to an embodiment of the present invention, after step S550, the method 500 may further include: pushing the personalized speech synthesis model to a storage system for storage; and removing the particular computing node from the container cluster.
Once training of the personalized speech synthesis model is completed, the specific computing node can be released, which helps to better achieve elastic scaling of computing power. Combining container technology with elastic cloud computing capacity allows the computing power required for training to be called dynamically, taken whenever needed and released after use, so that the large computing demand of peak periods can be met while unnecessary resource waste during off-peak periods is avoided.
According to an embodiment of the present invention, the method 500 may further include: receiving a new training mirror image sent by mirror image generation equipment; and pushing the new training mirror image to a mirror image warehouse so as to update the training mirror image in the mirror image warehouse.
The image generation device is a device used by a background developer and may be any suitable device including, but not limited to, a personal computer, a smartphone, a tablet computer, or some server device, among others.
When the training environment of the speech synthesis model needs to be updated, background research and development personnel can package the new training environment into a new training image, which is published to the image repository in the cloud after testing passes. The model training server receives the new training image, pushes it into the image repository, and replaces the original training image, thereby updating it.
By the scheme, the training mirror image can be updated very conveniently.
According to the embodiment of the present invention, step S550 includes: performing voice recognition on the user recording file to obtain a voice recognition result corresponding to the user recording file, wherein the voice recognition result comprises a text, a rhythm and a user tag; and performing speech synthesis model training through the user recording file and the speech recognition result to obtain an individualized speech synthesis model.
Prosody refers to information such as initials, finals, pauses and the like. The user tag may include tag information such as the user's age, gender, etc.
As described above, the user's voice recording file may be speech recognized by the model training server, and then model training may be performed based on the recognition result and the user's voice recording file.
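One way to assemble the training inputs from the recognition result is sketched below; the recognizer output format and field names are assumptions, since the text only states that the result contains text, prosody, and a user tag.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TrainingSample:
    text: str                   # recognized text, used as model input
    prosody: str                # e.g. initials/finals and pause annotations
    user_tags: Dict[str, str]   # e.g. {"age": "adult", "gender": "female"}
    audio_path: str             # the corresponding recording segment, the training target

def build_samples(asr_results: List[dict], audio_paths: List[str],
                  user_tags: Dict[str, str]) -> List[TrainingSample]:
    """Pair each recognition result with its recording segment."""
    return [TrainingSample(r["text"], r["prosody"], user_tags, path)
            for r, path in zip(asr_results, audio_paths)]
```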
According to an embodiment of the present invention, after step S510, the method 500 may further include: and pushing the user recording file to a storage system for storage.
Besides the trained personalized speech synthesis model, the user recording file used for training can also be stored persistently. Therefore, the user recording file can be read and used at any time in the training process, and the user recording file can be used for more application scenes subsequently.
According to an embodiment of the present invention, before scheduling the dynamically added specific computing node in the container cluster, the method 500 further includes: receiving a training start notification sent by the copy service server; the step of scheduling the dynamically added specific computing node in the container cluster is performed in response to receipt of the training start notification. This embodiment can be understood with reference to the above description and is not repeated here.
According to an embodiment of the present invention, after performing speech synthesis model training on the specific computing node using the training environment and the user recording file, the method 500 further includes: sending training result information of the personalized speech synthesis model to the copy client or the copy service server.
After training of the personalized speech synthesis model of the target user is completed, the training result information can be sent to the copy client or the copy service server, so that the copy client outputs feedback information to the target user on whether training is completed. The target user may then perform speech synthesis at any time via the synthesis client (e.g., a synthesis SDK integrated in a target application on the synthesis client). The synthesis client may be the same client as the copy client described above or a different client. The target application on the synthesis client and the target application on the copy client may be the same application or different applications.
According to another aspect of the present invention, a speech synthesis method is provided. FIG. 7 shows a schematic flow diagram of a speech synthesis method 700 according to one embodiment of the invention. As shown in fig. 7, the speech synthesis method 700 includes steps S710 and S720. Optionally, the speech synthesis method 700 may be applied to the model training server; in this case, a synthesis service server similar to the copy service server is mainly used to interact with the synthesis client and the model training server and to relay information. Alternatively, the speech synthesis method 700 may also be applied to the synthesis service server, i.e., be performed by the synthesis service server itself.
In step S710, a text to be synthesized of the target user sent by the synthesis client directly or via the synthesis service server is received.
In step S720, the text to be synthesized is input into the personalized speech synthesis model of the target user obtained by training with the training method 500 for speech synthesis, so as to obtain the target speech corresponding to the text to be synthesized.
As shown in fig. 6, in a speech synthesis system, a synthesis request may be sent by the synthesis SDK on the synthesis client to the synthesis service server (shown as "synthesis service" in fig. 6) via a gateway portal. The synthesis service server may invoke a synthesis engine that reads the personalized speech synthesis model from the storage system for speech synthesis.
According to an embodiment of the present invention, before the text to be synthesized is input into the personalized speech synthesis model of the target user obtained by training with the training method 500 for speech synthesis, the method 700 further includes: receiving a synthesis request and a synthesis token sent by the synthesis client directly or via the synthesis service server, wherein the synthesis request is used to apply for speech synthesis permission, and the synthesis token is generated by an authorization server based on the user identifier and user password of the target user when the target user passes authentication, in which case the synthesis token represents the legitimacy of the target user, or the synthesis token is generated by the authorization server based on the client identifier and client password of the target application when the target application on the synthesis client passes authentication, in which case the synthesis token represents the legitimacy of the target application; uploading the synthesis token to the authorization server so that the authorization server verifies whether the synthesis token is valid; and receiving verification result information returned by the authorization server; the step of inputting the text to be synthesized into the personalized speech synthesis model of the target user obtained by training with the training method 500 for speech synthesis is performed when the synthesis token is valid.
As shown in fig. 6, the legitimacy of the target application or the target user may also be verified by the authorization server during speech synthesis. Initially, the synthesis client may send the client identifier and client password of its target application, or the user identifier and user password of the target user, to the authorization server directly or via the synthesis service server to authenticate the target application or the target user, and the synthesis token is issued only after the authentication is passed. The issuing and verification of the synthesis token can be understood with reference to the corresponding description of token issuing and verification above, and is not repeated here.
After the synthesis client obtains the synthesis token, it may upload the synthesis token to the synthesis service server together with the synthesis request. Once the synthesis token is verified as valid, the synthesis service server can perform speech synthesis itself, or further upload the text to be synthesized to the model training server, which then performs the speech synthesis.
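As a purely illustrative sketch, a JWT-style synthesis token could be issued and verified as shown below; the patent does not mandate any token format, so the claim fields, signing key and one-hour expiry are assumptions.

```python
# A minimal sketch of synthesis-token issuance and verification, assuming a
# JWT-style token signed by the authorization server (PyJWT library).
import time
import jwt

SECRET = "authorization-server-signing-key"  # placeholder signing key

def issue_synthesis_token(subject_id: str, subject_type: str) -> str:
    # subject_type is "user" (user identifier/password authenticated) or
    # "application" (client identifier/password authenticated).
    claims = {"sub": subject_id, "typ": subject_type, "exp": int(time.time()) + 3600}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_synthesis_token(token: str) -> bool:
    # Returns True only if the token was signed by the authorization server and has not expired.
    try:
        jwt.decode(token, SECRET, algorithms=["HS256"])
        return True
    except jwt.InvalidTokenError:
        return False
```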
According to an embodiment of the present invention, before the text to be synthesized is input into the personalized speech synthesis model of the target user obtained by training with the training method 500 for speech synthesis, the method 700 further includes: receiving, from the authorization server, synthesis permission information related to the target application or the target user on the synthesis client, the synthesis permission information including one or more of a synthesis count threshold, a number of completed syntheses, a synthesis word count threshold, a synthesis interface concurrency and an upper limit on the number of synthesis interface calls; judging whether the target application or the target user is qualified for speech synthesis based on the synthesis permission information; the step of inputting the text to be synthesized into the personalized speech synthesis model of the target user for speech synthesis is performed when the target application or the target user is qualified for speech synthesis.
The meaning and usage of the synthesis count threshold and the number of completed syntheses can be understood by analogy with the meaning and usage of the corresponding copy count threshold and number of completed copies described above, and are not repeated here.
The synthesis word count threshold refers to a limit on the maximum number of words that a target application or target user can synthesize. For example, if an application, or a user within that application, is limited to synthesizing at most 1000 words and the number of words already synthesized for that application or user reaches this limit, no further synthesis resources are allocated to it, i.e., it is not allowed to continue synthesizing.
The synthesis interface concurrency (synthesis QPS) and the upper limit on the number of synthesis interface calls are limits on the use of the synthesis interface of the speech synthesis server. Those skilled in the art can understand the meaning of the synthesis QPS and the number of synthesis interface calls, which are not described here. The speech synthesis server is the server that performs steps S710 and S720, and may be one or a combination of the synthesis service server, the model training server and other separate servers, as long as it can call the personalized speech synthesis model for speech synthesis.
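Only as an illustrative sketch of the qualification check described above; the patent names the kinds of limits but no data format, so the field names below are assumptions.

```python
from dataclasses import dataclass

@dataclass
class SynthesisPermission:
    synthesis_count_threshold: int   # maximum number of synthesis operations allowed
    syntheses_performed: int         # number of syntheses already performed
    word_count_threshold: int        # maximum total number of words that may be synthesized
    words_synthesized: int
    max_qps: int                     # synthesis interface concurrency limit
    max_interface_calls: int         # upper limit on synthesis interface calls
    interface_calls_made: int

def qualified_for_synthesis(p: SynthesisPermission, current_qps: int, new_words: int) -> bool:
    # The target application or user is qualified only if every limit is respected.
    if p.syntheses_performed >= p.synthesis_count_threshold:
        return False
    if p.words_synthesized + new_words > p.word_count_threshold:
        return False
    if current_qps >= p.max_qps:
        return False
    if p.interface_calls_made >= p.max_interface_calls:
        return False
    return True
```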
Further, as shown in fig. 6, an authorization server (including the authentication service and the authorization management back office shown in fig. 6) may store authorization information. The authorization information may include the copy permission information and/or the synthesis permission information described above.
According to an embodiment of the present invention, inputting the text to be synthesized into the personalized speech synthesis model of the target user obtained by training with the training method 500 for speech synthesis includes: calling a synthesis engine; obtaining the personalized speech synthesis model from the storage system via the synthesis engine; and inputting the text to be synthesized into the personalized speech synthesis model through the synthesis engine for speech synthesis.
Referring to fig. 6, an example of the invocation of the synthesis engine and the speech synthesis by the synthesis engine is shown and will not be described herein.
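Purely for illustration, an engine wrapper following these three steps might look like the sketch below; SynthesisEngine, its storage client and the inference placeholder are hypothetical and not an API defined by the patent.

```python
class SynthesisEngine:
    """Hypothetical engine wrapper; the patent does not define an engine API."""

    def __init__(self, storage_client):
        self.storage = storage_client   # client for the storage system holding trained models
        self.model_cache = {}

    def load_model(self, user_id: str):
        # Obtain the personalized speech synthesis model from the storage system (cached per user).
        if user_id not in self.model_cache:
            self.model_cache[user_id] = self.storage.download(f"models/{user_id}/tts_model.bin")
        return self.model_cache[user_id]

    def synthesize(self, user_id: str, text: str) -> bytes:
        # Input the text to be synthesized into the personalized model for speech synthesis.
        model = self.load_model(user_id)
        return self._run_inference(model, text)

    def _run_inference(self, model, text: str) -> bytes:
        # Placeholder for acoustic-model and vocoder inference on the text to be synthesized.
        raise NotImplementedError
```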
Illustratively, the servers described herein (including the model training server, the copy service server, the synthesis service server, the authorization server, etc.) may provide clients (including the copy client and the synthesis client) with various access capabilities such as an SDK, the Hypertext Transfer Protocol (HTTP), the Media Resource Control Protocol (MRCP), a Representational State Transfer application programming interface (REST API), a WebSocket API, and the like.
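As one illustrative possibility, HTTP access to a synthesis service could look like the sketch below; the endpoint path, parameter names and response format are assumptions for the example, not an interface defined by the patent.

```python
# A minimal sketch of HTTP access to a synthesis service using the requests library.
import requests

def synthesize_over_http(base_url: str, token: str, user_id: str, text: str) -> bytes:
    resp = requests.post(
        f"{base_url}/v1/tts/synthesize",               # placeholder endpoint
        headers={"Authorization": f"Bearer {token}"},  # synthesis token from the authorization server
        json={"user_id": user_id, "text": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content  # synthesized audio bytes
```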
According to another aspect of the present invention, a method for training a personalized speech synthesis model is provided, which is used for a server (including a copy service server and/or a model training server), and may include the above-mentioned interactive method 400 for implementing personalized speech synthesis model training and the above-mentioned training method 500 for personalized speech synthesis model.
According to another aspect of the present invention, there is provided a speech synthesis method including: receiving a text to be synthesized of a target user sent by a synthesis client directly or via a synthesis service server; and inputting the text to be synthesized into the personalized speech synthesis model of the target user obtained by training with the above method for training a personalized speech synthesis model, so as to obtain the target speech corresponding to the text to be synthesized. The speech synthesis method in this embodiment differs from the speech synthesis method 700 described above only in that the former uses a personalized speech synthesis model obtained with the above method for training a personalized speech synthesis model, whereas the latter uses a personalized speech synthesis model obtained with the training method 500; the two may be consistent in other respects.
According to another aspect of the invention, an interactive apparatus for implementing personalized speech synthesis model training is provided. FIG. 8 shows a schematic block diagram of an interactive apparatus 800 for implementing personalized speech synthesis model training according to one embodiment of the present invention.
As shown in fig. 8, the interactive apparatus 800 for implementing personalized speech synthesis model training according to the embodiment of the present invention includes a client operation module (not explicitly shown), which includes an obtaining sub-module 810, a first output sub-module 820, a collecting sub-module 830, an uploading sub-module 840, a receiving sub-module 850 and a second output sub-module 860. The various modules may respectively perform the steps/functions of the interactive method 200 for implementing personalized speech synthesis model training described above in connection with fig. 1-3b. Only the main functions of the components of the interactive apparatus 800 for implementing personalized speech synthesis model training are described below, and details already described above are omitted.
The obtaining sub-module 810 is configured to obtain the user training text from the copy service server.
The first output sub-module 820 is used for outputting the user training text.
The collecting submodule 830 is configured to collect the voice of the target user to obtain a user recording file.
The uploading sub-module 840 is configured to, in a case where the text information included in the user training text matches the text information expressed by the user recording file, upload the user recording file to a model training server directly or via the copy service server, so as to train a personalized speech synthesis model specific to the target user on the model training server based on the user recording file.
The receiving sub-module 850 is configured to receive training result information of the personalized speech synthesis model from the model training server directly or via the copy service server.
The second output sub-module 860 is configured to output feedback information regarding whether the training of the personalized speech synthesis model is completed based on the training result information.
According to another aspect of the invention, an interactive system for implementing personalized speech synthesis model training is provided. FIG. 9 shows a schematic block diagram of an interactive system 900 for implementing personalized speech synthesis model training, according to one embodiment of the present invention. An interactive system 900 for implementing personalized speech synthesis model training includes a processor 910 and a memory 920.
The memory 920 stores computer program instructions for implementing corresponding steps in the interactive method 200 for personalized speech synthesis model training according to an embodiment of the present invention.
The processor 910 is configured to execute the computer program instructions stored in the memory 920 to execute the corresponding steps of the interactive method 200 for implementing personalized speech synthesis model training according to the embodiment of the present invention.
According to another aspect of the present invention, there is provided a storage medium on which program instructions are stored, which when executed by a computer or a processor are configured to perform the corresponding steps of the interaction method 200 for implementing personalized speech synthesis model training according to an embodiment of the present invention, and are configured to implement the corresponding modules in the interaction apparatus 800 for implementing personalized speech synthesis model training according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
According to another aspect of the invention, an interactive apparatus for implementing personalized speech synthesis model training is provided. FIG. 10 shows a schematic block diagram of an interactive apparatus 1000 for implementing personalized speech synthesis model training according to one embodiment of the present invention.
As shown in fig. 10, the interactive apparatus 1000 for implementing personalized speech synthesis model training according to the embodiment of the present invention includes a server-side operation module (not explicitly shown), which includes a first sending submodule 1010, a first receiving submodule 1020, an uploading submodule 1030, a second receiving submodule 1040, and a second sending submodule 1050. The various modules may perform the various steps/functions of the interactive method 400 for personalized speech synthesis model training described above in connection with fig. 4, respectively. Only the main functions of the components of the interactive apparatus 1000 for implementing personalized speech synthesis model training will be described below, and the details that have been described above will be omitted.
The first sending sub-module 1010 is configured to send the user training text to the copy client.
The first receiving sub-module 1020 is configured to receive the user recording file from the copy client.
The uploading sub-module 1030 is configured to upload the user recording file to a model training server under the condition that the text information included in the user training text matches the text information expressed by the user recording file, so as to train a personalized speech synthesis model specific to the target user on the model training server based on the user recording file.
The second receiving submodule 1040 is configured to receive training result information of the personalized speech synthesis model from the model training server.
The second sending sub-module 1050 is configured to send the training result information to the copy client.
According to another aspect of the invention, an interactive system for implementing personalized speech synthesis model training is provided. FIG. 11 shows a schematic block diagram of an interactive system 1100 for implementing personalized speech synthesis model training, according to one embodiment of the present invention. An interactive system 1100 for implementing personalized speech synthesis model training includes a processor 1110 and a memory 1120.
The memory 1120 stores computer program instructions for implementing corresponding steps in the interactive method 400 for personalized speech synthesis model training according to an embodiment of the present invention.
The processor 1110 is configured to execute the computer program instructions stored in the memory 1120 to perform the corresponding steps of the interactive method 400 for implementing personalized speech synthesis model training according to the embodiment of the present invention.
According to another aspect of the present invention, there is provided a storage medium on which program instructions are stored, which when executed by a computer or a processor are configured to perform the corresponding steps of the interaction method 400 for implementing personalized speech synthesis model training according to an embodiment of the present invention, and are configured to implement the corresponding modules in the interaction apparatus 1000 for implementing personalized speech synthesis model training according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
According to another aspect of the present invention, there is provided a training apparatus for personalized speech synthesis models. FIG. 12 shows a schematic block diagram of a training apparatus 1200 for personalized speech synthesis models according to one embodiment of the present invention.
As shown in fig. 12, the training apparatus 1200 for personalized speech synthesis model according to the embodiment of the present invention includes a receiving module 1210, a scheduling module 1220, a pulling module 1230, a running module 1240 and a training module 1250. The various modules may perform the various steps/functions of the method 500 for training a personalized speech synthesis model described above in connection with fig. 5, respectively. Only the main functions of the components of the training apparatus 1200 for personalized speech synthesis model will be described below, and details that have been described above will be omitted.
The receiving module 1210 is configured to receive a user recording file of a target user sent by the copy client directly or via the copy service server.
The scheduling module 1220 is used to schedule the dynamically added specific compute nodes in the container cluster.
The pulling module 1230 is configured to pull a training image from the image repository through the specific computing node, where the training image is generated by packaging the training environment of a standard speech synthesis model, the standard speech synthesis model being trained based on a sample recording file and a corresponding sample text.
The running module 1240 is configured to run the training image on the specific computing node to deploy the training environment.
The training module 1250 is configured to perform speech synthesis model training on the specific computing node using the training environment and the user recording file to obtain a personalized speech synthesis model specific to the target user.
According to another aspect of the present invention, a training system for personalized speech synthesis models is provided. FIG. 13 shows a schematic block diagram of a training system 1300 for personalized speech synthesis models, according to one embodiment of the present invention. The training system 1300 for personalized speech synthesis models includes a processor 1310 and a memory 1320.
The memory 1320 stores computer program instructions for implementing corresponding steps in the method 500 for training a personalized speech synthesis model according to an embodiment of the present invention.
The processor 1310 is configured to execute the computer program instructions stored in the memory 1320 to perform the corresponding steps of the method 500 for training a personalized speech synthesis model according to an embodiment of the present invention.
According to another aspect of the present invention, there is provided a storage medium on which program instructions are stored, which when executed by a computer or a processor are used for executing the corresponding steps of the training method 500 of the personalized speech synthesis model according to the embodiment of the present invention, and for implementing the corresponding modules in the training apparatus 1200 of the personalized speech synthesis model according to the embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
According to another aspect of the present invention, there is provided a speech synthesis apparatus. Fig. 14 shows a schematic block diagram of a speech synthesis apparatus 1400 according to an embodiment of the present invention.
As shown in fig. 14, the speech synthesis apparatus 1400 according to the embodiment of the present invention includes a receiving module 1410 and a synthesizing module 1420. The various modules may perform the various steps/functions of the speech synthesis method 700 described above in connection with fig. 7, respectively. Only the main functions of the respective components of the speech synthesis apparatus 1400 will be described below, and details that have been described above will be omitted.
The receiving module 1410 is configured to receive a text to be synthesized of a target user sent by a synthesis client directly or via a synthesis service server.
The synthesis module 1420 is configured to input the text to be synthesized into the personalized speech synthesis model of the target user obtained by training according to the training method 500 for speech synthesis, so as to obtain a target speech corresponding to the text to be synthesized.
According to another aspect of the present invention, a speech synthesis system is provided. FIG. 15 shows a schematic block diagram of a speech synthesis system 1500 according to one embodiment of the present invention. The speech synthesis system 1500 includes a processor 1510 and a memory 1520.
The memory 1520 stores computer program instructions for implementing the corresponding steps in the speech synthesis method 700 according to an embodiment of the present invention.
The processor 1510 is configured to execute computer program instructions stored in the memory 1520 to perform the corresponding steps of the speech synthesis method 700 according to an embodiment of the present invention.
According to another aspect of the present invention, there is provided a storage medium on which program instructions are stored, which when executed by a computer or a processor, are used for executing the respective steps of the speech synthesis method 700 of the embodiment of the present invention and for implementing the respective modules in the speech synthesis apparatus 1400 according to the embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of the interactive system for implementing personalized speech synthesis model training, the training system for personalized speech synthesis models or some of the modules in a speech synthesis system according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
The above description is only for the specific embodiment of the present invention or the description thereof, and the protection scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An interactive method for implementing personalized speech synthesis model training, comprising client operations, the client operations comprising:
acquiring a user training text from a copy service server;
outputting the user training text;
collecting the voice of a target user to obtain a user recording file;
when the text information contained in the user training text matches the text information expressed by the user recording file, uploading the user recording file to a model training server directly or via the copy service server, so as to train, on the model training server, a personalized speech synthesis model specific to the target user based on the user recording file;
receiving training result information of the personalized speech synthesis model from the model training server directly or via the copy service server; and
outputting, based on the training result information, feedback information on whether training of the personalized speech synthesis model is completed.
2. The method of claim 1, wherein the user training text comprises at least one text segment, and the user recording file comprises at least one qualified speech segment file in one-to-one correspondence with the at least one text segment,
wherein the collecting the voice of the target user to obtain the user recording file comprises:
step a: determining a first text segment in the user training text as a current text segment;
step b: collecting the voice of the target user to obtain a temporary speech segment file corresponding to the current text segment;
step c: when the text information contained in the current text segment matches the text information expressed by the temporary speech segment file, determining that the temporary speech segment file is a qualified speech segment file corresponding to the current text segment and judging whether the current text segment is the last text segment; if the current text segment is not the last text segment, determining the next text segment as the new current text segment and returning to step b; if the current text segment is the last text segment, determining that the user recording file has been completely collected;
step d: when the text information contained in the current text segment does not match the text information expressed by the temporary speech segment file, if a first preset condition is met, discarding the temporary speech segment file and returning to step b; if the first preset condition is not met, outputting a recording failure prompt.
3. The method of claim 2, wherein the first preset condition comprises: the number of times step b is performed for the current text segment does not exceed a first number threshold.
4. An interactive method for implementing personalized speech synthesis model training, comprising server-side operations, the server-side operations comprising:
sending a user training text to a copy client;
receiving a user recording file from the copy client;
when the text information contained in the user training text matches the text information expressed by the user recording file, uploading the user recording file to a model training server, so as to train, on the model training server, a personalized speech synthesis model specific to the target user based on the user recording file;
receiving training result information of the personalized speech synthesis model from the model training server; and
sending the training result information to the copy client.
5. An interactive apparatus for implementing personalized speech synthesis model training, comprising a client operation module, the client operation module comprising:
an obtaining submodule, for obtaining a user training text from a copy service server;
a first output submodule, for outputting the user training text;
a collecting submodule, for collecting the voice of a target user to obtain a user recording file;
an uploading submodule, for uploading the user recording file to a model training server directly or via the copy service server when the text information contained in the user training text matches the text information expressed by the user recording file, so as to train, on the model training server, a personalized speech synthesis model specific to the target user based on the user recording file;
a receiving submodule, for receiving training result information of the personalized speech synthesis model from the model training server directly or via the copy service server; and
a second output submodule, for outputting, based on the training result information, feedback information on whether training of the personalized speech synthesis model is completed.
6. An interactive system for implementing personalized speech synthesis model training, comprising a processor and a memory, wherein the memory has stored therein computer program instructions for executing the interactive method for implementing personalized speech synthesis model training according to any of claims 1 to 3 when the computer program instructions are executed by the processor.
7. A storage medium on which program instructions are stored, which program instructions are adapted, when executed, to perform an interactive method for enabling personalized speech synthesis model training according to any of claims 1 to 3.
8. An interactive apparatus for implementing personalized speech synthesis model training, comprising a server-side operation module, the server-side operation module comprising:
a first sending submodule, for sending a user training text to a copy client;
a first receiving submodule, for receiving a user recording file from the copy client;
an uploading submodule, for uploading the user recording file to a model training server when the text information contained in the user training text matches the text information expressed by the user recording file, so as to train, on the model training server, a personalized speech synthesis model specific to the target user based on the user recording file;
a second receiving submodule, for receiving training result information of the personalized speech synthesis model from the model training server; and
a second sending submodule, for sending the training result information to the copy client.
9. An interactive system for enabling personalized speech synthesis model training, comprising a processor and a memory, wherein the memory has stored therein computer program instructions for, when executed by the processor, performing the interactive method for enabling personalized speech synthesis model training according to claim 4.
10. A storage medium on which program instructions are stored, which program instructions are adapted, when executed, to perform the interactive method for enabling personalized speech synthesis model training according to claim 4.
CN202110452288.6A 2021-04-26 2021-04-26 Interactive method, device, system and medium for training speech synthesis model Active CN113241057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110452288.6A CN113241057B (en) 2021-04-26 2021-04-26 Interactive method, device, system and medium for training speech synthesis model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110452288.6A CN113241057B (en) 2021-04-26 2021-04-26 Interactive method, device, system and medium for training speech synthesis model

Publications (2)

Publication Number Publication Date
CN113241057A true CN113241057A (en) 2021-08-10
CN113241057B CN113241057B (en) 2024-06-18

Family

ID=77129404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110452288.6A Active CN113241057B (en) 2021-04-26 2021-04-26 Interactive method, device, system and medium for training speech synthesis model

Country Status (1)

Country Link
CN (1) CN113241057B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198828A (en) * 2013-04-03 2013-07-10 中金数据系统有限公司 Method and system of construction of voice corpus
WO2014182462A1 (en) * 2013-05-10 2014-11-13 Qualcomm Incorporated Method, device and computer-program product for noise characteristic dependent speech enhancement
CN106610810A (en) * 2016-12-06 2017-05-03 深圳市全智达科技有限公司 Voice inputting method and apparatus
CN108962284A (en) * 2018-07-04 2018-12-07 科大讯飞股份有限公司 A kind of voice recording method and device
US10853696B1 (en) * 2019-04-11 2020-12-01 Facebook, Inc. Evaluation of content items against policies regulating content presentation by an online system using machine learning
CN110060656A (en) * 2019-05-05 2019-07-26 标贝(深圳)科技有限公司 Model management and phoneme synthesizing method, device and system and storage medium
CN110473525A (en) * 2019-09-16 2019-11-19 百度在线网络技术(北京)有限公司 The method and apparatus for obtaining voice training sample
CN110751940A (en) * 2019-09-16 2020-02-04 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for generating voice packet

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116841523A (en) * 2023-07-19 2023-10-03 上海海启科技有限公司 Online programming method and system based on artificial intelligence
CN116841523B (en) * 2023-07-19 2023-12-22 上海海启科技有限公司 Online programming method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN113241057B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
US10224061B2 (en) Voice signal component forecasting
US20170046124A1 (en) Responding to Human Spoken Audio Based on User Input
US11955125B2 (en) Smart speaker and operation method thereof
US8126716B2 (en) Method and system for collecting audio prompts in a dynamically generated voice application
CN111599343B (en) Method, apparatus, device and medium for generating audio
JP5813767B2 (en) Media recognition and synchronization to motion signals
JP6785904B2 (en) Information push method and equipment
CN110060656B (en) Model management and speech synthesis method, device and system and storage medium
KR20060034144A (en) Mobile communication terminal capable of reproducing and updating multimedia content, and method for reproducing the same
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN110717337A (en) Information processing method, device, computing equipment and storage medium
CN110290280B (en) Terminal state identification method and device and storage medium
US10755707B2 (en) Selectively blacklisting audio to improve digital assistant behavior
CN113672748A (en) Multimedia information playing method and device
CN1801322B (en) Method and system for transcribing speech on demand using a transcription portlet
CN114121028A (en) Voice playing method, device, equipment and storage medium
CN105577603B (en) A kind of method and device playing Multimedia Message
CN111399798A (en) Vehicle-mounted voice assistant personalized realization method, system, medium and vehicle-mounted equipment
CN110659387A (en) Method and apparatus for providing video
CN113241057A (en) Interactive method, apparatus, system and medium for speech synthesis model training
CN113241056B (en) Training and speech synthesis method, device, system and medium for speech synthesis model
CN115934263A (en) Data processing method and device, computer equipment and storage medium
US10834466B1 (en) Virtual interactivity for a broadcast content-delivery medium
US10681402B2 (en) Providing relevant and authentic channel content to users based on user persona and interest
WO2023029984A1 (en) Video generation method and apparatus, terminal, server, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Country or region after: China

Address after: Room 1201, Building B, Phase 1, Innovation Park, No. 1 Keyuan Weiyi Road, Laoshan District, Qingdao City, Shandong Province, 266101

Applicant after: Beibei (Qingdao) Technology Co.,Ltd.

Address before: 100192 a203a, 2 / F, building B-2, Dongsheng Science Park, Zhongguancun, 66 xixiaokou Road, Haidian District, Beijing

Applicant before: DATABAKER (BEIJNG) TECHNOLOGY Co.,Ltd.

Country or region before: China

GR01 Patent grant