WO2020134896A1

WO2020134896A1 - Method and device for invoking speech synthesis file

Info

Publication number: WO2020134896A1
Application number: PCT/CN2019/122545
Authority: WO
Inventors: 韩喆; 王磊; 傅春霖
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2018-12-26
Filing date: 2019-12-03
Publication date: 2020-07-02
Also published as: CN110021291A; CN110021291B; TW202027027A

Abstract

A method and a device for invoking a speech synthesis file. Said method comprises: detecting whether a client has a speech synthesis file needed to be used by a registered app (S101), the registered app being a pre-registered app requiring the use of a speech synthesis file; if it is detected that the client does not have a speech synthesis file, downloading, according to a pre-stored speech configuration file corresponding to the registered app, a speech synthesis file from a server side corresponding to the registered app (S103), the speech configuration file including the download address of the speech synthesis file; and if it is detected that the client has the speech synthesis file, invoking the speech synthesis file of the client (S102), so that the registered app performs speech playback according to the speech synthesis file. When a registered app needs to use a speech synthesis file, whether a client has a speech synthesis file is detected, and if the client has a speech synthesis file, the speech synthesis file cached in the client is preferentially invoked, reducing the response time of the entire speech system.

Description

Calling method and device of speech synthesis file

Technical field

This specification relates to the field of computers, and in particular to a method and device for calling a speech synthesis file.

Background art

With the development of the Internet, multi-party cooperation is found in more and more areas. When a large-scale voice system is built, the terminal framework and the server are built by the operator, but the applications on the terminal are developed jointly by multiple ISVs (independent software vendors).

In existing large-scale voice systems, whenever an APP developed by an ISV calls a speech synthesis file for voice playback, the server must synthesize the speech synthesis file each time, and the file is then downloaded to the terminal for calling. This process increases the response time of the system and, in severe cases, can paralyze the entire system, affecting its normal operation.

Summary of the invention

The embodiments of the present specification provide a method and a device for calling a speech synthesis file, which solve the problems raised by the background art mentioned above.

To solve the above technical problems, the embodiments of this specification are implemented as follows:

A method for calling a speech synthesis file provided by an embodiment of this specification includes:

Detecting whether a speech synthesis file required by a registered APP exists on the client, the registered APP being an APP that has registered in advance and needs to use a speech synthesis file;

If it is detected that the speech synthesis file does not exist on the client, downloading the speech synthesis file from the server corresponding to the registered APP according to a pre-stored voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the speech synthesis file;

If it is detected that the voice synthesis file exists on the client, the voice synthesis file of the client is invoked for the registered APP to perform voice playback according to the voice synthesis file.

Optionally, before detecting whether there is a voice synthesis file required for the registered APP on the client, the method further includes:

Pull the voice configuration file from the server corresponding to the registered APP;

Receiving the voice configuration file delivered by the server corresponding to the registered APP, the delivered voice configuration file including first verification information that the server assigns to the registered APP after encrypting the delivered voice configuration file;

Determine whether the first verification information matches the second verification information pre-stored by the client;

When it is determined that the first verification information matches the second verification information pre-stored by the client, it is verified that the delivered voice configuration file is correct.

Optionally, determining whether the first verification information matches the second verification information pre-stored by the client specifically includes:

Determining, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the secure operating environment of the client;

Determine whether the first verification information matches the second verification information.

Optionally, before the voice configuration file is pulled from the server corresponding to the registered APP, the method further includes:

Sending voice data provided by the APP developer, which reflects the characteristics of the APP developer, to the server corresponding to the registered APP, so that the server trains the APP developer's customized voice model from its built-in voice basic training model and inputs pre-stored text into the customized voice model to generate the speech synthesis file required by the registered APP. The voice basic training model is a model that is trained, according to the registered APPs' need to play voice, on a number of voice samples provided in advance, and that can be shared by registered APPs.

Optionally, before the registered APP performs voice playback according to the voice synthesis file, the method further includes:

Calculating a first digest value corresponding to the speech synthesis file;

Determining whether the second digest value corresponding to the speech synthesis file, pre-stored in the voice configuration file, is the same as the first digest value;

If it is determined that the second digest value is the same as the first digest value, the registered APP performs voice playback according to the voice synthesis file.

Optionally, the registered APP performing voice playback according to the speech synthesis file specifically includes: the server corresponding to the registered APP encrypts the speech synthesis file according to a preset rule; and the registered APP performs voice playback after the encrypted speech synthesis file is decrypted by a built-in decryption module.

An apparatus for invoking a speech synthesis file provided by an embodiment of this specification, the apparatus includes:

The detection unit is used to detect whether a speech synthesis file required by the registered APP exists on the client, the registered APP being an APP that has registered in advance and needs to use a speech synthesis file;

The downloading unit is configured to download the speech synthesis file from the server corresponding to the registered APP, according to the pre-stored voice configuration file corresponding to the registered APP, if it is detected that the speech synthesis file does not exist on the client, the voice configuration file having a built-in download address for the speech synthesis file;

The calling unit is configured to call the voice synthesis file of the client if it is detected that the client has the voice synthesis file, so that the registered APP can perform voice playback according to the voice synthesis file.

Optionally, the device further includes:

A pulling unit, configured to pull the voice configuration file from the server corresponding to the registered APP;

A receiving unit, configured to receive the voice configuration file delivered by the server corresponding to the registered APP, the delivered voice configuration file including first verification information that the server assigns to the registered APP after encrypting the delivered voice configuration file;

The judging unit is used to judge whether the first verification information matches the second verification information pre-stored by the client;

The verification unit is configured to verify that the voice configuration file delivered is correct when it is determined that the first verification information matches the second verification information pre-stored by the client.

Optionally, the judgment unit is specifically used to:

Optionally, the device further includes:

The training unit is configured to send the voice data provided by the APP developer, which reflects the characteristics of the APP developer, to the server corresponding to the registered APP, so that the server trains the APP developer's customized voice model from the built-in voice basic training model and generates the speech synthesis file corresponding to the registered APP from the customized voice model according to the pre-stored text. The voice basic training model is a model that is trained, according to the registered APPs' need to play voice, on a number of voice samples provided in advance, and that can be shared by registered APPs.

Optionally, the device further includes:

A calculation unit, configured to calculate a first digest value corresponding to the speech synthesis file;

The judging unit is further used to judge whether the second digest value corresponding to the speech synthesis file, pre-stored in the voice configuration file, is the same as the first digest value;

If the judgment unit judges that the second digest value is the same as the first digest value, the registered APP performs voice playback according to the voice synthesis file.

A voice system provided by an embodiment of this specification includes a terminal and a server, and the terminal includes a voice SDK running in the terminal, a registered APP, and an APP developer terminal;

The APP developer terminal is used to send the voice data provided by the APP developer reflecting the characteristics of the APP developer to the server corresponding to the registered APP;

The server is used to train the APP developer's customized voice model from the built-in voice basic training model, and to input the pre-stored text into the customized voice model to generate the speech synthesis file required by the registered APP. The voice basic training model is a model that is trained, according to the registered APPs' need to play voice, on a number of voice samples provided in advance, and that can be shared by registered APPs;

The voice SDK is used to: pull the voice configuration file from the server corresponding to the registered APP; receive the voice configuration file delivered by the server corresponding to the registered APP, the delivered voice configuration file including first verification information that the server assigns to the registered APP after encrypting the delivered voice configuration file; determine whether the first verification information matches second verification information pre-stored by the client; when the first verification information matches the second verification information, verify that the delivered voice configuration file is correct; detect whether a speech synthesis file required by the registered APP exists on the client, the registered APP being an APP that has registered in advance and needs to use a speech synthesis file; if the speech synthesis file does not exist on the client, download it from the server corresponding to the registered APP according to the voice configuration file corresponding to the registered APP, the voice configuration file having a built-in download address for the speech synthesis file; and if the speech synthesis file exists on the client, call the speech synthesis file of the client so that the registered APP performs voice playback according to it.

A computer-readable medium provided by an embodiment of the present specification has stored thereon computer-readable instructions, and the computer-readable instructions may be executed by a processor to perform the following steps:

An apparatus for calling a speech synthesis file provided by an embodiment of this specification includes a memory for storing computer program instructions and a processor for executing program instructions, where the computer program instructions, when executed by the processor, trigger the device to perform the following steps:

The above at least one technical solution adopted by the embodiments of the present specification can achieve the following beneficial effects:

1. When a registered APP needs to use a speech synthesis file, detect whether the client caches the speech synthesis file, and preferentially call the speech synthesis file cached by the client when the client has the speech synthesis file, to reduce the response time of the entire speech system;

2. The APP developer can have a customized voice model trained by the server corresponding to the registered APP; the pre-stored text is then input into the customized voice model to generate the APP developer's speech synthesis file, and when the registered APP needs to use the speech synthesis file, the corresponding file is downloaded for the registered APP to play voice;

3. The voice system can support multiple registered APPs, so that the voice system is fully utilized.

Brief description of the drawings

To explain the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments of this specification. Those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a method for invoking a speech synthesis file provided in Embodiment 1 of the present specification;

FIG. 2 is a schematic flowchart of a method for invoking a speech synthesis file provided in Embodiment 2 of this specification;

FIG. 3 is a schematic structural diagram of an apparatus for invoking a speech synthesis file provided in Embodiment 3 of this specification;

FIG. 4 is a schematic structural diagram of a voice system provided in Embodiment 4 of the present specification.

Detailed description

To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art from the embodiments of this specification without creative work shall fall within the scope of protection of this application.

FIG. 1 is a schematic flowchart of a method for calling a speech synthesis file provided by an embodiment of the present specification. The schematic flowchart includes:

Step S101: Detect whether the speech synthesis file required by the registered APP exists on the client. If it exists, step S102 is executed; if it does not exist, step S103 is executed.

In step S101 of this embodiment, the detection of whether the speech synthesis file required by the registered APP exists on the client can be performed by the voice SDK. The voice SDK provides an interface to which multiple APPs can connect at the same time; an APP registers with the voice SDK by connecting its data to the voice SDK. The registered APP is an application that has registered with the voice SDK in advance and needs a speech synthesis file. In this embodiment, the voice SDK is a framework that APP developers use when developing software.
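The registration relationship described above can be pictured with a minimal sketch. The class name `VoiceSDK` and its methods `register` and `is_registered` are hypothetical and are not defined in this specification; they only illustrate one SDK instance accepting several APPs at once.

```python
class VoiceSDK:
    """Hypothetical sketch: one SDK instance tracks several registered APPs."""

    def __init__(self):
        self._registered = {}  # app_id -> whether the APP needs speech files

    def register(self, app_id, needs_speech=True):
        # Registration connects the APP to the SDK and records whether
        # the APP requires speech synthesis files.
        self._registered[app_id] = needs_speech

    def is_registered(self, app_id):
        # Only APPs that registered in advance and need speech files count.
        return self._registered.get(app_id, False)


sdk = VoiceSDK()
sdk.register("pay_app")
sdk.register("map_app")
```

Here `pay_app` and `map_app` are placeholder identifiers; any lookup for an APP that never registered returns `False`.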

In step S101 of this embodiment, the speech synthesis file is produced by the server corresponding to the registered APP according to the needs of the APP developer. First, the APP developer sends voice data reflecting the characteristics of the APP developer to the server corresponding to the registered APP. The server then trains the APP developer's customized voice model from the built-in voice basic training model and inputs the pre-stored text into the customized voice model to generate the speech synthesis file required by the registered APP. The voice basic training model can be shared by registered APPs and is trained on a number of voice samples provided in advance, according to the registered APPs' need to play voice. These voice samples are high-quality voice data stored on the server corresponding to the registered APP.

Further, in step S101 of this embodiment, the amount of high-quality voice data sampled for the voice basic training model is determined by the accuracy required of the entire voice system. When high accuracy is required, the sampling time can be 300 hours; when the required accuracy is lower, a sampling time of 100 hours can be selected.

In step S101 of this embodiment, after the server corresponding to the registered APP has trained the voice basic training model, the APP developer uploads voice data reflecting the characteristics of the APP developer to that server, and a customized voice model for the APP developer is trained from the voice basic training model. The voice data reflecting the characteristics of the APP developer is voice data recorded according to the language environment required by the APP developer. The APP developer only needs to upload a small amount of voice data to the server corresponding to the registered APP. The voice basic training model can be understood as an intermediate model, trained on a large data set, that the server corresponding to the registered APP provides to the APP developer; the intermediate model is then tuned on the voice data uploaded by the APP developer to obtain a customized voice model reflecting the characteristics of the APP developer.

In step S101 of this embodiment, the voice data uploaded by the APP developer needs to be reviewed. After the customized voice model reflecting the characteristics of the APP developer is generated, the management personnel of the voice system review it. Under one review mechanism, the customized voice model can be used normally only after it has been approved; that is, a customized voice model that has been generated but not yet approved by the reviewer cannot be used. Under another review mechanism, the registered APP can use the customized voice model normally regardless of whether the review has been passed, but once the reviewer finds the customized voice model unqualified, the model becomes invalid.
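The difference between the two review mechanisms can be made concrete with a small sketch. The enum and function names below (`ReviewStatus`, `ReviewPolicy`, `model_usable`) are illustrative assumptions and do not appear in this specification.

```python
from enum import Enum


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


class ReviewPolicy(Enum):
    APPROVE_BEFORE_USE = "approve_before_use"  # first mechanism
    USE_UNTIL_REJECTED = "use_until_rejected"  # second mechanism


def model_usable(policy, status):
    """Return True if the customized voice model may be used."""
    if policy is ReviewPolicy.APPROVE_BEFORE_USE:
        # Usable only once the reviewer has approved it.
        return status is ReviewStatus.APPROVED
    # USE_UNTIL_REJECTED: usable while pending or approved,
    # invalid as soon as the reviewer rejects it.
    return status is not ReviewStatus.REJECTED
```

The two policies differ only in how a model that is still pending review is treated.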

In step S101 of this embodiment, if the APP developer does not adopt this solution, the customization requirement has to be met in one of two traditional ways. In the first, the APP developer directly uploads the voice data reflecting the characteristics of the APP developer without any processing, which results in low robustness. In the second, the APP developer separately produces a customized voice model reflecting the characteristics of the APP developer, which takes a long time and cannot guarantee the quality of the customized voice model.

In step S101 of the embodiment of the present specification, the voice system can also be applied to a video system, that is, the video basic training model is stored in the server corresponding to the registered APP.

Step S102: Invoking the speech synthesis file of the client.

In step S102 of this embodiment, when an application registered with the voice SDK needs to use a speech synthesis file, the voice SDK first detects whether the file exists on the client. When the file to be called exists on the client, the speech synthesis file stored on the client is called, and the registered APP can play voice according to it.

Step S103: Download the voice synthesis file from the server corresponding to the registered APP according to the pre-stored voice configuration file corresponding to the registered APP.

In step S103 of this embodiment, the speech synthesis file is generated from the pre-stored text by the APP developer's customized voice model. If the detection in step S101 finds that the speech synthesis file does not exist, the speech synthesis file has never been downloaded by the registered APP before.

In step S103 of this embodiment, the voice configuration file has a built-in download address of the speech synthesis file, and the registered APP downloads the required speech synthesis file from that address so that the registered APP can perform voice playback according to the speech synthesis file.

In step S103 of the embodiment of the present specification, before the registered APP performs voice playback according to the voice synthesis file, the voice synthesis file also needs to be verified, and the specific steps may be:

Step 1. Calculate the first summary value corresponding to the speech synthesis file.

In step 1 of this embodiment, the first digest value corresponding to the speech synthesis file is a parameter value used to check whether the downloaded speech synthesis file contains errors or has been tampered with. In this embodiment, an MD5 digest can be used. MD5 is a widely used cryptographic hash function that generates a 128-bit (16-byte) hash value, which can be used to check whether a downloaded file contains errors or has been tampered with. For example, under Unix, many software downloads are accompanied by a file with the same file name and the extension .md5. This file usually contains a single line of text with the following general structure:

MD5(tanajiya.tar.gz)=38b8c2c1093dd0fec383a9d9ac940515

This line gives the MD5 digest (loosely, the digital signature) of the tanajiya.tar.gz file. MD5 treats the entire file as one large text message and, through its irreversible transformation algorithm, produces this MD5 message digest. By analogy, every person has a unique fingerprint, which is often the most trusted means for the judiciary to identify criminals; similarly, MD5 generates a "digital fingerprint" for any file, regardless of its size or format, and if anyone changes the file in any way, its MD5 value, that is, its "digital fingerprint", also changes. The MD5 value published by a download site lets us run an MD5 check on the downloaded file with dedicated software (such as Windows MD5 Check) to confirm that the file we obtained is the same as the file provided by the site. For example, the download server provides the MD5 value of a file in advance; after the user downloads the file, the MD5 value of the downloaded file is recalculated, and by comparing the two values the user can determine whether the downloaded file contains errors or has been tampered with.
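As a minimal sketch of the check described above, the following parses a checksum line of the form shown and compares it with a freshly computed MD5 value. The helper names `parse_md5_line` and `matches_published_md5` are hypothetical; only `hashlib` and `re` from the Python standard library are used.

```python
import hashlib
import re

# Hypothetical helper: parse a line such as "MD5(name)=<32 hex digits>".
_LINE = re.compile(r"^MD5\((?P<name>[^)]+)\)\s*=\s*(?P<digest>[0-9a-fA-F]{32})$")


def parse_md5_line(line):
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError("not an MD5 checksum line")
    return match.group("name"), match.group("digest").lower()


def matches_published_md5(data, published_hex):
    # Recompute the digest of the downloaded bytes and compare it
    # with the value published by the download site.
    return hashlib.md5(data).hexdigest() == published_hex.lower()


name, digest = parse_md5_line(
    "MD5(tanajiya.tar.gz)=38b8c2c1093dd0fec383a9d9ac940515"
)
```

The comparison only shows that the downloaded bytes match the published value; it says nothing about who published that value.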

In step 1 of this embodiment, the first digest value is calculated to check whether the downloaded speech synthesis file contains errors or has been tampered with, so that errors in the speech synthesis file are detected in real time. Once an error occurs in the content of the speech synthesis file, it is reported directly, preventing the error from spreading in the application. The check of the speech synthesis file can also be implemented with a SHA256 digest.

Step 2: Determine whether the second digest value corresponding to the pre-stored voice synthesis file in the voice configuration file is the same as the first digest value. If they are the same, perform step 3; if they are not the same, return to step S103.

Step 3: The registered APP performs voice playback according to the voice synthesis file.
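Steps 1 to 3 can be sketched as a small loop, shown here under stated assumptions. The function name `fetch_verified`, the `download` callable, and the `max_attempts` limit are hypothetical; the specification only says that a mismatch returns to step S103 and does not state a retry limit.

```python
import hashlib


def fetch_verified(download, expected_hex, algorithm="md5", max_attempts=3):
    """Download until the digest matches the value from the configuration file.

    `download` is a hypothetical callable returning the file bytes.
    `max_attempts` is an assumption added to keep the loop bounded.
    """
    for _ in range(max_attempts):
        data = download()                                  # step S103
        first = hashlib.new(algorithm, data).hexdigest()   # step 1
        if first == expected_hex.lower():                  # step 2
            return data                                    # step 3: ready to play
    raise ValueError("digest mismatch after %d attempts" % max_attempts)
```

Passing `algorithm="sha256"` switches the same check to a SHA256 digest.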

In step 3 of this embodiment, the server corresponding to the registered APP can encrypt the speech synthesis file with its built-in private key. To play the encrypted speech synthesis file, the file must first be decrypted with the public key stored in the decryption module, after which the voice is played.

In step S103 of this embodiment, a general voice database is configured in the voice basic training model, and the general voice database includes voice broadcasts of transaction amounts and times. That is, when numbers appear in the text, the APP developer's customized voice model can convert them directly into a speech synthesis file that reads them as a transaction amount or a time, rather than reading out the digits one by one. For example, when the text is "5:00", the speech played from the speech synthesis file is "5 o'clock".
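The "5:00" example can be illustrated with a toy text rule. This is only a sketch of the idea of reading a number as a time; the specification does not describe how the voice model performs the conversion, and the function name `normalize_times` and the whole-hour rule are assumptions.

```python
import re

# Hypothetical rule: a whole-hour time "H:00" is read as "H o'clock".
_WHOLE_HOUR = re.compile(r"\b(\d{1,2}):00\b")


def normalize_times(text):
    # int() drops a leading zero, so "05:00" is also read as "5 o'clock".
    return _WHOLE_HOUR.sub(lambda m: f"{int(m.group(1))} o'clock", text)
```

A real system would need many more rules, for example for amounts and for times that are not on the hour.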

In the above steps, when a registered APP needs to use a speech synthesis file, it is detected whether the client has cached the speech synthesis file, and when the client has the speech synthesis file, the cached copy is called preferentially, which reduces the response time of the entire voice system.
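The cache-first flow of steps S101 to S103 can be sketched as follows. The function name `invoke_speech_file`, the `download` callable, and the file name `greeting.wav` are hypothetical; the client cache is modelled as a plain directory for illustration.

```python
import os
import tempfile


def invoke_speech_file(cache_dir, file_name, download):
    """Return (path, source) for the speech synthesis file.

    `download` is a hypothetical callable that returns the file bytes from
    the address given in the voice configuration file.
    """
    path = os.path.join(cache_dir, file_name)
    if os.path.exists(path):          # S101: file already on the client
        return path, "cache"          # S102: invoke the cached copy
    data = download()                 # S103: fetch from the server
    with open(path, "wb") as handle:
        handle.write(data)
    return path, "download"


with tempfile.TemporaryDirectory() as cache_dir:
    first = invoke_speech_file(cache_dir, "greeting.wav", lambda: b"audio")
    second = invoke_speech_file(cache_dir, "greeting.wav", lambda: b"audio")
```

The first call downloads and stores the file; the second call finds it in the cache and skips the server entirely, which is where the response-time saving comes from.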

Further, in order that the voice system can be applied in a secure environment, changes are made to the above embodiments. FIG. 2 is a schematic flowchart of a method for calling a speech synthesis file provided by an embodiment of the present specification. The schematic flowchart includes:

Step S201: Pull the voice configuration file from the server corresponding to the registered APP.

In step S201 of the embodiment of the present specification, the customized speech model corresponding to the registered APP converts the pre-stored text into a speech synthesis file, and the speech configuration file corresponding to the registered APP includes the speech list of the speech synthesis file.
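One possible shape of the voice configuration file is sketched below. The specification states only that the file includes a speech list, a built-in download address, and a pre-stored digest value; the JSON layout, the field names, and the example URL are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class SpeechEntry:
    name: str          # entry in the speech list
    download_url: str  # download address of the speech synthesis file
    digest: str        # pre-stored digest value used for verification


def parse_voice_config(raw_json):
    """Parse a hypothetical JSON layout of the voice configuration file."""
    doc = json.loads(raw_json)
    return [SpeechEntry(**item) for item in doc["speech_list"]]


example = json.dumps({
    "speech_list": [{
        "name": "payment_received",
        "download_url": "https://example.com/payment_received.wav",
        "digest": "38b8c2c1093dd0fec383a9d9ac940515",
    }]
})
entries = parse_voice_config(example)
```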

Step S202: Receive the voice configuration file delivered by the server corresponding to the registered APP. The delivered voice configuration file includes first verification information that the server corresponding to the registered APP assigns to the registered APP after encrypting the delivered voice configuration file.

In step S202 of this embodiment, the developer's APP registers with the voice SDK, and the voice SDK is connected to a decryption module. The decryption module can be issued a public key for decryption through the TSM; this public key is the unique public key corresponding to the registered APP. The server is configured with the corresponding private key, and the server corresponding to the registered APP encrypts the delivered voice configuration file with the private key. The public key and the private key form a key pair: the public key is the public part of the pair, and the private key is the non-public part. The key pair can be guaranteed to be unique. When this key pair is used, data encrypted with one of the keys must be decrypted with the other. For example, data encrypted with the public key must be decrypted with the private key, and data encrypted with the private key must be decrypted with the public key; otherwise decryption fails.
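The key-pair property described above can be demonstrated with textbook RSA on tiny numbers. This is a toy illustration only: the specification does not name the algorithm, the primes below are far too small to be secure, and real systems use a vetted cryptographic library with padding.

```python
# Textbook RSA with tiny primes, for illustration only (NOT secure).
p, q = 61, 53
n = p * q                 # modulus shared by both keys
phi = (p - 1) * (q - 1)   # 3120
e = 17                    # public exponent
d = 2753                  # private exponent, chosen so (e * d) % phi == 1


def transform(value, exponent):
    # The same modular exponentiation serves both keys.
    return pow(value, exponent, n)


message = 65
# Encrypt with the private key, decrypt with the public key.
via_private_then_public = transform(transform(message, d), e)
# Encrypt with the public key, decrypt with the private key.
via_public_then_private = transform(transform(message, e), d)
# Applying the same key twice does not recover the message.
via_private_twice = transform(transform(message, d), d)
```

Both mixed-key round trips return the original value, while using the private key for both operations does not.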

Further, in step S202 of this embodiment, the decryption module may be an SE module, which is a module that ensures system security. It uses a security chip and a chip operating system (COS) to implement functions such as secure storage of data and encryption and decryption operations. The main functions of the SE module in the security system include secure storage of keys, data encryption operations, and secure storage of information. Secure storage of keys makes it possible to establish a relatively complete key management system and ensures that keys cannot be read out. Data encryption operations include support for reliable security algorithms, ciphertext transmission of sensitive data, and tamper resistance of transmitted data. Secure storage of information refers to a strict file access authority mechanism and reliable authentication algorithms and processes. In this embodiment, the public key is placed in the SE module. SE modules can be packaged in various forms; common ones include smart cards and embedded security modules (eSE). In this embodiment, an embedded security module (eSE) can be implanted for the voice SDK of the voice system, using a smart security chip that meets the CC EAL5+ security level, with a built-in secure operating system that meets the terminal's needs for secure key storage and data encryption services. The voice system can be widely used in finance, map navigation, urban transportation, medical treatment, retail, and other fields, and the security of the system is protected when it is used.

Step S203: Determine whether the first verification information matches the second verification information pre-stored by the client. If so, step S204 is executed; if not, the process ends.

In step S203 of this embodiment, the second verification information corresponding to the registered APP, pre-stored in the secure operating environment of the client, is determined according to the identifier of the registered APP; it is then determined whether the first verification information and the second verification information match. The identifier of the registered APP is the identity information of the registered APP.
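A minimal sketch of the lookup-and-match step follows. The specification does not define what "match" means, so byte-for-byte equality is assumed here; the dictionary `SECURE_STORE` is a hypothetical stand-in for the client's secure operating environment, and the identifiers and values are placeholders.

```python
import hmac

# Hypothetical stand-in for the client's secure operating environment:
# second verification information keyed by registered-APP identifier.
SECURE_STORE = {"pay_app": b"token-for-pay-app"}


def config_is_trusted(app_id, first_verification):
    second = SECURE_STORE.get(app_id)   # look up by APP identifier
    if second is None:
        return False                    # unknown APP: the process ends
    # Constant-time comparison of first and second verification information.
    return hmac.compare_digest(first_verification, second)
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not depend on where the two values first differ.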

Step S204, verify that the delivered voice configuration file is correct.

Step S205: Detect whether the speech synthesis file required by the registered APP exists on the client. If it exists, step S206 is executed; if it does not exist, step S207 is executed.

Step S205 of this embodiment is the same as step S101 described above and is not repeated here.

Step S206, calling the speech synthesis file of the client.

Step S206 in the embodiment of the present specification is the same as step S102 described above and is not repeated here.

Step S207: Download the voice synthesis file from the server corresponding to the registered APP according to the voice configuration file corresponding to the registered APP.

Step S207 in the embodiment of the present specification is the same as step S103 described above and is not repeated here.
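
Steps S205 to S207 amount to a cache-first lookup, which can be sketched as follows. This is an illustration under assumptions, not the implementation of the embodiment: the configuration keys `file_name` and `download_url`, the function name `get_synthesis_file` and the injected `download` callable are hypothetical, and the actual network transfer is deliberately left to the caller.

```python
from pathlib import Path
from typing import Callable


def get_synthesis_file(
    cache_dir: Path,
    config: dict,
    download: Callable[[str], bytes],
) -> Path:
    """Return the local path of the synthesis file, downloading only if absent.

    `config` stands in for the voice configuration file; its keys
    ("file_name", "download_url") are assumptions made for this sketch.
    """
    local_path = cache_dir / config["file_name"]
    if local_path.exists():
        # S206: the file is already on the client, so the cached copy is used.
        return local_path
    # S207: fetch the file from the download address in the configuration file.
    data = download(config["download_url"])
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path.write_bytes(data)
    return local_path
```

Calling the function twice with the same configuration triggers at most one download, which is the source of the reduced response time described for the method.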

Further, the voice system in this embodiment also has a synchronization problem between the server and the registered APP. To solve this problem, the server can support an active push method; that is, when the voice synthesis file used by the client changes, the server actively pushes the change to the client.
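
The specification does not describe how the client reacts to a push. One possible reaction, sketched below purely as an assumption, is to discard the cached file when the pushed configuration describes a different file, so that the next call falls through to the download step. The function name `handle_push` and the configuration keys `file_name` and `digest` are hypothetical.

```python
from pathlib import Path


def handle_push(cache_dir: Path, cached_config: dict, pushed_config: dict) -> dict:
    """Apply a pushed configuration and drop the cached file if it is stale.

    Both dictionaries are assumed to carry "file_name" and "digest" keys;
    the specification does not define these fields.
    """
    if pushed_config["digest"] != cached_config["digest"]:
        # The server announced a changed file: remove the stale local copy so
        # that the next detection step finds no file and downloads the new one.
        stale_path = cache_dir / cached_config["file_name"]
        stale_path.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    # The pushed configuration becomes the current configuration.
    return pushed_config
```

Invalidating rather than eagerly downloading keeps the push handler cheap; an implementation could equally download the new file immediately.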

FIG. 3 is a schematic structural diagram of a calling device for a speech synthesis file provided by an embodiment of the present specification. The device includes: a detection unit 1, a calling unit 2, a downloading unit 3, a pulling unit 4, a receiving unit 5, a judging unit 6, a verification unit 7, a training unit 8 and a calculation unit 9.

The detection unit 1 is used to detect whether the voice synthesis file required by the registered APP exists on the client, the registered APP being an APP registered in advance that needs to use a voice synthesis file.

The calling unit 2 is used to call the voice synthesis file of the client if it is detected that the client has a voice synthesis file, so that the registered APP can perform voice playback according to the voice synthesis file.

The downloading unit 3 is used to download the voice synthesis file from the server corresponding to the registered APP, according to the pre-stored voice configuration file corresponding to the registered APP, if it is detected that the voice synthesis file does not exist on the client. The voice configuration file contains the download address of the voice synthesis file.

The pulling unit 4 is used to pull the voice configuration file from the server corresponding to the registered APP.

The receiving unit 5 is used to receive the voice configuration file delivered by the server corresponding to the registered APP. The delivered voice configuration file includes first verification information that the server corresponding to the registered APP assigns to the registered APP after encrypting the delivered voice configuration file.

The judging unit 6 is used to judge whether the first verification information matches the second verification information pre-stored by the client;

The verification unit 7 is configured to verify that the delivered voice configuration file is correct when it is determined that the first verification information matches the second verification information pre-stored by the client.

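The judging unit 6 is specifically used for:

Determining, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the secure operating environment of the client, and determining whether the first verification information matches the second verification information;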

The training unit 8 is used to send voice data, provided by the APP developer and reflecting the characteristics of the APP developer, to the server corresponding to the registered APP, so that the server corresponding to the registered APP trains the APP developer's customized voice model through the built-in voice basic training model, and the APP developer's customized voice model generates the voice synthesis file corresponding to the registered APP from the pre-stored text. The voice basic training model is a model that is trained on a number of voice samples provided in advance according to the registered APP's voice playback needs and that can be shared by registered APPs.

The calculation unit 9 is used to calculate the first digest value corresponding to the speech synthesis file;

The judging unit 6 is also used to judge whether the second digest value corresponding to the pre-stored voice synthesis file in the voice configuration file is the same as the first digest value;

If the judging unit 6 judges that the second digest value is the same as the first digest value, the registered APP performs voice playback according to the voice synthesis file.
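
The digest check performed by the calculation unit and the judging unit can be sketched as below. The specification does not name a digest algorithm, so SHA-256 is an assumption here, as are the function names `file_digest` and `may_play` and the configuration key `digest`.

```python
import hashlib
import hmac


def file_digest(data: bytes) -> str:
    """Compute the first digest value from the synthesis file's bytes.

    SHA-256 is an assumption; the specification only speaks of a digest value.
    """
    return hashlib.sha256(data).hexdigest()


def may_play(data: bytes, config: dict) -> bool:
    """Allow playback only when the computed digest equals the stored one.

    `config["digest"]` stands in for the second digest value pre-stored in
    the voice configuration file; the key name is hypothetical.
    """
    first_digest = file_digest(data)
    second_digest = config["digest"]
    # Constant-time comparison of the two hexadecimal digest strings.
    return hmac.compare_digest(first_digest, second_digest)
```

A file that was modified after the configuration file was issued yields a different first digest value, so `may_play` returns `False` and the registered APP does not play it.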

The registered APP performing voice playback according to the speech synthesis file includes: the server corresponding to the registered APP encrypts the speech synthesis file according to a preset rule; after the encrypted speech synthesis file is decrypted by the built-in decryption module, the registered APP performs voice playback.

The embodiments of the present specification also provide a computer-readable medium on which computer-readable instructions are stored. The computer-readable instructions can be executed by a processor to perform the following steps:

Detect whether the voice synthesis file required by the registered APP exists on the client, the registered APP being an APP registered in advance that needs to use a voice synthesis file;

If it is detected that the client does not have a voice synthesis file, download the voice synthesis file from the server corresponding to the registered APP according to the pre-stored voice configuration file corresponding to the registered APP, and the voice configuration file has a built-in download address for the voice synthesis file;

If it is detected that the client has a voice synthesis file, the client's voice synthesis file is called to allow the registered APP to perform voice playback according to the voice synthesis file.

An embodiment of the present specification also provides a calling device for a speech synthesis file. The device includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to perform the following:

The detection unit is used to detect whether the voice synthesis file required by the registered APP exists on the client, the registered APP being an APP registered in advance that needs to use a voice synthesis file;

The downloading unit is used to download the voice synthesis file from the server corresponding to the registered APP, according to the pre-stored voice configuration file corresponding to the registered APP, if it is detected that the voice synthesis file does not exist on the client, the voice configuration file containing the download address of the voice synthesis file;

The calling unit is used to call the voice synthesis file of the client if it is detected that the client has a voice synthesis file, so that the registered APP can perform voice playback according to the voice synthesis file.

The server is used to train the APP developer's customized voice model through the built-in voice basic training model, and to input the pre-stored text into the APP developer's customized voice model to generate the voice synthesis file required by the registered APP. The voice basic training model is a model that is trained on a number of voice samples provided in advance according to the registered APP's voice playback needs and that can be shared by registered APPs;

The voice SDK is used to: pull the voice configuration file from the server corresponding to the registered APP; receive the voice configuration file delivered by the server corresponding to the registered APP, the delivered voice configuration file including first verification information that the server assigns to the registered APP after encrypting the delivered voice configuration file; determine whether the first verification information matches the second verification information pre-stored by the client; when the first verification information matches the second verification information, verify that the delivered voice configuration file is correct; detect whether the client has the voice synthesis file required by the registered APP, the registered APP being an APP registered in advance that needs to use a voice synthesis file; if it is detected that the voice synthesis file does not exist on the client, download the voice synthesis file from the server corresponding to the registered APP according to the voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the voice synthesis file; and if it is detected that the voice synthesis file exists on the client, call the voice synthesis file of the client so that the registered APP performs voice playback according to the voice synthesis file.

Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-permanent memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media, including permanent and non-permanent, removable and non-removable media, can store information by any method or technology. The information may be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media does not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. In the absence of further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.

The above are only examples of this specification, and are not intended to limit this specification. For those skilled in the art, this specification may have various modifications and changes. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principle of this specification shall be included in the scope of the claims of this specification.

Claims

A method for calling a speech synthesis file, characterized in that the method includes:

Detecting whether the voice synthesis file required by the registered APP exists on the client, the registered APP being an APP registered in advance that needs to use a voice synthesis file;

If it is detected that the voice synthesis file does not exist on the client, the voice synthesis file is downloaded from the server corresponding to the registered APP according to the pre-stored voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the voice synthesis file;

If it is detected that the voice synthesis file exists on the client, the voice synthesis file of the client is invoked for the registered APP to perform voice playback according to the voice synthesis file.
The method for invoking a speech synthesis file according to claim 1, wherein before the detecting whether there is a speech synthesis file required for the registered APP on the client, the method further comprises:

Pulling the voice configuration file from the server corresponding to the registered APP;

Receiving the voice configuration file delivered by the server corresponding to the registered APP, the delivered voice configuration file including first verification information that the server corresponding to the registered APP assigns to the registered APP after encrypting the delivered voice configuration file;

Determining whether the first verification information matches the second verification information pre-stored by the client;

When it is determined that the first verification information matches the second verification information pre-stored by the client, verifying that the delivered voice configuration file is correct.
The method for calling a speech synthesis file according to claim 2, wherein determining whether the first verification information matches the second verification information stored in advance by the client specifically includes:

Determining, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the secure operating environment of the client;

Determining whether the first verification information matches the second verification information.
The method for invoking a voice synthesis file according to claim 2, wherein before the voice configuration file is pulled from the server corresponding to the registered APP, the method further comprises:

Sending voice data, provided by the APP developer and reflecting the characteristics of the APP developer, to the server corresponding to the registered APP, so that the server corresponding to the registered APP trains the APP developer's customized voice model through the built-in voice basic training model and inputs the pre-stored text into the APP developer's customized voice model to generate the voice synthesis file required by the registered APP, the voice basic training model being a model that is trained on a number of voice samples provided in advance according to the registered APP's voice playback needs and that can be shared by registered APPs.
The method for calling a speech synthesis file according to claim 1, wherein before the registered APP performs speech playback according to the speech synthesis file, the method further comprises:

Calculating a first digest value corresponding to the speech synthesis file;

Determining whether the second digest value corresponding to the speech synthesis file pre-stored in the speech configuration file is the same as the first digest value;

If it is determined that the second digest value is the same as the first digest value, the registered APP performs voice playback according to the voice synthesis file.
The method for invoking a voice synthesis file according to claim 1, wherein the registered APP performing voice playback according to the voice synthesis file specifically includes: the server corresponding to the registered APP encrypts the voice synthesis file according to a preset rule; after the encrypted voice synthesis file is decrypted by the built-in decryption module, the registered APP performs voice playback.
A voice synthesis file calling device, characterized in that the device includes:

The detection unit is used to detect whether the voice synthesis file required by the registered APP exists on the client, the registered APP being an APP registered in advance that needs to use a voice synthesis file;

The downloading unit is configured to download the voice synthesis file from the server corresponding to the registered APP, according to the pre-stored voice configuration file corresponding to the registered APP, if it is detected that the voice synthesis file does not exist on the client, the voice configuration file containing the download address of the voice synthesis file;

The calling unit is configured to call the voice synthesis file of the client if it is detected that the client has the voice synthesis file, so that the registered APP can perform voice playback according to the voice synthesis file.
The apparatus for calling a speech synthesis file according to claim 7, wherein the apparatus further comprises:

A pulling unit, configured to pull the voice configuration file from the server corresponding to the registered APP;

A receiving unit, configured to receive the voice configuration file delivered by the server corresponding to the registered APP, the delivered voice configuration file including first verification information that the server corresponding to the registered APP assigns to the registered APP after encrypting the delivered voice configuration file;

The judging unit is used to judge whether the first verification information matches the second verification information pre-stored by the client;

The verification unit is configured to verify that the voice configuration file delivered is correct when it is determined that the first verification information matches the second verification information pre-stored by the client.
The calling device of a speech synthesis file according to claim 8, wherein the judgment unit is specifically configured to:

Determine, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the secure operating environment of the client;

Determine whether the first verification information matches the second verification information.
The apparatus for calling a speech synthesis file according to claim 8, wherein the apparatus further comprises:

The training unit is configured to send voice data, provided by the APP developer and reflecting the characteristics of the APP developer, to the server corresponding to the registered APP, so that the server corresponding to the registered APP trains the APP developer's customized voice model through the built-in voice basic training model, and the APP developer's customized voice model generates the speech synthesis file corresponding to the registered APP from the pre-stored text, the voice basic training model being a model that is trained on a number of voice samples provided in advance according to the registered APP's voice playback needs and that can be shared by registered APPs.
The apparatus for calling a speech synthesis file according to claim 7, wherein the apparatus further comprises:

A calculation unit, configured to calculate a first digest value corresponding to the speech synthesis file;

The judging unit is also used to judge whether the second digest value corresponding to the speech synthesis file pre-stored in the speech configuration file is the same as the first digest value;

If the judging unit judges that the second digest value is the same as the first digest value, the registered APP performs voice playback according to the voice synthesis file.
The apparatus for calling a speech synthesis file according to claim 7, wherein the registered APP performs speech playback according to the speech synthesis file, specifically including:

The server corresponding to the registered APP encrypts the voice synthesis file according to a preset rule; after the encrypted voice synthesis file is decrypted according to the built-in decryption module, the registered APP performs voice playback.
A voice system, including a terminal and a server, the terminal includes a voice SDK running in the terminal, a registered APP, and an APP developer terminal;

The APP developer terminal is used to send the voice data provided by the APP developer reflecting the characteristics of the APP developer to the server corresponding to the registered APP;

The server is used to train the APP developer's customized voice model through the built-in voice basic training model, and to input the pre-stored text into the APP developer's customized voice model to generate the voice synthesis file required by the registered APP, the voice basic training model being a model that is trained on a number of voice samples provided in advance according to the registered APP's voice playback needs and that can be shared by registered APPs;

The voice SDK is used to: pull the voice configuration file from the server corresponding to the registered APP; receive the voice configuration file delivered by the server corresponding to the registered APP, the delivered voice configuration file including first verification information that the server corresponding to the registered APP assigns to the registered APP after encrypting the delivered voice configuration file; determine whether the first verification information matches the second verification information pre-stored by the client; when it is determined that the first verification information matches the second verification information pre-stored by the client, verify that the delivered voice configuration file is correct; detect whether the client has the voice synthesis file required by the registered APP, the registered APP being an APP registered in advance that needs to use a voice synthesis file; if it is detected that the voice synthesis file does not exist on the client, download the voice synthesis file from the server corresponding to the registered APP according to the voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the voice synthesis file; and if it is detected that the voice synthesis file exists on the client, call the voice synthesis file of the client so that the registered APP performs voice playback according to the voice synthesis file.
A computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the method of any one of claims 1 to 6.
A voice synthesis file calling device, the device including a memory for storing computer program instructions and a processor for executing the program instructions, wherein when the computer program instructions are executed by the processor, the device is triggered to execute the method according to any one of claims 1 to 6.