TW202027027A

TW202027027A - Method and device for invoking voice synthesis file

Info

Publication number: TW202027027A
Application number: TW108137036A
Authority: TW
Inventors: 韓喆; 王磊; 傅春霖
Original assignee: 香港商阿里巴巴集團服務有限公司
Priority date: 2018-12-26
Filing date: 2019-10-15
Publication date: 2020-07-16
Also published as: CN110021291A; WO2020134896A1; CN110021291B

Abstract

The invention discloses a method and device for invoking a speech synthesis file. The method includes: detecting whether a client has the speech synthesis file required by a registered APP, a registered APP being an APP that has registered in advance as needing to use speech synthesis files; if the client does not have the speech synthesis file, downloading it from the server corresponding to the registered APP according to a pre-stored voice configuration file corresponding to that APP, the voice configuration file containing the download address of the speech synthesis file; and if the client has the speech synthesis file, invoking the client's copy so that the registered APP can perform voice playback according to it. Because the client is checked whenever a registered APP needs a speech synthesis file, and a file cached on the client is invoked in preference to a download whenever it exists, the response time of the entire voice system is shortened.

Description

Method and device for calling speech synthesis files

This specification relates to the computer field, and in particular to a method and device for invoking speech synthesis files.

With the development of the Internet, multi-party cooperation has become common in more and more areas. When a large-scale voice system is built, the terminal framework and the server are built by the operator, but the terminal applications are developed jointly by multiple ISVs (independent software vendors). In existing large-scale voice systems, every time an APP developed by an ISV invokes a speech synthesis file for voice playback, the server must first synthesize the file, which is then downloaded to the terminal to be invoked. This process increases the system's response time and, in severe cases, can paralyze the entire system, affecting its normal operation.

The embodiments of this specification provide a method and device for invoking a speech synthesis file, which solve the problems of the prior art described above.
To solve the above technical problems, the embodiments of this specification are implemented as follows.

An embodiment of this specification provides a method for invoking a speech synthesis file, the method including: detecting whether the client has the speech synthesis file required by a registered APP, the registered APP being an APP that has registered in advance as needing to use speech synthesis files; if it is detected that the client does not have the speech synthesis file, downloading the speech synthesis file from the server corresponding to the registered APP according to a pre-stored voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the speech synthesis file; and if it is detected that the client has the speech synthesis file, invoking the client's speech synthesis file so that the registered APP can perform voice playback according to it.

Optionally, before detecting whether the client has the speech synthesis file required by the registered APP, the method further includes: pulling the voice configuration file from the server corresponding to the registered APP; receiving the voice configuration file issued by that server, the issued voice configuration file including first verification information that the server assigns to the registered APP after encrypting the issued voice configuration file; determining whether the first verification information matches second verification information pre-stored by the client; and, when the first verification information matches the second verification information, verifying that the issued voice configuration file is correct.
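The detection, download and invocation steps of the method above amount to a cache-first lookup. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name `get_speech_file`, the representation of the voice configuration file as a dictionary from file name to download address, and the injected `download` callable are all hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def get_speech_file(
    name: str,
    cache_dir: Path,
    voice_config: Dict[str, str],
    download: Callable[[str], bytes],
) -> Path:
    """Return a local path to the speech synthesis file, downloading only on a miss."""
    local_path = cache_dir / name
    if local_path.exists():
        # The client already has the file: invoke the cached copy.
        return local_path
    # The client does not have the file: read its download address from the
    # (hypothetical) voice configuration mapping and fetch it from the server.
    url = voice_config[name]
    local_path.write_bytes(download(url))
    return local_path


# Usage with a fake downloader that records every call it receives.
calls = []


def fake_download(url: str) -> bytes:
    calls.append(url)
    return b"fake-audio-bytes"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    config = {"payment_received.wav": "https://example.invalid/tts/payment_received.wav"}
    first = get_speech_file("payment_received.wav", cache, config, fake_download)
    second = get_speech_file("payment_received.wav", cache, config, fake_download)
    download_count = len(calls)  # only the first request reaches the "server"
    same_path = first == second
```

Passing the downloader in as a parameter keeps the sketch free of any network dependency; a real client would substitute its own transport.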
Optionally, determining whether the first verification information matches the second verification information pre-stored by the client specifically includes: obtaining, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the client's secure execution environment; and determining whether the first verification information matches the second verification information.

Optionally, before pulling the voice configuration file from the server corresponding to the registered APP, the method further includes: sending, to the server corresponding to the registered APP, voice data provided by the APP developer that reflects the APP developer's characteristics, so that the server trains a voice model customized for the APP developer from its built-in basic voice training model, and inputs pre-stored text into the customized voice model to generate the speech synthesis files required by the registered APP. The basic voice training model is a model, shareable among registered APPs, that is trained on a number of pre-provided voice samples according to the registered APPs' voice playback needs.

Optionally, before the registered APP performs voice playback according to the speech synthesis file, the method further includes: calculating a first digest value of the speech synthesis file; determining whether a second digest value for the speech synthesis file, pre-stored in the voice configuration file, is the same as the first digest value; and, if the second digest value is the same as the first digest value, having the registered APP perform voice playback according to the speech synthesis file.
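The optional digest check above (comparing a first digest, or "summary", value computed on the client with a second one stored in the voice configuration file) can be sketched with Python's standard `hashlib`. This is an illustrative sketch only: the function names and the `digest` key of the configuration entry are hypothetical, and the choice of MD5 as the default follows the description, which also names SHA256 as an alternative.

```python
import hashlib
import hmac


def digest_of(data: bytes, algorithm: str = "md5") -> str:
    """Compute the first digest value of a downloaded speech synthesis file."""
    return hashlib.new(algorithm, data).hexdigest()


def may_play(data: bytes, expected_digest: str, algorithm: str = "md5") -> bool:
    """Allow playback only if the computed digest equals the pre-stored one."""
    first_digest = digest_of(data, algorithm)
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(first_digest, expected_digest.lower())


# Usage: the second digest value would come from the voice configuration file.
audio = b"example synthesized audio bytes"
config_entry = {"digest": hashlib.md5(audio).hexdigest()}  # hypothetical layout

ok = may_play(audio, config_entry["digest"])
tampered = may_play(audio + b"x", config_entry["digest"])
```

If the two values differ, the description has the client return to the download step rather than play the file.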
Optionally, the registered APP performing voice playback according to the speech synthesis file specifically includes: the server corresponding to the registered APP encrypts the speech synthesis file according to a preset rule; and the encrypted speech synthesis file is decrypted by a built-in decryption module and then played by the registered APP.

An embodiment of this specification provides a device for invoking a speech synthesis file, the device including: a detection unit, configured to detect whether the client has the speech synthesis file required by a registered APP, the registered APP being an APP that has registered in advance as needing to use speech synthesis files; a download unit, configured to, if it is detected that the client does not have the speech synthesis file, download the speech synthesis file from the server corresponding to the registered APP according to a pre-stored voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the speech synthesis file; and an invoking unit, configured to, if it is detected that the client has the speech synthesis file, invoke the client's speech synthesis file so that the registered APP can perform voice playback according to it.
Optionally, the device further includes: a pulling unit, configured to pull the voice configuration file from the server corresponding to the registered APP; a receiving unit, configured to receive the voice configuration file issued by that server, the issued voice configuration file including first verification information that the server assigns to the registered APP after encrypting the issued voice configuration file; a determining unit, configured to determine whether the first verification information matches second verification information pre-stored by the client; and a verification unit, configured to verify that the issued voice configuration file is correct when the first verification information is determined to match the second verification information.

Optionally, the determining unit is specifically configured to: obtain, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the client's secure execution environment; and determine whether the first verification information matches the second verification information.

Optionally, the device further includes a training unit, configured to send, to the server corresponding to the registered APP, voice data provided by the APP developer that reflects the APP developer's characteristics, so that the server trains a voice model customized for the APP developer from its built-in basic voice training model and generates, from pre-stored text and the customized voice model, the speech synthesis files corresponding to the registered APP. The basic voice training model is a model, shareable among registered APPs, that is trained on a number of pre-provided voice samples according to the registered APPs' voice playback needs.
Optionally, the device further includes a calculation unit, configured to calculate a first digest value of the speech synthesis file. The determining unit is further configured to determine whether a second digest value for the speech synthesis file, pre-stored in the voice configuration file, is the same as the first digest value; if the determining unit determines that the two digest values are the same, the registered APP performs voice playback according to the speech synthesis file.

Optionally, the registered APP performing voice playback according to the speech synthesis file specifically includes: the server corresponding to the registered APP encrypts the speech synthesis file according to a preset rule; and the encrypted speech synthesis file is decrypted by a built-in decryption module and then played by the registered APP.

An embodiment of this specification provides a voice system that includes a terminal and a server.
The terminal includes a voice SDK running on the terminal, a registered APP, and an APP developer end.

The APP developer end is configured to send, to the server corresponding to the registered APP, voice data provided by the APP developer that reflects the APP developer's characteristics.

The server is configured to train a voice model customized for the APP developer from a built-in basic voice training model, and to input pre-stored text into the customized voice model to generate the speech synthesis files required by the registered APP. The basic voice training model is a model, shareable among registered APPs, that is trained on a number of pre-provided voice samples according to the registered APPs' voice playback needs.

The voice SDK is configured to: pull the voice configuration file from the server corresponding to the registered APP; receive the voice configuration file issued by that server, the issued voice configuration file including first verification information that the server assigns to the registered APP after encrypting the issued voice configuration file; determine whether the first verification information matches second verification information pre-stored by the client; when they match, verify that the issued voice configuration file is correct; detect whether the client has the speech synthesis file required by the registered APP, the registered APP being an APP that has registered in advance as needing to use speech synthesis files; if it is detected that the client does not have the speech synthesis file, download it from the server corresponding to the registered APP according to the voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the speech synthesis file; and if it is detected that the client has the speech synthesis file, invoke the client's speech synthesis file so that the registered APP can perform voice playback according to it.

An embodiment of this specification provides a computer-readable medium storing computer-readable instructions that can be executed by a processor to perform the following steps: detecting whether the client has the speech synthesis file required by a registered APP, the registered APP being an APP that has registered in advance as needing to use speech synthesis files; if it is detected that the client does not have the speech synthesis file, downloading it from the server corresponding to the registered APP according to a pre-stored voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the speech synthesis file; and if it is detected that the client has the speech synthesis file, invoking the client's speech synthesis file so that the registered APP can perform voice playback according to it.
An embodiment of this specification provides equipment for invoking a speech synthesis file. The equipment includes a memory for storing computer program instructions and a processor for executing the program instructions; when the computer program instructions are executed by the processor, the equipment is triggered to implement the following: a detection unit, configured to detect whether the client has the speech synthesis file required by a registered APP, the registered APP being an APP that has registered in advance as needing to use speech synthesis files; a download unit, configured to, if it is detected that the client does not have the speech synthesis file, download it from the server corresponding to the registered APP according to a pre-stored voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the speech synthesis file; and an invoking unit, configured to, if it is detected that the client has the speech synthesis file, invoke the client's speech synthesis file so that the registered APP can perform voice playback according to it.

At least one of the above technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects:

1. When a registered APP needs to use a speech synthesis file, the client is checked for that file, and when the file exists on the client, the copy cached by the client is invoked in preference, which reduces the response time of the entire voice system.

2. Through the server corresponding to the registered APP, the APP developer can train a voice model customized for the APP developer and then input pre-stored text into that model to generate the speech synthesis files the APP developer needs; when the registered APP needs one of those files, the corresponding speech synthesis file is downloaded for the registered APP to play.

3. The voice system can support multiple registered APPs, so that the voice system is fully utilized.

To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of this specification, without inventive effort, shall fall within the scope of protection of this application.

FIG. 1 is a schematic flowchart of a method for invoking a speech synthesis file according to an embodiment of this specification. The flow includes the following steps.

Step S101: detect whether the client has the speech synthesis file required by the registered APP; if it does, execute step S102; if it does not, execute step S103.

In step S101, the detection of whether the client has the speech synthesis file required by a registered APP may be performed by the voice SDK.
The voice SDK provides an interface through which multiple APPs can connect at the same time; an APP registering with the voice SDK means connecting the APP's data to the voice SDK. A registered APP is an APP that has registered with the voice SDK in advance and needs to use speech synthesis files. In this embodiment, the voice SDK is the framework that APP developers use when developing software.

In step S101, the speech synthesis file is produced by the server corresponding to the registered APP according to the APP developer's requirements. First, the APP developer sends to that server voice data that reflects the APP developer's characteristics, so that the server trains a voice model customized for the APP developer from its built-in basic voice training model and inputs pre-stored text into the customized voice model to generate the speech synthesis files required by the registered APP. The basic voice training model is a model, shareable among registered APPs, that is trained on a number of pre-provided voice samples according to the registered APPs' voice playback needs. These voice samples are high-quality voice data stored on the server corresponding to the registered APP.

Further, in step S101, the sampling time of the high-quality voice data used for the basic voice training model is determined by the accuracy required of the entire voice system: when high accuracy is required, the sampling time may be 300 hours; when the required accuracy is not high, a sampling time of 100 hours is chosen.
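The registration relationship described above, in which several APPs connect to one voice SDK and only registered APPs have their speech files looked up, can be illustrated with a small registry. This is a toy sketch: the class `VoiceSDK`, its method names, and the APP identifiers are hypothetical, and the source does not specify how registration is actually implemented.

```python
from typing import Iterable, Set


class VoiceSDK:
    """Toy stand-in for the voice SDK's registration interface."""

    def __init__(self) -> None:
        self._registered: Set[str] = set()

    def register(self, app_id: str) -> None:
        # Registering connects the APP to the SDK; many APPs may register.
        self._registered.add(app_id)

    def is_registered(self, app_id: str) -> bool:
        return app_id in self._registered

    def has_speech_file(self, app_id: str, name: str, cached: Iterable[str]) -> bool:
        """Step S101 for a registered APP: is the file already on the client?"""
        if not self.is_registered(app_id):
            raise PermissionError(f"{app_id!r} is not registered with the voice SDK")
        return name in set(cached)


sdk = VoiceSDK()
sdk.register("isv.wallet")
sdk.register("isv.maps")
hit = sdk.has_speech_file("isv.wallet", "welcome.wav", ["welcome.wav"])
```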
In step S101 of the embodiment of this specification, after the server corresponding to the registered APP has trained the basic voice training model, the APP developer uploads voice data reflecting the characteristics of the APP developer to that server, and the voice model customized for the APP developer is trained on top of the basic voice training model. The voice data reflecting the characteristics of the APP developer is voice data recorded according to the language environment the APP developer requires. The APP developer only needs to upload a small amount of voice data to the server corresponding to the registered APP. The basic voice training model can be understood as an intermediate model, trained on a large data set, that the server provides for the APP developer; this intermediate model is then tuned with the voice data uploaded by the APP developer to produce a customized voice model that reflects the characteristics of the APP developer.
In step S101 of the embodiment of this specification, the voice data uploaded by the APP developer needs to be reviewed. After a customized voice model reflecting the characteristics of the APP developer is generated, the voice system administrator reviews it. Under one review mechanism, the customized voice model can be used normally only after it passes the review; that is, even if the customized voice model has been generated, it cannot be used normally until the reviewer has approved it. Under another review mechanism, the registered APP can use the customized voice model normally regardless of whether the review has been passed, but once the reviewer finds the customized voice model to be unqualified, the model becomes invalid.
In step S101 of the embodiment of this specification, if the APP developer does not adopt this solution and instead meets the customization requirement in a traditional way, there are two options. In the first, the APP developer directly uploads the voice data reflecting the characteristics of the APP developer without any processing, which results in low robustness. In the second, the APP developer produces the customized voice model separately, which takes a long time and cannot guarantee the quality of the customized voice model.
In step S101 of the embodiment of this specification, the voice system can also be applied to a video system; in that case, a basic video training model is stored in the server corresponding to the registered APP.
Step S102: Call the speech synthesis file of the client.
In step S102 of the embodiment of this specification, when the APP has registered with the voice SDK and needs to use a speech synthesis file, the voice SDK first detects whether the file exists on the client. When the required file exists on the client, the speech synthesis file stored on the client is called, and the registered APP performs voice playback according to the speech synthesis file.
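Steps S101 and S102 amount to a cache-first lookup: use the client's copy when it exists and fall back to a download otherwise. A minimal Python sketch follows, assuming the client keeps speech synthesis files in a local directory. The names `get_speech_file`, `cache_dir`, and `download` are hypothetical and are not defined by this specification, and the download itself is left to a caller-supplied function.

```python
from pathlib import Path
from typing import Callable, Tuple


def get_speech_file(
    cache_dir: Path,
    file_name: str,
    download: Callable[[str], bytes],
) -> Tuple[Path, bool]:
    """Return (path, was_cached) for the requested speech synthesis file.

    cache_dir and download are illustrative assumptions: a local directory
    standing in for the client's storage, and a function that fetches the
    file contents from the server corresponding to the registered APP.
    """
    path = cache_dir / file_name
    if path.is_file():
        # The file already exists on the client: call it directly.
        return path, True
    # The file does not exist on the client: download it, then store it
    # so that later requests are served from the client.
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(download(file_name))
    return path, False
```

With this shape, only the first request for a given file reaches the server; later requests are answered from the client's copy, which is the response-time saving the method aims for.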
Step S103: Download the speech synthesis file from the server corresponding to the registered APP according to the pre-stored voice configuration file corresponding to the registered APP.
In step S103 of the embodiment of this specification, the speech synthesis file is generated from the pre-stored text and the voice model customized by the APP developer. If the detection finds that the speech synthesis file does not exist on the client, the registered APP has never downloaded that speech synthesis file before.
In step S103 of the embodiment of this specification, the voice configuration file contains the download address of the speech synthesis file. The registered APP downloads the required speech synthesis file from that address and then performs voice playback according to it.
In step S103 of the embodiment of this specification, the registered APP needs to verify the speech synthesis file before performing voice playback according to it. The specific steps may be as follows.
Step 1: Calculate the first digest value corresponding to the speech synthesis file.
In step 1 of the embodiment of this specification, the first digest value is used to check whether the downloaded speech synthesis file contains errors or has been tampered with. In this embodiment, an MD5 digest can be used. MD5 is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, which makes it possible to check whether a downloaded file contains errors or has been tampered with. For example, many software packages distributed for Unix are accompanied by a file with the same name and the extension .md5.
In this file, there is usually only one line of text, with the following general structure:
MD5(tanajiya.tar.gz)= 38b8c2c1093dd0fec383a9d9ac940515
This is the digest, or "digital fingerprint", of the tanajiya.tar.gz file. MD5 treats the entire file as one large piece of text and generates the MD5 message digest through an irreversible transformation. In layman's terms, every person has a unique fingerprint, which is often the most reliable means for the judiciary to identify criminals; similarly, MD5 can generate a "digital fingerprint" for any file, regardless of its size or format, and if anyone modifies the file in any way, its MD5 value changes. The purpose of the MD5 value published on a download site is that, after downloading the file, the user can run an MD5 check on it with dedicated software (such as Windows MD5 Check) to confirm that the file obtained is the same as the one the site provides. For example, the download server publishes the MD5 value of a file in advance; after downloading, the user recalculates the MD5 value of the downloaded file and compares the two values to judge whether the file contains errors or has been tampered with.
In step 1 of the embodiment of this specification, the first digest value is calculated to check whether the downloaded speech synthesis file contains errors or has been tampered with, so that errors in the speech synthesis file are detected in real time.
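The check described above can be sketched with Python's standard `hashlib` module. This is only an illustration under stated assumptions: the function names `file_digest` and `verify_download` and the sample data are hypothetical, and the expected digest is assumed to come from the voice configuration file.

```python
import hashlib
import hmac


def file_digest(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest of the downloaded file contents."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_download(data: bytes, expected_hex: str, algorithm: str = "md5") -> bool:
    """Compare the recomputed digest with the pre-stored one.

    expected_hex stands in for the digest value that the voice
    configuration file is assumed to carry for this file.
    """
    actual = file_digest(data, algorithm)
    return hmac.compare_digest(actual, expected_hex.lower())


if __name__ == "__main__":
    payload = b"hello"
    stored = "5d41402abc4b2a76b9719d911017c592"  # MD5 of b"hello"
    print(verify_download(payload, stored))         # unchanged file
    print(verify_download(payload + b"!", stored))  # modified file
```

Passing `algorithm="sha256"` switches the same check to a SHA256 digest without changing the calling code.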
If the content contains an error, the error is reported directly, which prevents it from spreading in the application. The speech synthesis file can also be checked with a SHA256 digest instead.
Step 2: Determine whether the second digest value corresponding to the speech synthesis file, which is pre-stored in the voice configuration file, is the same as the first digest value. If they are the same, perform step 3; if they are not, return to step S103.
Step 3: The registered APP performs voice playback according to the speech synthesis file.
In step 3 of the embodiment of this specification, the server corresponding to the registered APP can encrypt the speech synthesis file with its built-in private key. Before an encrypted speech synthesis file is played, it must be decrypted with the public key stored in the decryption module.
In step S103 of the embodiment of this specification, a general voice database is configured in the basic voice training model, and the general voice database includes voice announcements of transaction amounts and times. The customized voice model can therefore convert text directly into a speech synthesis file that announces a transaction amount or a time, rather than simply reading out the digits. For example, when the text reads "5:00", the speech synthesis file plays it as the time 5 o'clock.
In the above steps, when the registered APP needs to use a speech synthesis file, it is checked whether the file is stored on the client. When it is, the speech synthesis file cached on the client is called first, which reduces the response time of the entire voice system.
Further, so that the voice system can be applied in a secure environment, the foregoing embodiment is modified as follows. FIG. 2 is a schematic flowchart of a method for invoking a speech synthesis file provided by an embodiment of this specification.
The schematic flowchart includes:
Step S201: Pull the voice configuration file from the server corresponding to the registered APP.
In step S201 of the embodiment of this specification, the customized voice model corresponding to the registered APP converts the pre-stored text into speech synthesis files, and the voice configuration file corresponding to the registered APP includes a voice list of the speech synthesis files.
Step S202: Receive the voice configuration file issued by the server corresponding to the registered APP. The issued voice configuration file includes first verification information that the server assigns to the registered APP after encrypting the issued voice configuration file.
In step S202 of the embodiment of this specification, the APP developed by the developer registers with the voice SDK, and the voice SDK is connected to a decryption module. The public key used for decryption can be issued for the decryption module through TSM. This public key is the unique public key corresponding to the registered APP; the server corresponding to the registered APP is configured with the corresponding private key and uses it to encrypt the issued voice configuration file. The public key and the private key form a key pair: the public key is the public part of the pair, and the private key is the non-public part. The key pair can be guaranteed to be unique. When the key pair is used, data encrypted with one key must be decrypted with the other. For example, data encrypted with the public key must be decrypted with the private key, and data encrypted with the private key must be decrypted with the public key; otherwise decryption fails.
Further, in step S202 of the embodiment of this specification, the decryption module may be an SE module.
The SE module is a module that ensures system security. It uses a security chip and a chip operating system (COS) to provide functions such as secure data storage and encryption and decryption operations. The main functions of the SE module in a security system are secure key storage, data encryption operations, and secure information storage. Secure key storage makes it possible to establish a relatively complete key management system and ensures that keys cannot be read out. Data encryption operations include support for reliable security algorithms, ciphertext transmission of sensitive data, and tamper protection for transmitted data. Secure information storage refers to a strict file access permission mechanism and reliable authentication algorithms and procedures. In this embodiment, the public key is placed in the SE module. SE modules come in various forms, such as smart cards and embedded security modules (eSE). In this embodiment, an eSE can be embedded for the voice SDK of the voice system; it uses a smart security chip that meets the CC EAL5+ security level and has a built-in secure operating system, which satisfies the terminal's needs for secure key storage and data encryption services. The voice system can be widely used in finance, map navigation, urban transportation, medical care, retail, and other fields, and the security of the system is protected during use.
Step S203: Determine whether the first verification information matches the second verification information pre-stored by the client. If so, execute step S204; if not, end the process.
In step S203 of the embodiment of this specification, the second verification information corresponding to the registered APP, which is pre-stored in the client's secure operating environment, is obtained according to the identifier of the registered APP, and it is then determined whether the first verification information matches the second verification information.
Here, the identifier of the registered APP is the identity information of the registered APP.
Step S204: Verify that the issued voice configuration file is correct.
Step S205: Detect whether the client has the speech synthesis file that the registered APP needs to use. If it exists, execute step S206; if it does not, execute step S207.
Step S205 in the embodiment of this specification is the same as step S101 above and is not repeated here.
Step S206: Call the speech synthesis file of the client.
Step S206 in the embodiment of this specification is the same as step S102 above and is not repeated here.
Step S207: Download the speech synthesis file from the server corresponding to the registered APP according to the voice configuration file corresponding to the registered APP.
Step S207 in the embodiment of this specification is the same as step S103 above and is not repeated here.
Further, the voice system in this embodiment also faces a synchronization problem between the server and the registered APP. To solve it, the server can support active push: when a speech synthesis file used by the client changes, the server actively pushes the change to the client.
FIG. 3 is a schematic structural diagram of a device for invoking a speech synthesis file provided by an embodiment of this specification. The device includes: a detection unit 1, a calling unit 2, a downloading unit 3, a pulling unit 4, a receiving unit 5, a judging unit 6, a verification unit 7, a training unit 8, and a calculation unit 9.
The detection unit 1 is used to detect whether the client has the speech synthesis file that a registered APP needs to use. The registered APP is an APP registered in advance that needs to use the speech synthesis file.
The calling unit 2 is used to call the speech synthesis file of the client if it is detected that the speech synthesis file exists on the client, so that the registered APP can perform voice playback according to the speech synthesis file.
The downloading unit 3 is used to download the speech synthesis file from the server corresponding to the registered APP, according to the pre-stored voice configuration file corresponding to the registered APP, if it is detected that the speech synthesis file does not exist on the client. The voice configuration file contains the download address of the speech synthesis file.
The pulling unit 4 is used to pull the voice configuration file from the server corresponding to the registered APP.
The receiving unit 5 is used to receive the voice configuration file issued by the server corresponding to the registered APP. The issued voice configuration file includes first verification information that the server assigns to the registered APP after encrypting the issued voice configuration file.
The judging unit 6 is used to determine whether the first verification information matches the second verification information pre-stored by the client.
The verification unit 7 is used to verify that the issued voice configuration file is correct when it is determined that the first verification information matches the second verification information pre-stored by the client.
The judging unit 6 is specifically used to obtain, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the client's secure operating environment, and to determine whether the first verification information matches the second verification information.
The training unit 8 is used to send the voice data that is provided by the APP developer and reflects the characteristics of the APP developer to the server corresponding to the registered APP, so that the server trains, through the built-in basic voice training model, the voice model customized by the APP developer, and generates the speech synthesis file corresponding to the registered APP from the pre-stored text and that customized voice model. The basic voice training model is a model that is trained on several pre-provided voice samples according to the voice playback needs of registered APPs and can be shared by registered APPs.
The calculation unit 9 is used to calculate the first digest value corresponding to the speech synthesis file.
The judging unit 6 is also used to determine whether the second digest value corresponding to the speech synthesis file, which is pre-stored in the voice configuration file, is the same as the first digest value. If the judging unit 6 determines that the second digest value is the same as the first digest value, the registered APP performs voice playback according to the speech synthesis file.
The registered APP performing voice playback according to the speech synthesis file specifically includes: the server corresponding to the registered APP encrypts the speech synthesis file according to preset rules; after the encrypted speech synthesis file is decrypted by the built-in decryption module, the registered APP performs voice playback.
The embodiment of this specification also provides a computer-readable medium on which computer-readable instructions are stored. The computer-readable instructions can be executed by a processor to perform the following steps:
Detect whether the client has the speech synthesis file required by a registered APP.
The registered APP is an APP registered in advance that needs to use the speech synthesis file;
If it is detected that the speech synthesis file does not exist on the client, download the speech synthesis file from the server corresponding to the registered APP according to the pre-stored voice configuration file corresponding to the registered APP, where the voice configuration file contains the download address of the speech synthesis file;
If it is detected that the speech synthesis file exists on the client, call the speech synthesis file of the client so that the registered APP performs voice playback according to the speech synthesis file.
The embodiment of this specification also provides a device for invoking a speech synthesis file. The device includes a memory for storing computer program instructions and a processor for executing the program instructions. When the computer program instructions are executed by the processor, the device is triggered to implement the following:
A detection unit, used to detect whether the client has the speech synthesis file required by a registered APP, where the registered APP is an APP registered in advance that needs to use the speech synthesis file;
A downloading unit, used to download the speech synthesis file from the server corresponding to the registered APP, according to the pre-stored voice configuration file corresponding to the registered APP, if it is detected that the speech synthesis file does not exist on the client, where the voice configuration file contains the download address of the speech synthesis file;
A calling unit, used to call the speech synthesis file of the client if it is detected that the speech synthesis file exists on the client, so that the registered APP can perform voice playback according to the speech synthesis file.
A voice system provided by an embodiment of this specification includes a terminal and a server.
The terminal includes a voice SDK running in the terminal, a registered APP, and an APP developer terminal;
The APP developer terminal is used to send the voice data that is provided by the APP developer and reflects the characteristics of the APP developer to the server corresponding to the registered APP;
The server is used to train the voice model customized by the APP developer through the built-in basic voice training model, and to input the pre-stored text into the voice model customized by the APP developer to generate the speech synthesis file required by the registered APP. The basic voice training model is a model that is trained on several pre-provided voice samples according to the voice playback needs of registered APPs and can be shared by registered APPs;
The voice SDK is used to pull the voice configuration file from the server corresponding to the registered APP; to receive the voice configuration file issued by the server corresponding to the registered APP, where the issued voice configuration file includes first verification information that the server assigns to the registered APP after encrypting the issued voice configuration file; to determine whether the first verification information matches the second verification information pre-stored by the client; when it is determined that the first verification information matches the second verification information pre-stored by the client, to verify that the issued voice configuration file is correct; to detect whether the client has the speech synthesis file required by the registered APP, where the registered APP is an APP registered in advance that needs to use the speech synthesis file; and, if it is detected that the speech synthesis file does not exist on the client, to download the speech synthesis file from the server corresponding to the registered APP according to the voice configuration file corresponding to the registered APP.
The voice configuration file contains the download address of the speech synthesis file; if it is detected that the speech synthesis file exists on the client, the voice SDK calls the speech synthesis file of the client so that the registered APP performs voice playback according to the speech synthesis file.
Those skilled in the art should understand that the embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, and optical memory) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, equipment (systems), and computer program products according to embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product that includes the instruction device, The instruction device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram. These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, which can be executed on the computer or other programmable equipment. The instructions provide steps for implementing functions specified in one flow or multiple flows in the flowchart and/or one block or multiple blocks in the block diagram. In a typical configuration, the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory. Memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory ( flash RAM). Memory is an example of computer-readable media. Computer-readable media includes permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. 
Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible to computing devices. As defined in this document, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or piece of equipment that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, commodity, or piece of equipment. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity, or equipment that includes the element.
The above are only examples of this specification and are not intended to limit it. For those skilled in the art, this specification can have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of this specification shall fall within the scope of the patent application of this specification.

1: Detection unit
2: Calling unit
3: Downloading unit
4: Pulling unit
5: Receiving unit
6: Judging unit
7: Verification unit
8: Training unit
9: Calculation unit
S101, S102, S103: Steps
S201, S202, S203, S204, S205, S206, S207: Steps

In order to describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some of the embodiments recorded in this specification; those of ordinary skill in the art can obtain other drawings from them without creative effort.
[Figure 1] is a schematic flowchart of the method for invoking a speech synthesis file provided in the first embodiment of this specification;
[Figure 2] is a schematic flowchart of the method for invoking a speech synthesis file provided in the second embodiment of this specification;
[Figure 3] is a schematic structural diagram of the device for invoking a speech synthesis file provided in the third embodiment of this specification;
[Figure 4] is a schematic structural diagram of the voice system provided in the fourth embodiment of this specification.

Claims

A method for invoking a speech synthesis file, characterized in that the method includes: detecting whether a client has a speech synthesis file that a registered APP needs to use, the registered APP being an APP registered in advance that needs to use the speech synthesis file; if it is detected that the speech synthesis file does not exist on the client, downloading the speech synthesis file from the server corresponding to the registered APP according to a pre-stored voice configuration file corresponding to the registered APP, the voice configuration file containing the download address of the speech synthesis file; and if it is detected that the speech synthesis file exists on the client, calling the speech synthesis file of the client so that the registered APP performs voice playback according to the speech synthesis file.

The method for invoking a speech synthesis file according to item 1 of the scope of patent application, wherein, before the detecting whether the client has a speech synthesis file that a registered APP needs to use, the method further includes: pulling the voice configuration file from the server corresponding to the registered APP; receiving the voice configuration file issued by the server corresponding to the registered APP, the issued voice configuration file including first verification information that the server corresponding to the registered APP assigns to the registered APP after encrypting the issued voice configuration file; determining whether the first verification information matches second verification information pre-stored by the client; and when it is determined that the first verification information matches the second verification information pre-stored by the client, verifying that the issued voice configuration file is correct.

The method for invoking a speech synthesis file according to item 2 of the scope of patent application, wherein the determining whether the first verification information matches the second verification information pre-stored by the client specifically includes: obtaining, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the client's secure operating environment; and determining whether the first verification information matches the second verification information.

The method for invoking a speech synthesis file according to item 2 of the scope of patent application, wherein, before the pulling the voice configuration file from the server corresponding to the registered APP, the method further includes: sending voice data that is provided by the APP developer and reflects the characteristics of the APP developer to the server corresponding to the registered APP, so that the server corresponding to the registered APP trains, through a built-in basic voice training model, the voice model customized by the APP developer, and inputs pre-stored text into the voice model customized by the APP developer to generate the speech synthesis file required by the registered APP, the basic voice training model being a model that is trained on several pre-provided voice samples according to the voice playback needs of registered APPs and can be shared by registered APPs.

The method for invoking a speech synthesis file according to item 1 of the scope of patent application, wherein, before the registered APP performs voice playback according to the speech synthesis file, the method further includes: calculating a first digest value corresponding to the speech synthesis file; determining whether a second digest value corresponding to the speech synthesis file, pre-stored in the voice configuration file, is the same as the first digest value; and if it is determined that the second digest value is the same as the first digest value, the registered APP performing voice playback according to the speech synthesis file.

The method for invoking a speech synthesis file according to item 1 of the scope of patent application, wherein the registered APP performing voice playback according to the speech synthesis file specifically includes: the server corresponding to the registered APP encrypts the speech synthesis file according to preset rules; and after the encrypted speech synthesis file is decrypted by a built-in decryption module, the registered APP performs voice playback.

A device for invoking speech synthesis files, characterized in that the device comprises: a detection unit, configured to detect whether the client has a speech synthesis file that needs to be used by a registered APP, the registered APP being an APP that is registered in advance and needs to use the speech synthesis file; a downloading unit, configured to, if it is detected that the speech synthesis file does not exist in the client, download the speech synthesis file from the server corresponding to the registered APP according to the pre-stored voice configuration file corresponding to the registered APP, the voice configuration file having the download address of the speech synthesis file built in; and a calling unit, configured to, if it is detected that the speech synthesis file exists in the client, call the speech synthesis file of the client, so that the registered APP can perform voice playback according to the speech synthesis file.
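The detection, downloading, and calling units described above amount to a cache-first lookup, which can be sketched as follows. The function name, the `download_url` key, the `.audio` file naming, and the injected `download` callable are all hypothetical; the claim only states that the voice configuration file carries the download address. Returning the freshly downloaded file to the caller is also an assumption of this sketch.

```python
from pathlib import Path
from typing import Callable


def invoke_speech_file(app_id: str,
                       voice_config: dict,
                       cache_dir: Path,
                       download: Callable[[str], bytes]) -> bytes:
    """Return the speech synthesis file for a registered APP, preferring
    the copy already cached on the client."""
    local_file = cache_dir / f"{app_id}.audio"  # hypothetical naming scheme

    # Detection unit: does the client already hold the file?
    if not local_file.exists():
        # Downloading unit: fetch from the address in the voice configuration file.
        local_file.write_bytes(download(voice_config["download_url"]))

    # Calling unit: hand the client-side copy to the registered APP.
    return local_file.read_bytes()
```

Passing the downloader in as a parameter keeps the sketch independent of any particular HTTP client; on the second and later calls for the same APP the server is not contacted at all, which is the source of the shorter response time the specification describes.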

The device for invoking a speech synthesis file according to item 7 of the scope of patent application, wherein the device further includes: a pulling unit, configured to pull the voice configuration file from the server corresponding to the registered APP; a receiving unit, configured to receive the voice configuration file issued by the server corresponding to the registered APP, the issued voice configuration file including first verification information that the server corresponding to the registered APP assigns to the registered APP after encrypting the issued voice configuration file; a determining unit, configured to determine whether the first verification information matches second verification information pre-stored by the client; and a verification unit, configured to verify that the issued voice configuration file is correct when it is determined that the first verification information matches the second verification information pre-stored by the client.

The device for invoking a speech synthesis file according to item 8 of the scope of patent application, wherein the determining unit is specifically configured to: look up, according to the identifier of the registered APP, the second verification information corresponding to the registered APP that is pre-stored in the client's secure operating environment; and determine whether the first verification information matches the second verification information.
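The lookup-and-match behavior of the determining unit can be sketched as below. The dictionary standing in for the client's secure operating environment, the function name, and the reading of "matches" as exact equality are all assumptions made for illustration; the claim does not specify the storage mechanism or the matching rule.

```python
import hmac


def config_is_verified(app_id: str,
                       first_verification: str,
                       secure_store: dict) -> bool:
    """Check the first verification information received with the issued
    voice configuration file against the second verification information
    pre-stored on the client for this registered APP."""
    # Look up the second verification information by the APP identifier.
    second_verification = secure_store.get(app_id)
    if second_verification is None:
        # No pre-stored entry: the APP is treated as unverified in this sketch.
        return False
    # "Matches" is interpreted here as byte-for-byte equality (assumption).
    return hmac.compare_digest(first_verification.encode("utf-8"),
                               second_verification.encode("utf-8"))
```

For example, with `secure_store = {"app-001": "token-abc"}`, a configuration file carrying `"token-abc"` for `"app-001"` would be accepted, while any other value or an unknown APP identifier would be rejected.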

The device for invoking a speech synthesis file according to item 8 of the scope of patent application, wherein the device further includes: a training unit, configured to send voice data that is provided by the APP developer and reflects the characteristics of the APP developer to the server corresponding to the registered APP, so that the server corresponding to the registered APP trains the voice model customized by the APP developer through the built-in voice basic training model, and generates the speech synthesis file corresponding to the registered APP by inputting pre-stored text into the voice model customized by the APP developer, wherein the voice basic training model is a model that is trained with a number of pre-provided voice samples according to the voice playback needs of the registered APP and can be shared by registered APPs.

The device for invoking a speech synthesis file according to item 7 of the scope of patent application, wherein the device further includes: A calculation unit for calculating the first summary value corresponding to the speech synthesis file; The determining unit is further configured to determine whether the second summary value corresponding to the speech synthesis file pre-stored in the speech configuration file is the same as the first summary value; If the determining unit determines that the second summary value is the same as the first summary value, the registered APP performs voice playback according to the speech synthesis file.

The device for invoking a speech synthesis file according to item 7 of the scope of patent application, wherein the registered APP performing voice playback according to the speech synthesis file specifically includes: the server corresponding to the registered APP encrypts the speech synthesis file according to a preset rule; and after the encrypted speech synthesis file is decrypted by a built-in decryption module, the registered APP performs voice playback.

A voice system, characterized in that it includes a terminal and a server, and the terminal includes a voice SDK running in the terminal, a registered APP, and an APP developer terminal; the APP developer terminal is used to send voice data that is provided by the APP developer and reflects the characteristics of the APP developer to the server corresponding to the registered APP; the server is used to train a voice model customized by the APP developer through a built-in voice basic training model, and to input pre-stored text into the voice model customized by the APP developer to generate the speech synthesis file required by the registered APP, the voice basic training model being a model that is trained with a number of pre-provided voice samples according to the voice playback needs of the registered APP and can be shared by registered APPs; the voice SDK is used to: pull the voice configuration file from the server corresponding to the registered APP; receive the voice configuration file issued by the server corresponding to the registered APP, the issued voice configuration file including first verification information that the server corresponding to the registered APP assigns to the registered APP after encrypting the issued voice configuration file; determine whether the first verification information matches second verification information pre-stored by the client; when it is determined that the first verification information matches the second verification information pre-stored by the client, verify that the issued voice configuration file is correct; detect whether the client has a speech synthesis file that needs to be used by the registered APP, the registered APP being an APP that is registered in advance and needs to use the speech synthesis file; if it is detected that the speech synthesis file does not exist in the client, download the speech synthesis file from the server corresponding to the registered APP according to the voice configuration file corresponding to the registered APP, the voice configuration file having a built-in download address of the speech synthesis file; and if it is detected that the speech synthesis file exists in the client, call the speech synthesis file of the client so that the registered APP performs voice playback according to the speech synthesis file.

A computer-readable medium having computer-readable instructions stored thereon, wherein the computer-readable instructions can be executed by a processor to implement the method described in any one of items 1 to 6 of the scope of patent application.

A device for invoking a speech synthesis file, wherein the device includes a memory for storing computer program instructions and a processor for executing the program instructions, and when the computer program instructions are executed by the processor, the device is triggered to execute the method described in any one of items 1 to 6 of the scope of patent application.