WO2017016135A1 - Voice synthesis method and system - Google Patents

Voice synthesis method and system

Info

Publication number
WO2017016135A1
WO2017016135A1 (PCT/CN2015/097162)
Authority
WO
WIPO (PCT)
Prior art keywords
sound
characteristic
library
text
acoustic
Prior art date
Application number
PCT/CN2015/097162
Other languages
French (fr)
Chinese (zh)
Inventor
李秀林
白洁
李维高
唐海员
Original Assignee
Baidu Online Network Technology (Beijing) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology (Beijing) Co., Ltd.
Publication of WO2017016135A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers

Definitions

  • the present invention relates to the field of voice processing technologies, and in particular, to a voice synthesis method and system.
  • In the prior art, when a user downloads an offline speech synthesis application (APP), the APP contains one or two sound libraries; when the user uses the APP, one sound library is selected, and the APP then uses the selected sound library to perform Text To Speech (TTS) on the text to be played.
  • In the prior art solution, on the one hand the sound library is bundled in the APP, and since sound library files are generally large the APP itself becomes large; on the other hand, the types of sound libraries included in the APP are limited, so the user has little choice.
  • the present invention aims to solve at least one of the technical problems in the related art to some extent.
  • Another object of the present invention is to provide a speech synthesis system.
  • A speech synthesis method according to an embodiment of the first aspect includes: when speech synthesis is required, querying a list of available sound libraries from a server, where the list includes information on a plurality of available sound libraries and the available sound libraries include a featured sound library; obtaining the sound library selected by the user from the list, and downloading the selected sound library from the server; and synthesizing text into speech using the downloaded sound library.
  • The speech synthesis method proposed by the embodiment of the first aspect of the present invention reduces the size of the APP by downloading the sound library from the server at synthesis time instead of bundling the library in the APP; moreover, compared with bundling sound libraries in the APP, many more sound libraries can be stored on the server.
  • A speech synthesis system according to an embodiment of the second aspect includes a client device, and the client device includes: a query module configured to query a list of available sound libraries from a server when speech synthesis is required, where the list includes information on a plurality of available sound libraries and the available sound libraries include a featured sound library; an obtaining module configured to obtain the sound library selected by the user from the list and download the selected sound library from the server; and a synthesis module configured to synthesize text into speech using the downloaded sound library.
  • The speech synthesis system proposed by the embodiment of the second aspect of the present invention likewise reduces the size of the APP by downloading the sound library from the server at synthesis time instead of bundling it in the APP; since many more sound libraries can be stored on the server, downloading from the server gives the user more choices.
  • An embodiment of the present invention further provides an electronic device, including: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform the method according to any embodiment of the first aspect of the present invention.
  • Embodiments of the present invention also provide a non-volatile computer storage medium storing one or more modules which, when executed, perform the method according to any embodiment of the first aspect of the present invention.
  • FIG. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention.
  • FIG. 2 is a schematic flow chart of a method for voice synthesis according to another embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a specific example of a speech synthesis system in an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of voice synthesis of a specific example in the embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of voice synthesis according to another specific example in the embodiment of the present invention.
  • FIG. 6 is a schematic flowchart of voice synthesis of another specific example in the embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of voice synthesis of another specific example in the embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a voice synthesis method according to an embodiment of the present invention, where the method includes:
  • the available sound library list is queried from the server, and the available sound library list includes information of a plurality of available sound banks, and the available sound library includes a featured sound library.
  • Unlike the prior art, in which the sound library is bundled directly in the APP, in this embodiment it is not necessary to include the sound library in the APP; instead it is downloaded from the server when needed.
  • For example, the software development kit (SDK) corresponding to the APP on the client sends a query request to the server requesting the list of available sound libraries; after receiving the request, the server obtains the list and sends it to the SDK.
  • the available sound banks in this embodiment include a featured sound bank. Of course, it can be understood that the available sound banks can also include existing common sound banks.
  • the featured sound library is pre-generated, a sound library for satisfying individual needs, a sound library different from the ordinary sound library, such as a children's sound library, or a user-defined sound library.
  • S12 Acquire a sound bank selected by the user according to the available sound library list, and download the user selected sound library from the server.
  • After the SDK obtains the list of available sound libraries from the server, the list can be displayed to the user.
  • When displaying it, the information of each available sound library can be shown, for example the creator of the library, the generation time, the version of the offline speech synthesis engine it suits, the domain of the text to be synthesized, whether the voice is male, female, or another characteristic voice, the sound quality, and so on, making it easier for the user to choose.
  • the user can select information for one or more available sound banks based on the information presented.
  • the SDK can determine the available sound bank selected by the corresponding user, and download the available sound bank selected by the user from the server.
  • For example, the information of an available sound library also includes link information; after the user selects a library, the corresponding sound library can be downloaded according to the link information in the selected entry.
  • S13: After the SDK downloads the sound library from the server, it can use that library to synthesize the text into speech.
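  • To make the S11-S13 flow concrete, the sketch below walks through the client side in Python. The endpoint path, the field names (engine_version, domain, voice, download_url), and the synthesize stub are illustrative assumptions, not the actual SDK interface described by the patent.

```python
import requests

SERVER = "https://tts.example.com"  # hypothetical server endpoint

def query_available_libraries(engine_version, domain=None, voice=None):
    """S11: ask the server for the list of available sound libraries."""
    conditions = {"engine_version": engine_version, "domain": domain, "voice": voice}
    resp = requests.get(f"{SERVER}/sound-libraries", params=conditions, timeout=10)
    resp.raise_for_status()
    return resp.json()  # e.g. [{"id": "...", "creator": "...", "download_url": "..."}, ...]

def download_library(library_info, dest_path):
    """S12: download the library the user picked, via its link information."""
    data = requests.get(library_info["download_url"], timeout=60).content
    with open(dest_path, "wb") as f:
        f.write(data)
    return dest_path

def synthesize(text, library_path):
    """S13: placeholder for offline synthesis with the downloaded library."""
    raise NotImplementedError("offline TTS engine call goes here")

if __name__ == "__main__":
    libraries = query_available_libraries(engine_version="3.0", voice="child")
    chosen = libraries[0]                      # in practice the user picks from a displayed list
    path = download_library(chosen, "featured_library.dat")
    # audio = synthesize("Hello", path)
```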
  • In this embodiment, downloading the sound library from the server at synthesis time, rather than bundling it in the APP, reduces the size of the APP; moreover, compared with bundling sound libraries in the APP, the server can store many more sound libraries, so the user is given more choices, and the featured sound libraries can satisfy personalized needs and improve the user experience.
  • FIG. 2 is a schematic flowchart of a method for synthesizing speech according to another embodiment of the present invention.
  • This embodiment takes providing, and letting the user select, a featured sound library as an example; the method includes:
  • S21: The server creates a featured sound library and the corresponding featured sound library information, and stores both.
  • Creating the featured sound library can include, for example: establishing a characteristic acoustic model, acquiring acoustic segments, and acquiring sound data corresponding to specific text, the featured sound library consisting of the characteristic acoustic model, the acoustic segments, and the specific text with its sound data; or
  • establishing a characteristic acoustic model and acquiring sound data corresponding to specific text, the featured sound library consisting of the characteristic acoustic model and the specific text with its sound data.
  • creating a characteristic acoustic model can include:
  • the sample size required to directly train the characteristic sound data to obtain the characteristic acoustic model is larger than the sample size required for the adaptive training of the existing acoustic model.
  • recording/collecting sound data of a certain size and specific timbre and performing artificial or automatic prosody labeling and boundary labeling, and training to obtain a characteristic acoustic model.
  • recording/collecting a small amount of sound data of a specific timbre and updating the existing acoustic model to a characteristic acoustic model through adaptive model training techniques.
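  • As a rough illustration of the two routes to a characteristic acoustic model described above, the Python sketch below chooses between full training and adaptive updating based on how much characteristic sound data is available. The threshold and the two training functions are stand-ins; the patent does not specify the training procedure.

```python
def train_from_scratch(labelled_clips):
    """Stand-in for full training on prosody/boundary-labelled characteristic sound data."""
    return {"type": "characteristic", "trained_on": len(labelled_clips)}

def adapt_existing_model(base_model, labelled_clips):
    """Stand-in for adaptive training that updates an existing acoustic model."""
    return {"type": "characteristic", "adapted_from": base_model["type"],
            "adapted_on": len(labelled_clips)}

def build_characteristic_model(labelled_clips, base_model=None, min_full_training_clips=5000):
    # Full training needs a larger sample than adaptive training, so fall back to
    # adapting an existing model when only a small amount of data is available.
    if base_model is not None and len(labelled_clips) < min_full_training_clips:
        return adapt_existing_model(base_model, labelled_clips)
    return train_from_scratch(labelled_clips)
```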
  • obtaining an acoustic segment can include:
  • the training samples are segmented to obtain an acoustic segment.
  • recording/collecting sound data of a certain size and specific timbre and performing manual or automatic rhythm annotation and boundary labeling, and segmenting the acoustic segments.
  • In some embodiments, acquiring sound data corresponding to specific text may include: selecting the specific text to be recited; obtaining a specific speaker's recitation of that text; and using the recited speech, or a compressed version of the recited speech, as the sound data corresponding to the specific text.
  • Optionally, to save space, the acquired recited speech can be compressed, and the compressed speech is used as the sound data finally stored in the featured sound library.
  • In addition, different sound data can be obtained by having different speakers recite the same or different specific texts; the multiple specific texts can then be stored in correspondence with their sound data to form a customized library.
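  • The following sketch shows one plausible way to build such a customized library: each specific text is stored with a compressed copy of its recorded recitation. zlib is used purely for illustration; the patent only says the recited speech may be stored in compressed form.

```python
import zlib

def build_custom_library(recordings):
    """recordings: iterable of (specific_text, raw_pcm_bytes) pairs from chosen speakers."""
    library = {}
    for specific_text, pcm in recordings:
        library[specific_text] = zlib.compress(pcm)   # save space in the featured library
    return library

# Example: two prompts recited by a speaker (dummy bytes stand in for real audio).
custom = build_custom_library([
    ("Turn left in 100 meters.", b"\x00\x01" * 8000),
    ("You have arrived.", b"\x00\x02" * 4000),
])
```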
  • the featured sound library information refers to relevant information generated for the featured sound library, for example, generator information, generation time, a version of a suitable offline language synthesis engine, a suitable field to which the text to be synthesized belongs, a male or female voice, or other distinctive sounds, Sound quality and so on.
  • the characteristic sound library can be stored in a storage module for storing the featured sound library (indicated by BOS cloud storage in FIG. 3).
  • the characteristic sound library information is stored in a storage module (represented by mysql cluster in FIG. 3) 33 for storing the characteristic sound library information.
  • the featured sound library information can be provided to the user as a query result in the subsequent process, and each query result can be used as one of the available sound library information in the available sound bank list.
  • S22 The SDK sends a query request to the server.
  • the SDK may send the query request when voice synthesis is required. For example, after the user opens the SDK and clicks a button for triggering voice synthesis, the SDK sends a query request to the server.
  • the query request sent by the SDK 34 can be sent to the ingress node of the server (represented by the physical room in FIG. 3) 35.
  • S23 The server obtains the query result according to the query request.
  • The query request may include query conditions, for example the speech synthesis engine version, the domain, the desired characteristic voice, and so on; after receiving the query request, the server obtains the query results that satisfy these conditions.
  • the query result can be cached. See Figure 3 for an example of storing query results to the memcached cluster 36.
  • When the server receives a query request, it can first look in the memcached cluster; if a query result satisfying the conditions is found there, it is returned directly from the cache. Otherwise the server queries the mysql cluster; when the mysql cluster contains results satisfying the conditions, they are obtained from the mysql cluster and also cached into the memcached cluster, so that subsequent requests can be served directly from the memcached cluster.
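  • A minimal cache-aside sketch of this lookup order, with plain dictionaries standing in for the memcached and mysql clusters; the key layout is an assumption.

```python
memcached = {}                     # stand-in for the memcached cluster (cache)
mysql = {                          # stand-in for the mysql cluster (source of truth)
    ("3.0", "navigation", "child"): [{"library_id": "lib-42", "quality": "high"}],
}

def lookup_query_results(engine_version, domain, voice):
    key = (engine_version, domain, voice)
    if key in memcached:                     # 1. try the cache first
        return memcached[key]
    results = mysql.get(key)                 # 2. fall back to the database
    if results is not None:
        memcached[key] = results             # 3. cache for later bursts of requests
    return results
```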
  • S24 The server obtains a list of available sound banks according to the query result.
  • For example, referring to FIG. 3, the physical machine room obtains the query results from the memcached cluster and obtains from BOS cloud storage the storage address of each available sound library to use as its link information; the information of each available sound library can then include the query result (such as the suitable speech synthesis engine version, domain, and characteristic voice) and the link information, and the list of available sound libraries is composed of the information of the multiple available libraries.
  • S25 The server sends a list of available sound banks to the SDK.
  • the SDK acquires a sound bank selected by the user according to the available sound bank list, and downloads the sound bank selected by the user from the server.
  • the list is displayed to the user, and the user can select an available sound bank according to the displayed information.
  • In addition, referring to FIG. 3, after the featured sound library is stored, the storage module storing it can send the storage address of the featured sound library to the entry node of the server as link information; the entry node then combines the featured sound library information obtained from mysql with the link information obtained from the storage module into the information of an available sound library, and composes the list of available sound libraries from the information of the multiple libraries to send to the SDK.
  • When the storage module sends the link information to the entry node, it may send the identifier of the featured sound library together with the link information; when the featured sound library information is stored in mysql, the identifier is stored together with the information; and when the entry node obtains the featured sound library information from mysql, it obtains the identifier together with the information, so that the information obtained from mysql can be associated with the link information obtained from the storage module according to the identifier of the featured sound library.
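  • The association described above amounts to a join on the library identifier; a small sketch, with made-up field names:

```python
def build_available_list(mysql_rows, storage_links):
    """mysql_rows: {library_id: metadata dict}; storage_links: {library_id: storage address}."""
    available = []
    for library_id, info in mysql_rows.items():
        entry = dict(info)
        entry["library_id"] = library_id
        entry["download_url"] = storage_links.get(library_id)  # link info from the storage module
        available.append(entry)
    return available

available_list = build_available_list(
    {"lib-42": {"creator": "studio-a", "voice": "child", "engine_version": "3.0"}},
    {"lib-42": "bos://featured/lib-42.dat"},
)
```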
  • the SDK downloads the selected sound bank from the server according to the link information included in the information selected by the user.
  • S27 The SDK uses the downloaded sound library to synthesize the text into speech.
  • the sound library can be used to synthesize the text into speech to realize speech synthesis.
  • speech synthesis can be performed according to the information in the downloaded featured sound library and different speech synthesis methods.
  • Using the downloaded sound library to synthesize the text into speech includes one of the following (see the dispatch sketch after this list):
  • the text is processed, acoustic parameters are obtained from the processed text and the acoustic model, the corresponding acoustic segments are retrieved according to the acoustic parameters, and the acoustic segments are spliced and combined to obtain the synthesized speech; or
  • the text is processed, acoustic parameters are obtained from the processed text and the acoustic model, and vocoder parameter synthesis is performed on the acoustic parameters to obtain the synthesized speech; or
  • when the sound library includes an acoustic model, specific text, and the corresponding sound data: the text is preprocessed, and when specific text corresponding to the preprocessed text exists in the sound library, the sound data corresponding to that specific text, or the sound data obtained by decompressing it, is used as the synthesized speech.
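  • The three branches can be read as a dispatch on what the downloaded library contains; the sketch below shows only the control flow, with stubs where the actual signal processing would go.

```python
def synthesize_with_library(text, library):
    """library: dict that may hold 'acoustic_model', 'segments', and/or 'specific_audio'."""
    normalized = text.strip()

    # Branch 3: the library maps specific texts directly to (possibly compressed) recordings.
    if "specific_audio" in library and normalized in library["specific_audio"]:
        return maybe_decompress(library["specific_audio"][normalized])

    params = acoustic_parameters(normalized, library["acoustic_model"])

    # Branch 1: concatenative synthesis from stored acoustic segments.
    if "segments" in library:
        return splice(select_segments(params, library["segments"]))

    # Branch 2: parametric synthesis through a vocoder.
    return vocoder_synthesis(params)

# The helpers below are placeholders for steps the patent leaves to existing methods.
def maybe_decompress(data): ...
def acoustic_parameters(text, model): ...
def select_segments(params, segments): ...
def splice(segments): ...
def vocoder_synthesis(params): ...
```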
  • the specific content can be as follows:
  • the flow of speech synthesis may include:
  • the characteristic acoustic model adopted in this embodiment is not an existing acoustic model, and after determining the acoustic model, the flow of generating the acoustic parameters can be referred to the existing manner.
  • S45 Acquire corresponding acoustic segments in the featured sound library according to the acoustic parameters, perform stitching and synthesis on the acquired acoustic segments, and obtain synthesized speech corresponding to the text to be synthesized.
  • When the featured sound library is created, corresponding acoustic parameters can also be generated and stored together with the acoustic segments in the library, so that the matching acoustic segments can be found from the acoustic parameters during synthesis.
  • After the acoustic segments are found, they can be spliced to obtain the speech corresponding to the text, completing the synthesis.
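  • One way to picture the segment lookup and splicing, if the featured library stored each segment keyed by a label derived from its acoustic parameters (the keying scheme is an assumption, not the patent's):

```python
import numpy as np

def concatenate_segments(parameter_labels, segment_store):
    """Look up one waveform per predicted label and splice them end to end."""
    pieces = [segment_store[label] for label in parameter_labels]
    return np.concatenate(pieces)

# Dummy 16 kHz waveforms standing in for stored acoustic segments.
store = {"ni3": np.zeros(1600, dtype=np.int16), "hao3": np.ones(1600, dtype=np.int16)}
speech = concatenate_segments(["ni3", "hao3"], store)
```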
  • the flow of speech synthesis may include:
  • the characteristic acoustic model adopted in this embodiment is not an existing acoustic model, and after determining the acoustic model, the flow of generating the acoustic parameters can be referred to the existing manner.
  • S55 Synthesizing the vocoder parameters according to the acoustic parameters, and acquiring the synthesized speech corresponding to the text to be synthesized.
  • the vocoder is a device capable of generating sound according to acoustic parameters, so that the device can output synthesized speech.
  • the process of speech synthesis may include:
  • The match between the text to be synthesized and the specific text may be exact, or consistent within an allowed error range.
  • For example, the sound recorded in a certain sound library may correspond to content such as "Attention: traffic light ahead; running a red light will be fined."
  • When the preprocessed text matches a specific text in the library, the sound data corresponding to that specific text can be retrieved and used directly as the synthesized speech.
  • If the sound data corresponding to the specific text is stored in compressed form in the featured sound library, then after the corresponding sound data is retrieved it can be decompressed, and the decompressed sound data is used as the synthesized speech.
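  • A small sketch of this lookup path, where matching tolerates minor differences by normalizing whitespace and punctuation (the normalization rule is an illustrative assumption) and the stored audio is zlib-compressed:

```python
import re
import zlib

def normalize(text):
    # Collapse whitespace and drop punctuation so near-identical prompts still match.
    return re.sub(r"[^\w]+", "", text).lower()

def lookup_prompt(text, compressed_prompts):
    """compressed_prompts: {normalized specific text: zlib-compressed audio bytes}."""
    key = normalize(text)
    if key in compressed_prompts:
        return zlib.decompress(compressed_prompts[key])   # the decompressed audio is the output
    return None   # no match: fall back to model-based synthesis
```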
  • S61, S64, and S65 refer to related processes of existing speech synthesis.
  • the characteristic acoustic model adopted in this embodiment is not an existing acoustic model, and after determining the acoustic model, the flow of generating the acoustic parameters can be referred to the existing manner.
  • S67 Perform vocoder parameter synthesis according to the acoustic parameters, and obtain synthesized speech corresponding to the text to be synthesized.
  • the vocoder is a device capable of generating sound according to acoustic parameters, so that the device can output synthesized speech.
  • the flow of speech synthesis may include:
  • The match between the text to be synthesized and the specific text may be exact, or consistent within an allowed error range.
  • For example, the sound recorded in a certain sound library may correspond to content such as "Attention: traffic light ahead; running a red light will be fined."
  • When the preprocessed text matches a specific text in the library, the sound data corresponding to that specific text can be retrieved and used directly as the synthesized speech.
  • If the sound data corresponding to the specific text is stored in compressed form in the featured sound library, then after the corresponding sound data is retrieved it can be decompressed, and the decompressed sound data is used as the synthesized speech.
  • S71, S74, and S75 refer to related processes of existing speech synthesis.
  • the characteristic acoustic model adopted in this embodiment is not an existing acoustic model, and after determining the acoustic model, the flow of generating the acoustic parameters can be referred to the existing manner.
  • S77 Acquire corresponding acoustic segments in the featured sound library according to the acoustic parameters, perform stitching and combining on the acquired acoustic segments, and obtain synthesized speech corresponding to the text to be synthesized.
  • When the featured sound library is created, corresponding acoustic parameters can also be generated and stored together with the acoustic segments in the library, so that the matching acoustic segments can be found from the acoustic parameters during synthesis.
  • After the acoustic segments are found, they can be spliced to obtain the speech corresponding to the text, completing the synthesis.
  • In this embodiment, downloading the sound library from the server at synthesis time, rather than bundling it in the APP, reduces the size of the APP; moreover, compared with bundling sound libraries in the APP, the server can store many more sound libraries, so the user is given more choices, and the featured sound libraries can satisfy personalized needs and improve the user experience.
  • FIG. 8 is a schematic structural diagram of a voice synthesizing system according to another embodiment of the present invention.
  • the system includes: a client device 81, and the client device 81 includes:
  • the query module 811 is configured to query, from the server, a list of available sound banks when the voice synthesis is required, where the available sound library list includes information of a plurality of available sound banks, and the available sound library includes a featured sound library;
  • Unlike the prior art, in which the sound library is bundled directly in the APP, in this embodiment it is not necessary to include the sound library in the APP; instead it is downloaded from the server when needed.
  • For example, the software development kit (SDK) corresponding to the APP on the client sends a query request to the server requesting the list of available sound libraries; after receiving the request, the server obtains the list and sends it to the SDK.
  • the available sound banks in this embodiment include a featured sound bank. Of course, it can be understood that the available sound banks can also include existing common sound banks.
  • the featured sound library is pre-generated, a sound library for satisfying individual needs, a sound library different from the ordinary sound library, such as a children's sound library, or a user-defined sound library.
  • the obtaining module 812 is configured to obtain a sound bank selected by the user according to the available sound library list, and download a sound bank selected by the user from the server;
  • After the SDK obtains the list of available sound libraries from the server, the list can be displayed to the user.
  • When displaying it, the information of each available sound library can be shown, for example the creator of the library, the generation time, the version of the offline speech synthesis engine it suits, the domain of the text to be synthesized, whether the voice is male, female, or another characteristic voice, the sound quality, and so on, making it easier for the user to choose.
  • the user can select information for one or more available sound banks based on the information presented.
  • the SDK can determine the available sound bank selected by the corresponding user, and download the available sound bank selected by the user from the server according to the selected information.
  • the information of the available sound library further includes link information. After the user selects the information of the available sound library, the corresponding available sound library can be downloaded according to the link information in the selected information.
  • the synthesizing module 813 is configured to synthesize the text into a voice by using the downloaded sound bank.
  • the sound library can be used to implement speech synthesis.
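  • Read as code, the client device of FIG. 8 is three cooperating components; the sketch below mirrors modules 811-813 with hypothetical interfaces (the server object's query and fetch methods are assumptions).

```python
class QueryModule:                      # module 811
    def __init__(self, server):
        self.server = server
    def available_libraries(self, conditions):
        return self.server.query(conditions)

class ObtainingModule:                  # module 812
    def __init__(self, server):
        self.server = server
    def download(self, chosen_library_info):
        return self.server.fetch(chosen_library_info["download_url"])

class SynthesisModule:                  # module 813
    def synthesize(self, text, library_bytes):
        raise NotImplementedError("offline synthesis engine goes here")

class ClientDevice:                     # device 81 wires the three modules together
    def __init__(self, server):
        self.query = QueryModule(server)
        self.obtain = ObtainingModule(server)
        self.synth = SynthesisModule()
```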
  • the system further includes: a server device 82, the server device includes: a creating module 821 for creating a featured sound library, and the creating module 821 is specifically configured to:
  • establishing a characteristic acoustic model, acquiring acoustic segments, and acquiring sound data corresponding to specific text, the featured sound library consisting of the characteristic acoustic model, the acoustic segments, and the specific text with its sound data; or
  • establishing a characteristic acoustic model and acquiring sound data corresponding to specific text, the featured sound library consisting of the characteristic acoustic model and the specific text with its sound data.
  • the creation module 821 is used to create a characteristic acoustic model, including:
  • the sample size required to directly train the characteristic sound data to obtain the characteristic acoustic model is larger than the sample size required for the adaptive training of the existing acoustic model.
  • recording/collecting sound data of a certain size and specific timbre and performing artificial or automatic prosody labeling and boundary labeling, and training to obtain a characteristic acoustic model.
  • recording/collecting a small amount of sound data of a specific timbre and updating the existing acoustic model to a characteristic acoustic model through adaptive model training techniques.
  • obtaining an acoustic segment can include:
  • the training samples are segmented to obtain an acoustic segment.
  • recording/collecting sound data of a certain size and specific timbre and performing manual or automatic rhythm annotation and boundary labeling, and segmenting the acoustic segments.
  • The creating module 821 acquires sound data corresponding to specific text by: selecting the specific text to be recited; obtaining a specific speaker's recitation of that text; and using the recited speech, or a compressed version of the recited speech, as the sound data corresponding to the specific text.
  • the acquired recited voice may be compressed, and the compressed voice is used as the sound data finally stored in the featured sound bank.
  • In addition, different sound data can be obtained by having different speakers recite the same or different specific texts; the multiple specific texts can then be stored in correspondence with their sound data to form a customized library.
  • the system further includes: a first cluster system 822 and a second cluster system 823 at the server end, where the query module 811 is specifically configured to:
  • The query request may include query conditions, for example the speech synthesis engine version, the domain, and the desired characteristic voice.
  • the query result can be cached. See Figure 3 for an example of storing query results to the memcached cluster 36.
  • When the server receives a query request, it can first look in the memcached cluster; if a query result satisfying the conditions is found there, it is returned directly from the cache. Otherwise the server queries the mysql cluster; when the mysql cluster contains results satisfying the conditions, they are obtained from the mysql cluster and also cached into the memcached cluster, so that subsequent requests can be served directly from the memcached cluster.
  • the first cluster system here corresponds to the memcached cluster in the method embodiment.
  • For example, referring to FIG. 3, the physical machine room obtains the query results from the memcached cluster and obtains from BOS cloud storage the storage address of each available sound library to use as its link information; the information of each available sound library can then include the query result (such as the suitable speech synthesis engine version, domain, and characteristic voice) and the link information, and the list of available sound libraries is composed of the information of the multiple available libraries.
  • In some embodiments, the information of an available sound library includes the information generated when the available sound library is created; the system further comprises a second cluster system 823 at the server, and the second cluster system 823 is configured to store this information.
  • The information generated at creation time corresponds to the featured sound library information in the method embodiment; it can later be provided to the user as query results, and each query result can serve as part of the information of an available sound library in the list.
  • In some embodiments, the information of an available sound library includes the link information of the available sound library; the system further includes a storage module 824 at the server, configured to store the generated available sound library and to provide the storage address of the available sound library as its link information.
  • The information generated when an available sound library is created corresponds to the featured sound library information in the above embodiment.
  • The featured sound library information is the information generated for the featured sound library, for example its creator, generation time, the version of the offline speech synthesis engine it suits, the domain of the text to be synthesized, whether the voice is male, female, or another characteristic voice, the sound quality, and so on.
  • the characteristic sound library can be stored in a storage module for storing the featured sound library (indicated by BOS cloud storage in FIG. 3).
  • the characteristic sound library information is stored in a storage module (represented by mysql cluster in FIG. 3) 33 for storing the characteristic sound library information.
  • the featured sound library information can be provided to the user as a query result in the subsequent process, and each query result can be used as one of the available sound library information in the available sound bank list.
  • the second cluster system here corresponds to the mysql cluster in the method embodiment.
  • The storage module here corresponds to the BOS cloud storage in the method embodiment.
  • In addition, referring to FIG. 3, after the featured sound library is stored, the storage module storing it can send the storage address of the featured sound library to the entry node of the server as link information; the entry node then combines the featured sound library information obtained from mysql with the link information obtained from the storage module into the information of an available sound library, and composes the list of available sound libraries from the information of the multiple libraries to send to the SDK.
  • When the storage module sends the link information to the entry node, it may send the identifier of the featured sound library together with the link information; when the featured sound library information is stored in mysql, the identifier is stored together with the information; and when the entry node obtains the featured sound library information from mysql, it obtains the identifier together with the information, so that the information obtained from mysql can be associated with the link information obtained from the storage module according to the identifier of the featured sound library.
  • the SDK downloads the selected sound bank from the server according to the link information in the information selected by the user.
  • the synthesizing module 813 is specifically configured to:
  • the text is processed, acoustic parameters are obtained from the processed text and the acoustic model, the corresponding acoustic segments are retrieved according to the acoustic parameters, and the acoustic segments are spliced and combined to obtain the synthesized speech; or
  • the text is processed, acoustic parameters are obtained from the processed text and the acoustic model, and vocoder parameter synthesis is performed on the acoustic parameters to obtain the synthesized speech; or
  • when the sound library includes an acoustic model, specific text, and the corresponding sound data: the text is preprocessed, and when specific text corresponding to the preprocessed text exists in the sound library, the sound data corresponding to that specific text, or the sound data obtained by decompressing it, is used as the synthesized speech.
  • In this embodiment, downloading the sound library from the server at synthesis time, rather than bundling it in the APP, reduces the size of the APP; moreover, compared with bundling sound libraries in the APP, the server can store many more sound libraries, so the user is given more choices, and the featured sound libraries can satisfy personalized needs and improve the user experience.
  • An embodiment of the present invention further provides an electronic device, including: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform: when speech synthesis is required, querying the list of available sound libraries from the server, the list including information on a plurality of available sound libraries, the available sound libraries including a featured sound library; obtaining the sound library selected by the user from the list, and downloading the selected sound library from the server; and synthesizing text into speech using the downloaded sound library.
  • Embodiments of the present invention also provide a non-volatile computer storage medium storing one or more modules which, when executed, perform the same steps: querying the list of available sound libraries from the server; obtaining the sound library selected by the user from the list and downloading it from the server; and synthesizing text into speech using the downloaded sound library.
  • portions of the invention may be implemented in hardware, software, firmware or a combination thereof.
  • Multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system.
  • For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), and so on.
  • each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voice synthesis method and system. The voice synthesis method comprises: when voice synthesis needs to be performed, querying an available sound bank list from a serving end, wherein the available sound bank list comprises information about a plurality of available sound banks, and the available sound banks comprise a characteristic sound bank (S11); acquiring a sound bank selected by a user according to the available sound bank list, and downloading the sound bank selected by the user from the serving end (S12); and using the downloaded sound bank to synthesize text into voice (S13). The method can reduce the volume of an off-line voice synthesis APP, and can provide more choices for a user, thereby realizing personalized voice synthesis.

Description

Speech synthesis method and system
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 201510441079.6, entitled "Speech Synthesis Method and System", filed on July 24, 2015 by Baidu Online Network Technology (Beijing) Co., Ltd.
Technical field
The present invention relates to the field of speech processing technologies, and in particular to a speech synthesis method and system.
Background
In the prior art, when a user downloads an offline speech synthesis application (APP), the APP contains one or two sound libraries; when using the APP, the user selects one sound library, and the APP then uses the selected sound library to perform Text To Speech (TTS) on the text to be played.
However, in the prior art solution, on the one hand the sound library is bundled in the APP, and since sound library files are generally large the APP itself becomes large; on the other hand, the types of sound libraries included in the APP are limited, so the user has little choice.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
To this end, one object of the present invention is to provide a speech synthesis method that can reduce the size of an offline speech synthesis APP and give the user more choices, enabling personalized speech synthesis.
Another object of the present invention is to provide a speech synthesis system.
To achieve the above objects, a speech synthesis method according to an embodiment of the first aspect of the present invention includes: when speech synthesis is required, querying a list of available sound libraries from a server, the list including information on a plurality of available sound libraries, the available sound libraries including a featured sound library; obtaining the sound library selected by the user from the list, and downloading the selected sound library from the server; and synthesizing text into speech using the downloaded sound library.
In the speech synthesis method of the embodiment of the first aspect of the present invention, downloading the sound library from the server at synthesis time, rather than bundling it in the APP, reduces the size of the APP. In addition, compared with bundling sound libraries in the APP, the server can store many more sound libraries, so downloading from the server gives the user more choices, and including a featured sound library among the available libraries can satisfy personalized needs and improve the user experience.
To achieve the above objects, a speech synthesis system according to an embodiment of the second aspect of the present invention includes a client device, the client device including: a query module configured to query a list of available sound libraries from a server when speech synthesis is required, the list including information on a plurality of available sound libraries, the available sound libraries including a featured sound library; an obtaining module configured to obtain the sound library selected by the user from the list and download the selected sound library from the server; and a synthesis module configured to synthesize text into speech using the downloaded sound library.
In the speech synthesis system of the embodiment of the second aspect of the present invention, downloading the sound library from the server at synthesis time, rather than bundling it in the APP, likewise reduces the size of the APP; since the server can store many more sound libraries, the user is given more choices, and the featured sound library can satisfy personalized needs and improve the user experience.
An embodiment of the present invention further provides an electronic device, including: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform the method according to any embodiment of the first aspect of the present invention.
Embodiments of the present invention also provide a non-volatile computer storage medium storing one or more modules which, when executed, perform the method according to any embodiment of the first aspect of the present invention.
Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from the description or be learned by practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a speech synthesis method according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific example of a speech synthesis system in an embodiment of the present invention;
FIG. 4 is a schematic flowchart of speech synthesis in a specific example in an embodiment of the present invention;
FIG. 5 is a schematic flowchart of speech synthesis in another specific example in an embodiment of the present invention;
FIG. 6 is a schematic flowchart of speech synthesis in another specific example in an embodiment of the present invention;
FIG. 7 is a schematic flowchart of speech synthesis in another specific example in an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention.
Detailed description
The embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar modules or modules having the same or similar functions. The embodiments described below with reference to the drawings are illustrative only and are intended to explain, not limit, the present invention. On the contrary, the embodiments of the present invention cover all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
FIG. 1 is a schematic flowchart of a speech synthesis method according to an embodiment of the present invention; the method includes:
S11: When speech synthesis is required, the list of available sound libraries is queried from the server; the list includes information on a plurality of available sound libraries, and the available sound libraries include a featured sound library.
Unlike the prior art, in which the sound library is bundled directly in the APP, in this embodiment it is not necessary to include the sound library in the APP; instead it is downloaded from the server when needed.
For example, the software development kit (SDK) corresponding to the APP on the client sends a query request to the server requesting the list of available sound libraries; after receiving the request, the server obtains the list and sends it to the SDK.
The available sound libraries in this embodiment include a featured sound library; of course, it can be understood that they may also include existing ordinary sound libraries.
The featured sound library is a pre-generated sound library intended to satisfy personalized needs and different from an ordinary sound library, such as a children's voice library or a user-customized sound library.
S12: The sound library selected by the user from the list of available sound libraries is obtained, and the selected sound library is downloaded from the server.
After the SDK obtains the list of available sound libraries from the server, it can display the list to the user, showing the information of each available sound library, for example the creator of the library, the generation time, the version of the offline speech synthesis engine it suits, the domain of the text to be synthesized, whether the voice is male, female, or another characteristic voice, the sound quality, and so on, making it easier for the user to choose.
The user can select one or more available sound libraries based on the information presented.
After the user makes a selection, the SDK determines the selected available sound library and downloads it from the server; for example, the information of an available sound library also includes link information, and after the user selects a library the corresponding sound library can be downloaded according to that link information.
S13: The text is synthesized into speech using the downloaded sound library.
After the SDK downloads the sound library from the server, it can use that library to perform speech synthesis.
In this embodiment, downloading the sound library from the server at synthesis time, rather than bundling it in the APP, reduces the size of the APP; moreover, the server can store many more sound libraries than can be bundled in the APP, so the user is given more choices, and the featured sound library can satisfy personalized needs and improve the user experience.
FIG. 2 is a schematic flowchart of a speech synthesis method according to another embodiment of the present invention; this embodiment takes providing, and letting the user select, a featured sound library as an example. The method includes:
S21: The server creates a featured sound library and the corresponding featured sound library information, and stores both.
Creating the featured sound library can include:
establishing a characteristic acoustic model and acquiring acoustic segments, the featured sound library consisting of the characteristic acoustic model and the acoustic segments; or
establishing a characteristic acoustic model, the featured sound library consisting of the characteristic acoustic model; or
acquiring sound data corresponding to specific text, the featured sound library consisting of the specific text and the sound data; or
establishing a characteristic acoustic model, acquiring acoustic segments, and acquiring sound data corresponding to specific text, the featured sound library consisting of the characteristic acoustic model, the acoustic segments, and the specific text with its sound data; or
establishing a characteristic acoustic model and acquiring sound data corresponding to specific text, the featured sound library consisting of the characteristic acoustic model and the specific text with its sound data.
In some embodiments, establishing the characteristic acoustic model can include:
acquiring characteristic sound data and training on the characteristic sound data to establish the characteristic acoustic model; or
acquiring an existing acoustic model and characteristic sound data, and adaptively training the existing acoustic model with the characteristic sound data to establish the characteristic acoustic model.
The amount of sample data required to train a characteristic acoustic model directly from characteristic sound data is larger than that required to adaptively train an existing acoustic model.
For example, sound data of a certain scale and specific timbre is recorded or collected, manually or automatically annotated with prosody and boundary labels, and used to train a characteristic acoustic model. Alternatively, using an existing acoustic model, a small amount of sound data of a specific timbre is recorded or collected and, through adaptive model training techniques, the existing acoustic model is updated into a characteristic acoustic model.
In some embodiments, acquiring the acoustic segments can include:
segmenting the training samples to obtain the acoustic segments.
For example, sound data of a certain scale and specific timbre is recorded or collected, manually or automatically annotated with prosody and boundary labels, and segmented into acoustic segments.
In some embodiments, acquiring sound data corresponding to specific text may include:
selecting the specific text to be recited;
obtaining a specific speaker's recitation of the specific text; and
using the recited speech, or a compressed version of the recited speech, as the sound data corresponding to the specific text.
For example, a specific speaker is asked to recite the specific text expressively, and the corresponding sound data is obtained, realizing customization of the sound data.
Optionally, to save space, the acquired recited speech can be compressed, and the compressed speech is used as the sound data finally stored in the featured sound library.
In addition, different sound data can be obtained by having different speakers recite the same or different specific texts; the multiple specific texts can then be stored in correspondence with their sound data to form a customized library.
The featured sound library information is the information generated for the featured sound library, for example its creator, generation time, the version of the offline speech synthesis engine it suits, the domain of the text to be synthesized, whether the voice is male, female, or another characteristic voice, the sound quality, and so on.
After the featured sound library and its information are created, they can be stored. For example, referring to FIG. 3, after the creation module (the manage console in FIG. 3) 31 creates the featured sound library, the library can be stored in the storage module used for storing featured sound libraries (the BOS cloud storage in FIG. 3) 32; after the featured sound library information is created, it is stored in the storage module used for storing featured sound library information (the mysql cluster in FIG. 3) 33. In addition, the featured sound library information can later be provided to the user as query results, and each query result can serve as part of the information of an available sound library in the list.
S22:SDK向服务端发送查询请求。S22: The SDK sends a query request to the server.
其中,SDK可以在需要语音合成时发送该查询请求,例如,用户打开SDK点击用于触发语音合成的按钮后,SDK向服务端发送查询请求。The SDK may send the query request when voice synthesis is required. For example, after the user opens the SDK and clicks a button for triggering voice synthesis, the SDK sends a query request to the server.
参见图3,SDK 34发送的查询请求可以先发送到服务端的入口节点(图3中用物理机房表示)35处。Referring to FIG. 3, the query request sent by the SDK 34 can be sent to the ingress node of the server (represented by the physical room in FIG. 3) 35.
S23: The server obtains query results according to the query request.
The query request may contain query conditions, for example, the version of the speech synthesis engine, the domain, the distinctive voice, and so on. After receiving the query request, the server obtains the query results that satisfy the query conditions.
To cope with bursts of query requests that may come from the SDK side, the query results can be cached. Referring to FIG. 3, storing the query results in the memcached cluster 36 is taken as an example.
Therefore, when the server receives a query request, it can first look in the memcached cluster; if query results satisfying the query conditions are found there, they can be returned directly from the memcached cluster. Otherwise, if no matching query results are found in the memcached cluster, the mysql cluster is queried; when the mysql cluster contains query results satisfying the query conditions, the results are obtained from the mysql cluster, and the results obtained from the mysql cluster are cached in the memcached cluster so that later requests can be served directly from the memcached cluster. A sketch of this cache-aside lookup follows.
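The following is a minimal cache-aside sketch of this lookup, assuming Python; memcached_client and mysql_client are placeholders for whatever clients the server actually uses, and the key scheme, table, and column names are assumptions introduced only for illustration.

```python
def get_query_results(conditions: dict, memcached_client, mysql_client):
    """Look up sound-library query results, preferring the memcached cluster."""
    key = "soundlib:" + "|".join(f"{k}={conditions[k]}" for k in sorted(conditions))

    cached = memcached_client.get(key)          # 1) try the cache first
    if cached is not None:
        return cached

    results = mysql_client.query(               # 2) fall back to the mysql cluster
        "SELECT * FROM featured_library_info WHERE engine_version=%s AND domain=%s",
        (conditions.get("engine_version"), conditions.get("domain")),
    )
    if results:
        memcached_client.set(key, results)      # 3) populate the cache for later requests
    return results
```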
S24: The server obtains the available sound library list according to the query results.
For example, referring to FIG. 3, the physical equipment room obtains the query results from the memcached cluster and, in addition, can obtain from BOS cloud storage the storage address of each available sound library as its link information. For each available sound library, the information of the available sound library can then include the query result (such as the suitable speech synthesis engine version, domain, distinctive voice, and so on) and the link information, and the available sound library list can be composed of the information of multiple available sound libraries.
S25: The server sends the available sound library list to the SDK.
S26: The SDK obtains the sound library selected by the user according to the available sound library list, and downloads the user-selected sound library from the server.
For example, after obtaining the available sound library list, the SDK displays the list to the user, and the user can select an available sound library according to the displayed information.
In addition, referring to FIG. 3, after the featured sound library is stored, the storage module that stores it can send the storage address of the featured sound library to the entry node of the server as link information; the entry node then takes the featured sound library information obtained from mysql together with the link information obtained from the storage module as the information of an available sound library, and sends the available sound library list composed of the information of multiple available sound libraries to the SDK.
When the storage module sends the link information to the entry node, it can send the identifier of the featured sound library together with the link information; when the featured sound library information is stored in mysql, the identifier of the featured sound library is stored in correspondence with that information; and when the entry node obtains the featured sound library information from mysql, it obtains the identifier together with the information, so that the information obtained from mysql can be associated with the link information obtained from the storage module by the identifier of the featured sound library.
After the user selects the information of an available sound library, the SDK downloads the selected sound library from the server according to the link information included in the selected information, roughly as sketched below.
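The following is a minimal sketch of the download step, assuming Python and the requests library; the field names id and link and the local file-naming scheme are assumptions introduced only for illustration.

```python
import requests

def download_selected_library(selected_info: dict, dest_dir: str = ".") -> str:
    """Download the sound library the user selected, using its link information."""
    url = selected_info["link"]                       # storage address returned by the server
    path = f"{dest_dir}/soundlib_{selected_info['id']}.dat"

    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)                            # write the library to local storage
    return path
```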
S27: The SDK uses the downloaded sound library to synthesize the text into speech.
After the SDK obtains the sound library, it can use that library to synthesize text into speech, thereby realizing speech synthesis.
During speech synthesis, synthesis can be performed according to the contents of the downloaded featured sound library and different synthesis approaches.
Optionally, using the downloaded sound library to synthesize the text into speech includes one of the following cases (a dispatch sketch follows this list):
when the sound library includes an acoustic model and acoustic segments, processing the text, obtaining acoustic parameters according to the processed text and the acoustic model, obtaining the corresponding acoustic segments according to the acoustic parameters, and concatenating the obtained acoustic segments to obtain the synthesized speech; or,
when the sound library includes an acoustic model, processing the text, obtaining acoustic parameters according to the processed text and the acoustic model, and performing vocoder parameter synthesis according to the acoustic parameters to obtain the synthesized speech; or,
when the sound library includes an acoustic model, specific texts, and the corresponding sound data, preprocessing the text and, when a specific text consistent with the preprocessed text exists in the sound library, obtaining the sound data corresponding to that specific text and using the sound data, or the sound data after decompression, as the synthesized speech.
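The following is a minimal, non-limiting dispatch sketch of these three cases, assuming Python; the steps dict and every operation name in it (preprocess, gen_params, concat, vocode, lookup) are placeholders for steps the embodiments only describe in prose.

```python
def synthesize(text: str, library, steps: dict):
    """Choose a synthesis path based on what the downloaded sound library contains.

    `steps` supplies the concrete operations (all placeholders here):
    'preprocess', 'gen_params', 'concat', 'vocode', 'lookup'.
    """
    pre = steps["preprocess"](text)

    # Case 3: the library stores recordings for specific texts.
    if getattr(library, "recordings", None):
        audio = steps["lookup"](library, pre)
        if audio is not None:
            return audio                      # recorded (possibly decompressed) speech

    params = steps["gen_params"](pre, library.acoustic_model)

    # Case 1: acoustic segments available -> concatenative synthesis.
    if getattr(library, "segments", None):
        return steps["concat"](library, params)

    # Case 2: acoustic model only -> vocoder parameter synthesis.
    return steps["vocode"](params)
```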
Taking a featured sound library as the sound library as an example, the details can be as follows:
In some embodiments, referring to FIG. 4, the speech synthesis flow may include:
S41: Perform text preprocessing on the text to be synthesized.
S42: Perform text analysis on the preprocessed text.
S43: Perform prosody prediction on the analyzed text.
For details of S41-S43, refer to the related flows of existing speech synthesis.
S44: Generate acoustic parameters according to the prosody-predicted text and the characteristic acoustic model in the featured sound library.
Unlike the prior art, this embodiment uses the characteristic acoustic model rather than an existing acoustic model; once the acoustic model is determined, the flow of generating acoustic parameters can follow the existing approach.
S45: Obtain the corresponding acoustic segments from the featured sound library according to the acoustic parameters, and concatenate the obtained acoustic segments to obtain the synthesized speech corresponding to the text to be synthesized.
When the acoustic segments are created, the corresponding acoustic parameters can also be created, and the acoustic parameters are then stored in correspondence with the acoustic segments in the featured sound library, so that during speech synthesis the corresponding acoustic segments can be found from the acoustic parameters.
After the acoustic segments are obtained, they can be concatenated to obtain the speech corresponding to the text, realizing speech synthesis; a sketch of this flow follows.
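The following is a minimal sketch of this concatenative flow (FIG. 4, S45), assuming Python and numpy; matching each target unit to the stored segment with the nearest parameters is an assumption used only to make the lookup concrete, not the method the embodiments require.

```python
import numpy as np

def concatenative_synthesis(target_params: list, library_segments: list) -> np.ndarray:
    """For each target parameter vector, pick the stored segment whose parameters
    are closest, then concatenate the chosen waveforms."""
    chosen = []
    for params in target_params:                      # one vector per synthesis unit
        best = min(
            library_segments,
            key=lambda seg: float(np.linalg.norm(np.asarray(seg["params"]) - np.asarray(params))),
        )
        chosen.append(np.asarray(best["waveform"], dtype=np.float32))
    return np.concatenate(chosen)                     # spliced synthesized speech

# Usage with toy data: two stored segments, two target unit parameter vectors.
segments = [{"params": [0.0, 1.0], "waveform": [0.1, 0.2]},
            {"params": [1.0, 0.0], "waveform": [0.3, 0.4]}]
audio = concatenative_synthesis([[0.9, 0.1], [0.1, 0.9]], segments)
```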
In some embodiments, referring to FIG. 5, the speech synthesis flow may include:
S51: Perform text preprocessing on the text to be synthesized.
S52: Perform text analysis on the preprocessed text.
S53: Perform prosody prediction on the analyzed text.
For details of S51-S53, refer to the related flows of existing speech synthesis.
S54: Generate acoustic parameters according to the prosody-predicted text and the characteristic acoustic model in the featured sound library.
Unlike the prior art, this embodiment uses the characteristic acoustic model rather than an existing acoustic model; once the acoustic model is determined, the flow of generating acoustic parameters can follow the existing approach.
S55: Perform vocoder parameter synthesis according to the acoustic parameters to obtain the synthesized speech corresponding to the text to be synthesized.
A vocoder is a device capable of generating sound from acoustic parameters, so this device can be used to output the synthesized speech; a sketch of this flow follows.
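The following is a minimal sketch of vocoder parameter synthesis (FIG. 5, S55), assuming Python and numpy; a sinusoidal excitation driven by per-frame F0 and energy stands in for a real vocoder, which the embodiments do not specify.

```python
import numpy as np

def vocoder_synthesis(f0_per_frame, energy_per_frame, sr=16000, frame_len=0.01) -> np.ndarray:
    """Very rough vocoder stand-in: render each frame as a sinusoid at the
    predicted F0 scaled by the predicted energy."""
    samples_per_frame = int(sr * frame_len)
    out, phase = [], 0.0
    for f0, energy in zip(f0_per_frame, energy_per_frame):
        t = np.arange(samples_per_frame) / sr
        frame = energy * np.sin(2 * np.pi * f0 * t + phase)   # voiced excitation
        phase += 2 * np.pi * f0 * samples_per_frame / sr       # keep phase continuous
        out.append(frame)
    return np.concatenate(out).astype(np.float32)

# Usage with toy acoustic parameters: 3 frames around 200 Hz.
audio = vocoder_synthesis([200.0, 210.0, 205.0], [0.5, 0.6, 0.4])
```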
In some embodiments, referring to FIG. 6, the speech synthesis flow may include:
S61: Perform text preprocessing on the text to be synthesized.
S62: Determine whether sound data corresponding to the text to be synthesized exists in the featured sound library; if so, execute S63; otherwise, execute S64.
When specific texts and the corresponding sound data are stored in the featured sound library, a lookup can be used to determine whether a specific text consistent with the text to be synthesized is stored in the featured sound library.
It can be understood that, because different speakers may recite the same text content in different ways, the specific text and the corresponding sound data may match exactly, or match only within a tolerance. For example, for the specific text "Traffic light ahead, please obey the traffic rules", different people may improvise differently, and the recording in a given sound library may instead correspond to content such as "Careful, the light is about to change; running a red light gets you fined!"
S63: Obtain the corresponding sound data.
For example, when a specific text consistent with the text to be synthesized exists in the featured sound library, the sound data corresponding to that specific text can be obtained.
After the sound data is obtained, it can be used as the synthesized speech to be output. Alternatively, if the sound data corresponding to the specific text is stored in compressed form in the featured sound library, the sound data obtained from the library can be decompressed, and the decompressed sound data is used as the synthesized speech.
S64: Perform text analysis on the preprocessed text.
S65: Perform prosody prediction on the analyzed text.
For details of S61, S64, and S65, refer to the related flows of existing speech synthesis.
S66: Generate acoustic parameters according to the prosody-predicted text and the characteristic acoustic model in the featured sound library.
Unlike the prior art, this embodiment uses the characteristic acoustic model rather than an existing acoustic model; once the acoustic model is determined, the flow of generating acoustic parameters can follow the existing approach.
S67: Perform vocoder parameter synthesis according to the acoustic parameters to obtain the synthesized speech corresponding to the text to be synthesized.
A vocoder is a device capable of generating sound from acoustic parameters, so this device can be used to output the synthesized speech; a sketch of the FIG. 6 flow follows.
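The following is a minimal sketch of the FIG. 6 branch, assuming Python; the recordings dict, the use of gzip compression, and the predict_params and vocode callables are assumptions introduced only for illustration.

```python
import gzip

def synthesize_with_lookup(text: str, recordings: dict, predict_params, vocode):
    """Return the stored recording when the (preprocessed) text matches a specific
    text in the library; otherwise fall back to vocoder parameter synthesis."""
    key = text.strip()                        # stand-in for text preprocessing (S61)

    if key in recordings:                     # S62/S63: matching recording found
        return gzip.decompress(recordings[key])

    params = predict_params(key)              # S64-S66: analysis, prosody, acoustic params
    return vocode(params)                     # S67: vocoder parameter synthesis
```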
In some embodiments, referring to FIG. 7, the speech synthesis flow may include:
S71: Perform text preprocessing on the text to be synthesized.
S72: Determine whether sound data corresponding to the text to be synthesized exists in the featured sound library; if so, execute S73; otherwise, execute S74.
When specific texts and the corresponding sound data are stored in the featured sound library, a lookup can be used to determine whether a specific text consistent with the text to be synthesized is stored in the featured sound library.
It can be understood that, because different speakers may recite the same text content in different ways, the specific text and the corresponding sound data may match exactly, or match only within a tolerance. For example, for the specific text "Traffic light ahead, please obey the traffic rules", different people may improvise differently, and the recording in a given sound library may instead correspond to content such as "Careful, the light is about to change; running a red light gets you fined!"
S73: Obtain the corresponding sound data.
For example, when a specific text consistent with the text to be synthesized exists in the featured sound library, the sound data corresponding to that specific text can be obtained.
After the sound data is obtained, it can be used as the synthesized speech to be output. Alternatively, if the sound data corresponding to the specific text is stored in compressed form in the featured sound library, the sound data obtained from the library can be decompressed, and the decompressed sound data is used as the synthesized speech.
S74: Perform text analysis on the preprocessed text.
S75: Perform prosody prediction on the analyzed text.
For details of S71, S74, and S75, refer to the related flows of existing speech synthesis.
S76: Generate acoustic parameters according to the prosody-predicted text and the characteristic acoustic model in the featured sound library.
Unlike the prior art, this embodiment uses the characteristic acoustic model rather than an existing acoustic model; once the acoustic model is determined, the flow of generating acoustic parameters can follow the existing approach.
S77: Obtain the corresponding acoustic segments from the featured sound library according to the acoustic parameters, and concatenate the obtained acoustic segments to obtain the synthesized speech corresponding to the text to be synthesized.
When the acoustic segments are created, the corresponding acoustic parameters can also be created, and the acoustic parameters are then stored in correspondence with the acoustic segments in the featured sound library, so that during speech synthesis the corresponding acoustic segments can be found from the acoustic parameters.
After the acoustic segments are obtained, they can be concatenated to obtain the speech corresponding to the text, realizing speech synthesis. The FIG. 7 flow differs from FIG. 6 only in that the fallback path uses concatenation of acoustic segments instead of vocoder parameter synthesis.
In this embodiment, by downloading the sound library from the server at synthesis time instead of bundling it directly in the APP, the size of the APP can be reduced. In addition, compared with bundling sound libraries in the APP, the server can store more sound libraries, so downloading from the server gives the user more choices; and because the available sound libraries include featured sound libraries, users' personalized needs can be met and the user experience improved. By creating featured sound libraries in different ways and performing synthesis from them in different ways, the needs of different scenarios can be met and diversity achieved.
FIG. 8 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention. The system includes a client device 81, and the client device 81 includes:
a query module 811, configured to query an available sound library list from the server when speech synthesis is needed, where the available sound library list includes information of multiple available sound libraries, and the available sound libraries include featured sound libraries;
Unlike the prior art, which bundles the sound library directly in the APP, this embodiment does not need to include the sound library in the APP; instead, the sound library is downloaded from the server when it is needed.
For example, the Software Development Kit (SDK) corresponding to the APP on the client sends a query request to the server, the query request being used to request the available sound library list; after receiving the query request, the server obtains the available sound library list and sends it to the SDK.
The available sound libraries in this embodiment include featured sound libraries; of course, it can be understood that the available sound libraries may also include existing ordinary sound libraries.
A featured sound library is a pre-generated sound library intended to meet personalized needs and differing from ordinary sound libraries, for example a child-voice library or a user-customized library.
an obtaining module 812, configured to obtain the sound library selected by the user according to the available sound library list, and download the user-selected sound library from the server;
After the SDK obtains the available sound library list from the server, it can display the list to the user; when displaying it, the information of each available sound library can be shown in detail, for example the creator information, creation time, the version of the offline speech synthesis engine it is suited to, the domain of the text to be synthesized it is suited to, whether it is a male voice, a female voice, or another distinctive voice, sound quality, and so on, so that the user can choose conveniently.
The user can select the information of one or more available sound libraries according to the displayed information.
After the user selects the information of an available sound library, the SDK can determine the corresponding user-selected available sound library and download it from the server according to the selected information. For example, the information of an available sound library also includes link information; after the user selects the information of an available sound library, the corresponding available sound library can be downloaded according to the link information in the selected information.
a synthesis module 813, configured to synthesize text into speech using the downloaded sound library.
After the SDK downloads the sound library from the server, it can use that library to realize speech synthesis.
In some embodiments, referring to FIG. 9, the system further includes a server device 82, and the server device includes a creation module 821 for creating featured sound libraries, where the creation module 821 is specifically configured to:
build a characteristic acoustic model and obtain acoustic segments, the featured sound library being composed of the characteristic acoustic model and the acoustic segments; or,
build a characteristic acoustic model, the featured sound library being composed of the characteristic acoustic model; or,
obtain sound data corresponding to specific texts, the featured sound library being composed of the specific texts and the sound data; or,
build a characteristic acoustic model, obtain acoustic segments, and obtain sound data corresponding to specific texts, the featured sound library being composed of the characteristic acoustic model, the acoustic segments, the specific texts, and the sound data; or,
build a characteristic acoustic model and obtain sound data corresponding to specific texts, the featured sound library being composed of the characteristic acoustic model, the specific texts, and the sound data.
In some embodiments, the creation module 821 being used to build a characteristic acoustic model includes:
obtaining characteristic sound data and training on the characteristic sound data to build the characteristic acoustic model; or,
obtaining an existing acoustic model and characteristic sound data, and adaptively training the existing acoustic model according to the characteristic sound data to build the characteristic acoustic model.
The amount of sample data needed to train a characteristic acoustic model directly from characteristic sound data is larger than the amount needed to adaptively train an existing acoustic model.
For example, record or collect a certain amount of sound data with a specific timbre, perform manual or automatic prosody annotation and boundary annotation, and train a characteristic acoustic model. Alternatively, using an existing acoustic model, record or collect a small amount of sound data with the specific timbre and update the existing acoustic model into a characteristic acoustic model through adaptive model training techniques, roughly as sketched below.
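The following is a minimal sketch of the adaptive-training idea, assuming Python and numpy, and assuming (purely for illustration) a Gaussian acoustic model whose mean is adapted MAP-style toward a small amount of speaker-specific data; the embodiments do not name a specific adaptation technique.

```python
import numpy as np

def adapt_gaussian_mean(existing_mean: np.ndarray, adaptation_data: np.ndarray, tau: float = 10.0) -> np.ndarray:
    """Shift an existing model's mean toward the statistics of a small amount of
    characteristic sound data; tau controls how strongly the prior resists."""
    n = len(adaptation_data)
    data_mean = adaptation_data.mean(axis=0)
    weight = n / (n + tau)                      # few samples -> stay close to the existing model
    return weight * data_mean + (1.0 - weight) * existing_mean

# Usage with toy 2-dimensional "acoustic features".
existing = np.array([0.0, 0.0])
new_speaker_feats = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]])
adapted = adapt_gaussian_mean(existing, new_speaker_feats)
```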
In some embodiments, obtaining acoustic segments may include:
segmenting the training samples to obtain acoustic segments.
For example, record or collect a certain amount of sound data with a specific timbre, perform manual or automatic prosody annotation and boundary annotation, and segment it to obtain acoustic segments, roughly as sketched below.
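The following is a minimal sketch of cutting annotated recordings into acoustic segments, assuming Python and numpy; boundary times given in seconds and the dict layout of each segment are assumptions introduced only for illustration.

```python
import numpy as np

def cut_segments(audio: np.ndarray, boundaries_sec: list, sr: int = 16000) -> list:
    """Slice a recording into acoustic segments at annotated boundary times."""
    segments = []
    for start, end in boundaries_sec:                     # boundary annotation (in seconds)
        seg = audio[int(start * sr):int(end * sr)]
        segments.append({"waveform": seg, "start": start, "end": end})
    return segments

# Usage with a toy 1-second recording split into two segments.
recording = np.zeros(16000, dtype=np.float32)
segs = cut_segments(recording, [(0.0, 0.4), (0.4, 1.0)])
```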
In some embodiments, the creation module 821 being used to obtain sound data corresponding to specific texts includes:
selecting the specific text to be recited;
obtaining a specific speaker's recitation of the specific text;
using the recited speech, or the speech obtained by compressing the recited speech, as the sound data corresponding to the specific text.
For example, a specific speaker may be asked to recite the specific text expressively, and the corresponding sound data is acquired, so that the sound data is customized.
Optionally, to save space, the acquired recited speech may be compressed, and the compressed speech is used as the sound data finally stored in the featured sound library.
In addition, different sound data may be obtained by having different speakers recite the same or different specific texts; multiple specific texts and their corresponding sound data may then be stored in correspondence to form a customized library.
In some embodiments, referring to FIG. 9, the system further includes a first cluster system 822 and a second cluster system 823 located at the server, and the query module 811 is specifically configured to:
send a query request to the server, the query request containing query conditions, so that the server obtains query results according to the query conditions, where, when the query results exist in the first cluster system, the query results are obtained from the first cluster system, or, when the query results do not exist in the first cluster system, the query results are obtained from the second cluster system and the obtained query results are cached in the first cluster system;
receive the available sound library list sent by the server, the available sound library list being obtained by the server according to the query results.
The query request may contain query conditions, for example, the version of the speech synthesis engine, the domain, the distinctive voice, and so on. After receiving the query request, the server obtains the query results that satisfy the query conditions.
To cope with bursts of query requests that may come from the SDK side, the query results can be cached. Referring to FIG. 3, storing the query results in the memcached cluster 36 is taken as an example.
Therefore, when the server receives a query request, it can first look in the memcached cluster; if query results satisfying the query conditions are found there, they can be returned directly from the memcached cluster. Otherwise, if no matching query results are found in the memcached cluster, the mysql cluster is queried; when the mysql cluster contains query results satisfying the query conditions, the results are obtained from the mysql cluster, and the results obtained from the mysql cluster are cached in the memcached cluster so that later requests can be served directly from the memcached cluster.
The first cluster system here corresponds to the memcached cluster in the method embodiments.
For example, referring to FIG. 3, the physical equipment room obtains the query results from the memcached cluster and, in addition, can obtain from BOS cloud storage the storage address of each available sound library as its link information. For each available sound library, the information of the available sound library can then include the query result (such as the suitable speech synthesis engine version, domain, distinctive voice, and so on) and the link information, and the available sound library list can be composed of the information of multiple available sound libraries.
In some embodiments, referring to FIG. 9, the information of an available sound library includes information generated correspondingly when the available sound library is created, and the system further includes a second cluster system 823 located at the server, the second cluster system 823 being configured to store the information generated correspondingly when the available sound library is created. This information corresponds to the featured sound library information in the method embodiments; it can be provided to the user as query results in the subsequent flow, and each query result can serve as one kind of information in the information of each available sound library in the available sound library list.
In some embodiments, referring to FIG. 9, the information of an available sound library includes the link information of the available sound library, and the system further includes a storage module 824 located at the server, the storage module 824 being configured to store the generated available sound libraries and to use the storage address of each available sound library as its link information.
The information generated correspondingly when the available sound library is created corresponds to the featured sound library information in the above embodiments, that is, information generated for the featured sound library, for example, creator information, creation time, the version of the offline speech synthesis engine it is suited to, the domain of the text to be synthesized it is suited to, whether it is a male voice, a female voice, or another distinctive voice, sound quality, and so on.
After the featured sound library and its information are created, they can be stored. For example, referring to FIG. 3, after the creation module 31 (shown as the manage console in FIG. 3) creates a featured sound library, the featured sound library can be stored in the storage module 32 for storing featured sound libraries (shown as BOS cloud storage in FIG. 3); after the featured sound library information is created, it is stored in the storage module 33 for storing featured sound library information (shown as the mysql cluster in FIG. 3). In addition, the featured sound library information can be provided to the user as query results in the subsequent flow, and each query result can serve as one kind of information in the information of each available sound library in the available sound library list.
Therefore, the second cluster system here corresponds to the mysql cluster in the method embodiments, and the storage module here corresponds to the BOS cloud storage in the method embodiments.
In addition, referring to FIG. 3, after the featured sound library is stored, the storage module that stores it can send the storage address of the featured sound library to the entry node of the server as link information; the entry node then takes the featured sound library information obtained from mysql together with the link information obtained from the storage module as the information of an available sound library, and sends the available sound library list composed of the information of multiple available sound libraries to the SDK.
When the storage module sends the link information to the entry node, it can send the identifier of the featured sound library together with the link information; when the featured sound library information is stored in mysql, the identifier of the featured sound library is stored in correspondence with that information; and when the entry node obtains the featured sound library information from mysql, it obtains the identifier together with the information, so that the information obtained from mysql can be associated with the link information obtained from the storage module by the identifier of the featured sound library, roughly as sketched below.
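The following is a minimal sketch of associating the mysql-side library information with the storage-side link information by library identifier, assuming Python; the field names library_id, engine_version, domain, and link, and the example URL, are assumptions introduced only for illustration.

```python
def build_available_list(info_from_mysql: list, links_from_storage: dict) -> list:
    """Join featured-library info (from the mysql cluster) with link info
    (from the storage module) using the library identifier."""
    available = []
    for info in info_from_mysql:
        link = links_from_storage.get(info["library_id"])
        if link is not None:
            available.append({**info, "link": link})   # one entry per available library
    return available

# Usage with toy records.
infos = [{"library_id": "child_voice_v1", "engine_version": "2.3", "domain": "navigation"}]
links = {"child_voice_v1": "https://example.com/bos/child_voice_v1.dat"}
available_list = build_available_list(infos, links)
```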
After the user selects the information of an available sound library, the SDK downloads the selected sound library from the server according to the link information in the selected information.
In some embodiments, the synthesis module 813 is specifically configured to:
when the sound library includes an acoustic model and acoustic segments, process the text, obtain acoustic parameters according to the processed text and the acoustic model, obtain the corresponding acoustic segments according to the acoustic parameters, and concatenate the obtained acoustic segments to obtain the synthesized speech; or,
when the sound library includes an acoustic model, process the text, obtain acoustic parameters according to the processed text and the acoustic model, and perform vocoder parameter synthesis according to the acoustic parameters to obtain the synthesized speech; or,
when the sound library includes an acoustic model, specific texts, and the corresponding sound data, preprocess the text and, when a specific text consistent with the preprocessed text exists in the sound library, obtain the sound data corresponding to that specific text and use the sound data, or the sound data after decompression, as the synthesized speech.
For the specific content of speech synthesis, refer to FIGS. 4-7; details are not repeated here.
In this embodiment, by downloading the sound library from the server at synthesis time instead of bundling it directly in the APP, the size of the APP can be reduced. In addition, compared with bundling sound libraries in the APP, the server can store more sound libraries, so downloading from the server gives the user more choices; and because the available sound libraries include featured sound libraries, users' personalized needs can be met and the user experience improved. By creating featured sound libraries in different ways and performing synthesis from them in different ways, the needs of different scenarios can be met and diversity achieved.
An embodiment of the present invention further provides an electronic device, including: one or more processors; a memory; and one or more programs, the one or more programs being stored in the memory and, when executed by the one or more processors, causing the processors to:
query an available sound library list from the server when speech synthesis is needed, the available sound library list including information of multiple available sound libraries, the available sound libraries including featured sound libraries;
obtain the sound library selected by the user according to the available sound library list, and download the user-selected sound library from the server;
synthesize text into speech using the downloaded sound library.
An embodiment of the present invention further provides a non-volatile computer storage medium, the computer storage medium storing one or more modules that, when executed, perform the following:
querying an available sound library list from the server when speech synthesis is needed, the available sound library list including information of multiple available sound libraries, the available sound libraries including featured sound libraries;
obtaining the sound library selected by the user according to the available sound library list, and downloading the user-selected sound library from the server;
synthesizing text into speech using the downloaded sound library.
It should be noted that, in the description of the present invention, the terms "first", "second", and the like are used for descriptive purposes only and should not be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise stated, "multiple" means at least two.
Any process or method description in a flowchart, or otherwise described herein, can be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one of, or a combination of, the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
A person of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by a program instructing related hardware; the program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. When implemented in the form of a software functional module and sold or used as an independent product, the integrated module may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; a person of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (18)

  1. 一种语音合成方法,其特征在于,包括:A speech synthesis method, comprising:
    在需要语音合成时,从服务端查询可用音库列表,所述可用音库列表中包括多个可用音库的信息,所述可用音库包括特色音库;When voice synthesis is required, the available sound library list is queried from the server, and the available sound library list includes information of a plurality of available sound banks, and the available sound library includes a featured sound library;
    获取用户根据所述可用音库列表选择的音库,并从服务端下载用户选择的音库;Obtaining a sound bank selected by the user according to the available sound library list, and downloading a sound bank selected by the user from the server;
    采用下载的音库,将文本合成为语音。The text is synthesized into speech using the downloaded sound bank.
  2. 根据权利要求1所述的方法,其特征在于,还包括:创建特色音库,所述创建特色音库包括:The method of claim 1 further comprising: creating a featured sound bank, said creating a featured sound bank comprising:
    建立特色声学模型和获取声学片断,由所述特色声学模型和所述声学片断组成特色音库;或者,Establishing a characteristic acoustic model and acquiring an acoustic segment, and the characteristic acoustic model and the acoustic segment constitute a characteristic sound bank; or
    建立特色声学模型,由所述特色声学模型组成特色音库;或者,Establishing a characteristic acoustic model, and the characteristic acoustic model is composed of the characteristic acoustic library; or
    获取与特定文本对应的声音数据,由所述特定文本与所述声音数据组成特色音库;或者,Obtaining sound data corresponding to a specific text, and the specific text and the sound data constitute a characteristic sound bank; or
    建立特色声学模型、获取声学片断,以及,获取与特定文本对应的声音数据,由所述特色声学模型,声学片断,以及,所述特定文本与所述声音数据组成特色音库;或者,Establishing a characteristic acoustic model, acquiring an acoustic segment, and acquiring sound data corresponding to the specific text, the characteristic acoustic model, the acoustic segment, and the specific text and the sound data composing a characteristic sound bank; or
    建立特色声学模型,获取与特定文本对应的声音数据,由所述特色声学模型,以及,所述特定文本与所述声音数据组成特色音库。A characteristic acoustic model is established to acquire sound data corresponding to a specific text, and the characteristic acoustic model, and the specific text and the sound data constitute a characteristic sound bank.
  3. 根据权利要求2所述的方法,其特征在于,所述建立特色声学模型,包括:The method of claim 2 wherein said establishing a characteristic acoustic model comprises:
    获取特色声音数据,并对所述特色声音数据进行训练,建立特色声学模型;或者,Obtaining characteristic sound data, and training the characteristic sound data to establish a characteristic acoustic model; or
    获取已有的声学模型和特色声音数据,根据所述特色声音数据对所述已有的声学模型进行自适应训练,建立特色声学模型。Obtaining an existing acoustic model and characteristic sound data, and adaptively training the existing acoustic model according to the characteristic sound data to establish a characteristic acoustic model.
  4. 根据权利要求2或3所述的方法,其特征在于,所述获取与特定文本对应的声音数据,包括:The method according to claim 2 or 3, wherein the acquiring the sound data corresponding to the specific text comprises:
    选取要朗诵的特定文本;Select the specific text you want to recite;
    获取特定发音人对所述特定文本的朗诵语音;Obtaining a recited voice of a particular speaker to the particular text;
    将所述朗诵语音或者对所述朗诵语音进行压缩处理后的语音作为与所述特定文本对应的声音数据。The recited voice or the voice compressed by the recited voice is used as sound data corresponding to the specific text.
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述从服务端查询可用音库列表,包括:The method according to any one of claims 1-4, wherein the querying the list of available sound banks from the server comprises:
    sending a query request to the server, the query request containing query conditions, so that the server obtains query results according to the query conditions, wherein, when the query results exist in a first cluster system, the query results are obtained from the first cluster system, or, when the query results do not exist in the first cluster system, the query results are obtained from a second cluster system and the obtained query results are cached in the first cluster system;
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述可用音库的信息包括:在创建可用音库后对应生成的信息,所述信息被存储在服务端的第二集群系统中。The method according to any one of claims 1 to 5, wherein the information of the available sound library comprises: correspondingly generated information after creating an available sound bank, the information being stored in a second cluster system of the server. in.
  7. 根据权利要求1-6任一项所述的方法,其特征在于,所述可用音库的信息包括:可用音库的链接信息,所述从服务端下载用户选择的音库,包括:The method according to any one of claims 1-6, wherein the information of the available sound library comprises: link information of the available sound library, and the downloading the sound bank selected by the user from the server comprises:
    根据所述链接信息从服务端下载对应的音库,其中,所述链接信息是存储可用音库后的存储地址。And downloading, according to the link information, a corresponding sound bank from the server, wherein the link information is a storage address after storing the available sound bank.
  8. 根据权利要求1-7任一项所述的方法,其特征在于,所述采用下载的音库,将文本合成为语音,包括:The method according to any one of claims 1 to 7, wherein the synthesizing the text into a voice using the downloaded sound library comprises:
    when the sound library includes an acoustic model and acoustic segments, processing the text, obtaining acoustic parameters according to the processed text and the acoustic model, obtaining the corresponding acoustic segments according to the acoustic parameters, and concatenating the obtained acoustic segments to obtain the synthesized speech; or,
    当所述音库内包括声学模型时,对文本进行处理,根据处理后的文本和所述声学模型获取声学参数,根据所述声学参数进行声码器参数合成,获取合成语音;或者,When the acoustic model is included in the sound library, the text is processed, the acoustic parameters are acquired according to the processed text and the acoustic model, and the vocoder parameters are synthesized according to the acoustic parameters to obtain the synthesized speech; or
    when the sound library includes an acoustic model, specific texts, and corresponding sound data, preprocessing the text and, when a specific text consistent with the preprocessed text exists in the sound library, obtaining the sound data corresponding to that specific text and using the sound data, or the sound data after decompression, as the synthesized speech.
  9. 一种语音合成系统,其特征在于,包括:客户端装置,所述客户端装置包括:A speech synthesis system, comprising: a client device, the client device comprising:
    查询模块,用于在需要语音合成时,从服务端查询可用音库列表,所述可用音库列表中包括多个可用音库的信息,所述可用音库包括特色音库;a query module, configured to query, from the server, a list of available sound banks, where the available sound library list includes information of a plurality of available sound banks, where the available sound library includes a featured sound library;
    获取模块,用于获取用户根据所述可用音库列表选择的音库,并从服务端下载用户选择的音库;An obtaining module, configured to acquire a sound bank selected by the user according to the available sound library list, and download a sound bank selected by the user from the server;
    合成模块,用于采用下载的音库,将文本合成为语音。A synthesis module that synthesizes text into speech using a downloaded sound bank.
  10. 根据权利要求9所述的系统,其特征在于,还包括:服务端装置,所述服务端装置包括用于创建特色音库的创建模块,所述创建模块具体用于:The system of claim 9, further comprising: a server device, the server device comprising a creation module for creating a featured sound library, the creation module being specifically configured to:
    建立特色声学模型和获取声学片断,由所述特色声学模型和所述声学片断组成特色音库;或者,Establishing a characteristic acoustic model and acquiring an acoustic segment, and the characteristic acoustic model and the acoustic segment constitute a characteristic sound bank; or
    建立特色声学模型,由所述特色声学模型组成特色音库;或者,Establishing a characteristic acoustic model, and the characteristic acoustic model is composed of the characteristic acoustic library; or
    acquiring sound data corresponding to specific texts, the featured sound library being composed of the specific texts and the sound data; or,
    建立特色声学模型,获取与特定文本对应的声音数据,由所述特色声学模型,以及,所述特定文本与所述声音数据组成特色音库。A characteristic acoustic model is established to acquire sound data corresponding to a specific text, and the characteristic acoustic model, and the specific text and the sound data constitute a characteristic sound bank.
  11. 根据权利要求10所述的系统,其特征在于,所述创建模块用于建立特色声学模型,包括:The system of claim 10 wherein said creating module is operative to create a characteristic acoustic model comprising:
    获取特色声音数据,并对所述特色声音数据进行训练,建立特色声学模型;或者,Obtaining characteristic sound data, and training the characteristic sound data to establish a characteristic acoustic model; or
    获取已有的声学模型和特色声音数据,根据所述特色声音数据对所述已有的声学模型进行自适应训练,建立特色声学模型。Obtaining an existing acoustic model and characteristic sound data, and adaptively training the existing acoustic model according to the characteristic sound data to establish a characteristic acoustic model.
  12. 根据权利要求10或11所述的系统,其特征在于,所述创建模块用于获取与特定文本对应的声音数据,包括:The system according to claim 10 or 11, wherein the creating module is configured to acquire sound data corresponding to the specific text, including:
    选取要朗诵的特定文本;Select the specific text you want to recite;
    获取特定发音人对所述特定文本的朗诵语音;Obtaining a recited voice of a particular speaker to the particular text;
    将所述朗诵语音或者对所述朗诵语音进行压缩处理后的语音作为与所述特定文本对应的声音数据。The recited voice or the voice compressed by the recited voice is used as sound data corresponding to the specific text.
  13. 根据权利要求9-12任一项所述的系统,其特征在于,还包括:位于服务端的第一集群系统和第二集群系统,所述查询模块具体用于:The system according to any one of claims 9 to 12, further comprising: a first cluster system and a second cluster system at the server end, wherein the query module is specifically configured to:
    向服务端发送查询请求,所述查询请求中包含查询条件,使得所述服务端根据所述查询条件获取查询结果,其中,当第一集群系统中存在所述查询结果时,从所述第一集群系统中获取所述查询结果,或者,当所述第一集群系统中不存在所述查询结果时,从第二集群系统中获取所述查询结果,并将获取的查询结果缓存到所述第一集群系统中;Sending a query request to the server, where the query request includes a query condition, so that the server obtains the query result according to the query condition, wherein when the query result exists in the first cluster system, the first Obtaining the query result in the cluster system, or acquiring the query result from the second cluster system when the query result does not exist in the first cluster system, and buffering the obtained query result to the first In a cluster system;
    接收所述服务端发送的可用音库列表,所述可用音库列表是所述服务端根据所述查询结果获取的。Receiving a list of available sound banks sent by the server, where the available sound library list is obtained by the server according to the query result.
  14. 根据权利要求9-13任一项所述的系统,其特征在于,所述可用音库的信息包括:在创建可用音库后对应生成的信息,所述系统还包括:位于服务端的第二集群系统,所述第二集群系统用于存储所述创建可用音库后对应生成的信息。The system according to any one of claims 9 to 13, wherein the information of the available sound library comprises: corresponding information generated after creating an available sound library, the system further comprising: a second cluster located at the server end The second cluster system is configured to store the information corresponding to the generated sound library.
  15. 根据权利要求9-14任一项所述的系统,其特征在于,所述可用音库的信息包括:可用音库的链接信息,所述系统还包括:位于服务端的存储模块,所述存储模块用于存储生成的可用音库,并将可用音库的存储地址作为链接信息。The system according to any one of claims 9 to 14, wherein the information of the available sound library comprises: link information of the available sound library, the system further comprising: a storage module located at the server, the storage module Used to store the generated available sound bank and use the storage address of the available sound bank as the link information.
  16. 根据权利要求9-15任一项所述的系统,其特征在于,所述合成模块具体用于:The system according to any one of claims 9 to 15, wherein the synthesis module is specifically configured to:
    when the sound library includes an acoustic model and acoustic segments, processing the text, obtaining acoustic parameters according to the processed text and the acoustic model, obtaining the corresponding acoustic segments according to the acoustic parameters, and concatenating the obtained acoustic segments to obtain the synthesized speech; or,
    当所述音库内包括声学模型、特定文本与对应的声音数据时,对文本进行预处理,在所述音库内存在与预处理后的文本一致的特定文本时,获取与所述特定文本对应的声音数据,将所述声音数据或者对所述声音数据进行解压缩处理后的声音数据作为合成语音。When the sound library includes an acoustic model, specific text, and corresponding sound data, the text is preprocessed, and when the specific text corresponding to the preprocessed text exists in the sound library, the specific text is acquired Corresponding sound data, the sound data or the sound data obtained by decompressing the sound data is used as a synthesized voice.
  17. An electronic device, comprising:
    one or more processors;
    a memory; and
    one or more programs stored in the memory which, when executed by the one or more processors,
    perform the method according to any one of claims 1 to 8.
  18. A non-volatile computer storage medium, wherein the computer storage medium stores one or more modules which, when executed,
    perform the method according to any one of claims 1 to 8.
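
Illustrative sketches of claims 12, 13 and 16 follow. The acquisition step of claim 12 can be pictured as pairing each selected text with the recorded speech of a particular speaker, optionally compressed. This is a minimal sketch only: it assumes the caller supplies a `record_speaker` callable returning the recited audio as bytes, and it uses `zlib` merely as a stand-in for a real audio codec; none of these names come from the claimed system.

```python
import zlib


def build_text_sound_pairs(texts, record_speaker, compress=True):
    """Pair each selected text with the recorded (optionally compressed) speech."""
    sound_data = {}
    for text in texts:
        audio = record_speaker(text)      # recited speech of a particular speaker, as bytes
        if compress:
            audio = zlib.compress(audio)  # stand-in for any real audio codec
        sound_data[text] = audio          # sound data corresponding to the specific text
    return sound_data
```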
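The server-side lookup of claim 13 follows a cache-aside pattern: serve the query result from the first cluster system when it is already there, otherwise fetch it from the second cluster system and back-fill the first. The sketch below assumes both cluster systems expose simple `get`/`set` calls; these are placeholders, not the API of any particular cluster software.

```python
def query_available_libraries(query_condition, first_cluster, second_cluster):
    """Resolve a query condition to a query result using two server-side clusters."""
    key = repr(sorted(query_condition.items()))   # derive a stable cache key from the condition
    result = first_cluster.get(key)               # hit: serve directly from the first cluster
    if result is None:
        result = second_cluster.get(key)          # miss: fall back to the second cluster
        if result is not None:
            first_cluster.set(key, result)        # cache the result in the first cluster
    return result                                 # used to build the list of available sound libraries
```

With the back-fill in place, repeated queries under the same condition are answered by the first cluster system alone.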
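Claim 16 distinguishes three synthesis paths depending on what the downloaded sound library contains: splicing of stored acoustic segments, vocoder parameter synthesis, or direct playback of pre-recorded sound data for matching text. The dispatch below sketches only that branching; the library layout and the `frontend`, `model`, `segments`, `vocoder`, `text_audio` and `decompress` entries are assumed placeholders, not the applicant's implementation.

```python
def synthesize(text, library):
    """Choose a synthesis path based on the contents of the sound library.

    `library` is a dict supplied by the downloaded sound bank, e.g.:
      frontend   -- callable: text -> linguistic features
      model      -- callable: features -> sequence of acoustic parameter keys
      segments   -- optional dict: parameter key -> stored audio segment (bytes)
      vocoder    -- callable: acoustic parameters -> waveform bytes
      text_audio -- optional dict: specific text -> pre-recorded audio (bytes)
      decompress -- callable: compressed audio -> raw audio bytes
    """
    if "text_audio" in library:
        # Path 3: exact match against the stored specific texts.
        audio = library["text_audio"].get(text.strip())
        if audio is not None:
            return library["decompress"](audio)

    features = library["frontend"](text)          # text processing
    params = library["model"](features)           # acoustic parameters

    if "segments" in library:
        # Path 1: pick the acoustic segments indexed by the parameters and splice them.
        units = [library["segments"][p] for p in params if p in library["segments"]]
        return b"".join(units)

    # Path 2: vocoder parameter synthesis.
    return library["vocoder"](params)
```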
PCT/CN2015/097162 2015-07-24 2015-12-11 Voice synthesis method and system WO2017016135A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510441079.6 2015-07-24
CN201510441079.6A CN104992703B (en) 2015-07-24 2015-07-24 Phoneme synthesizing method and system

Publications (1)

Publication Number Publication Date
WO2017016135A1

Family

ID=54304506

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/097162 WO2017016135A1 (en) 2015-07-24 2015-12-11 Voice synthesis method and system

Country Status (2)

Country Link
CN (1) CN104992703B (en)
WO (1) WO2017016135A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104992703B (en) * 2015-07-24 2017-10-03 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and system
CN109036374B (en) * 2018-07-03 2019-12-03 百度在线网络技术(北京)有限公司 Data processing method and device
CN110021291B (en) * 2018-12-26 2021-01-29 创新先进技术有限公司 Method and device for calling voice synthesis file
CN109903748A (en) * 2019-02-14 2019-06-18 平安科技(深圳)有限公司 A kind of phoneme synthesizing method and device based on customized sound bank
CN112750423B (en) * 2019-10-29 2023-11-17 阿里巴巴集团控股有限公司 Personalized speech synthesis model construction method, device and system and electronic equipment
CN110782869A (en) * 2019-10-30 2020-02-11 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN110856023A (en) * 2019-11-15 2020-02-28 四川长虹电器股份有限公司 System and method for realizing customized broadcast of smart television based on TTS
CN111986648A (en) * 2020-06-29 2020-11-24 联想(北京)有限公司 Information processing method, device and equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040054534A1 (en) * 2002-09-13 2004-03-18 Junqua Jean-Claude Client-server voice customization
JP2010521709A (en) * 2007-03-21 2010-06-24 トムトム インターナショナル ベスローテン フエンノートシャップ Apparatus and method for converting text into speech and delivering the same
CN102137140A (en) * 2010-10-08 2011-07-27 华为软件技术有限公司 Method, device and system for processing streaming services
US9117451B2 (en) * 2013-02-20 2015-08-25 Google Inc. Methods and systems for sharing of adapted voice profiles

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001272992A (en) * 2000-03-27 2001-10-05 Ricoh Co Ltd Voice processing system, text reading system, voice recognition system, dictionary acquiring method, dictionary registering method, terminal device, dictionary server, and recording medium
JP2002156988A (en) * 2000-11-21 2002-05-31 Matsushita Electric Ind Co Ltd Information providing system and voice synthesizer
JP2003233386A (en) * 2002-02-08 2003-08-22 Nippon Telegr & Teleph Corp <Ntt> Voice synthesizing method, voice synthesizer and voice synthesizing program
CN101246687A (en) * 2008-03-20 2008-08-20 北京航空航天大学 Intelligent voice interaction system and method thereof
CN102117614A (en) * 2010-01-05 2011-07-06 索尼爱立信移动通讯有限公司 Personalized text-to-speech synthesis and personalized speech feature extraction
US20140019137A1 (en) * 2012-07-12 2014-01-16 Yahoo Japan Corporation Method, system and server for speech synthesis
CN103581857A (en) * 2013-11-05 2014-02-12 华为终端有限公司 Method for giving voice prompt, text-to-speech server and terminals
CN104992703A (en) * 2015-07-24 2015-10-21 百度在线网络技术(北京)有限公司 Speech synthesis method and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950016A (en) * 2019-05-14 2020-11-17 北京腾云天下科技有限公司 Method and device for generating data open output model and computing equipment
CN111950016B (en) * 2019-05-14 2023-11-21 北京腾云天下科技有限公司 Method and device for generating data open output model and computing equipment
CN110364139A (en) * 2019-06-27 2019-10-22 上海麦克风文化传媒有限公司 A kind of matched text-to-speech working method of progress Autonomous role
CN110364139B (en) * 2019-06-27 2023-04-18 上海麦克风文化传媒有限公司 Character-to-speech working method for intelligent role matching

Also Published As

Publication number Publication date
CN104992703B (en) 2017-10-03
CN104992703A (en) 2015-10-21

Similar Documents

Publication Publication Date Title
WO2017016135A1 (en) Voice synthesis method and system
US10861210B2 (en) Techniques for providing audio and video effects
CN106898340B (en) Song synthesis method and terminal
JP2021103328A (en) Voice conversion method, device, and electronic apparatus
KR101051252B1 (en) Methods, systems, and computer readable recording media for email management for rendering email in digital audio players
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
TW202006534A (en) Method and device for audio synthesis, storage medium and calculating device
WO2017008426A1 (en) Speech synthesis method and device
JP6665446B2 (en) Information processing apparatus, program, and speech synthesis method
WO2016067766A1 (en) Voice synthesis device, voice synthesis method and program
CN110019962B (en) Method and device for generating video file information
CN112512649B (en) Techniques for providing audio and video effects
WO2017059694A1 (en) Speech imitation method and device
US20090177473A1 (en) Applying vocal characteristics from a target speaker to a source speaker for synthetic speech
JP2021101252A (en) Information processing method, information processing apparatus, and program
CN113704390A (en) Interaction method and device of virtual objects, computer readable medium and electronic equipment
CN113439447A (en) Room acoustic simulation using deep learning image analysis
WO2014067269A1 (en) Sent message playing method, system and related device
JP2023527473A (en) AUDIO PLAYING METHOD, APPARATUS, COMPUTER-READABLE STORAGE MEDIUM AND ELECTRONIC DEVICE
KR20240038941A (en) Method and system for generating avatar based on text
WO2022143530A1 (en) Audio processing method and apparatus, computer device, and storage medium
US20200111475A1 (en) Information processing apparatus and information processing method
CN115690277A (en) Video generation method, system, device, electronic equipment and computer storage medium
KR102001314B1 (en) Method and apparatus of enhancing audio quality recorded in karaoke room
CN114514576A (en) Data processing method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15899484; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15899484; Country of ref document: EP; Kind code of ref document: A1)