WO2017016135A1

WO2017016135A1 - Voice synthesis method and system

Info

Publication number: WO2017016135A1
Application number: PCT/CN2015/097162
Authority: WO
Inventors: 李秀林; 白洁; 李维高; 唐海员
Original assignee: 百度在线网络技术（北京）有限公司
Priority date: 2015-07-24
Filing date: 2015-12-11
Publication date: 2017-02-02
Also published as: CN104992703B; CN104992703A

Abstract

A voice synthesis method and system. The voice synthesis method comprises: when voice synthesis needs to be performed, querying an available sound bank list from a serving end, wherein the available sound bank list comprises information about a plurality of available sound banks, and the available sound banks comprise a characteristic sound bank (S11); acquiring a sound bank selected by a user according to the available sound bank list, and downloading the sound bank selected by the user from the serving end (S12); and using the downloaded sound bank to synthesize text into voice (S13). The method can reduce the volume of an off-line voice synthesis APP, and can provide more choices for a user, thereby realizing personalized voice synthesis.

Description

Speech synthesis method and system

Cross-reference to related applications

This application claims the priority of Chinese Patent Application No. 201510441079.6, filed on Jul. 24, 2015, entitled "Speech Synthesis Method and System" by Baidu Online Network Technology (Beijing) Co., Ltd.

Technical field

The present invention relates to the field of voice processing technologies, and in particular, to a voice synthesis method and system.

Background technique

In the prior art, when downloading an offline voice synthesis application (APP), the user may include one or two sound banks, and when the user uses the application, a sound library is selected, and then the application uses the user selection. The sound library performs Text To Speech (TTS) on the text to be played.

However, the prior art solution, on the one hand, includes a sound bank in the APP. Since the sound file is generally large, the size of the APP is large, and on the other hand, the type of the sound library included in the APP is limited, resulting in limited user selection space. .

Summary of the invention

The present invention aims to solve at least one of the technical problems in the related art to some extent.

To this end, it is an object of the present invention to provide a speech synthesis method that can reduce the volume of offline speech synthesis APP and provide users with more choices for personalized speech synthesis.

Another object of the present invention is to provide a speech synthesis system.

In order to achieve the above object, a speech synthesis method according to the first aspect of the present invention includes: when a speech synthesis is required, querying a list of available sound banks from a server, where the available sound library list includes information of a plurality of available sound banks. The available sound library includes a featured sound library; the sound library selected by the user according to the available sound library list is obtained, and the sound bank selected by the user is downloaded from the server; and the downloaded sound library is used to synthesize the text into a voice.

The speech synthesis method proposed by the embodiment of the first aspect of the present invention can reduce the volume of the APP by downloading the sound bank from the server during speech synthesis, instead of directly including the sound bank in the APP, and additionally, the sound is included in the APP. In the way of the library, more sound banks can be stored in the server. By downloading the sound library on the server, you can provide more choices for users. By including the featured sound library in the available sound library, you can meet the individual needs of the user and improve. user experience.

In order to achieve the above object, a speech synthesis system according to a second aspect of the present invention includes: a client device, where the client device includes: a query module, configured to query a list of available sound banks from a server when voice synthesis is required , The available sound library list includes information of a plurality of available sound banks, the available sound library includes a featured sound library, and an obtaining module is configured to acquire a sound library selected by the user according to the available sound library list, and download the sound library from the server. User-selected sound library; synthesis module for synthesizing text into speech using the downloaded sound library.

The speech synthesis system proposed by the embodiment of the second aspect of the present invention can reduce the volume of the APP by downloading the sound library from the server during speech synthesis, instead of directly including the sound bank in the APP, and additionally, the sound is included in the APP. In the way of the library, more sound banks can be stored in the server. By downloading the sound library on the server, you can provide more choices for users. By including the featured sound library in the available sound library, you can meet the individual needs of the user and improve. user experience.

An embodiment of the present invention further provides an electronic device, including: one or more processors; a memory; one or more programs, the one or more programs being stored in the memory when the one or more When the processor is executed: the method according to any of the first aspect of the invention is performed.

Embodiments of the present invention also provide a non-volatile computer storage medium having one or more modules stored when the one or more modules are executed: performing the first aspect of the present invention The method of any of the preceding claims.

The additional aspects and advantages of the invention will be set forth in part in the description which follows.

DRAWINGS

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from

1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;

2 is a schematic flow chart of a method for voice synthesis according to another embodiment of the present invention;

3 is a schematic diagram of a specific example of a speech synthesis system in an embodiment of the present invention;

4 is a schematic flowchart of voice synthesis of a specific example in the embodiment of the present invention;

FIG. 5 is a schematic flowchart of voice synthesis according to another specific example in the embodiment of the present invention; FIG.

6 is a schematic flowchart of voice synthesis of another specific example in the embodiment of the present invention;

7 is a schematic flowchart of voice synthesis of another specific example in the embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention; FIG.

FIG. 9 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention.

detailed description

The embodiments of the present invention are described in detail below, and the examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals indicate the same or similar modules or modules having the same or similar functions. The following is described by reference to the drawings The embodiments described are illustrative only and are not to be construed as limiting the invention. Rather, the invention is to cover all modifications, modifications and equivalents within the spirit and scope of the appended claims.

FIG. 1 is a schematic flowchart of a voice synthesis method according to an embodiment of the present invention, where the method includes:

S11: When voice synthesis is required, the available sound library list is queried from the server, and the available sound library list includes information of a plurality of available sound banks, and the available sound library includes a featured sound library.

Different from the prior art that the sound library is included in the APP, in this embodiment, it is not necessary to include the sound bank in the APP, but to download from the server when the sound bank is needed.

For example, a software development kit (SDK) corresponding to the APP on the client sends a query request to the server, and the query request is used to request a list of available sound banks, and the server obtains a list of available sound banks after receiving the query request. And send the obtained list of available sound banks to the SDK.

The available sound banks in this embodiment include a featured sound bank. Of course, it can be understood that the available sound banks can also include existing common sound banks.

Among them, the featured sound library is pre-generated, a sound library for satisfying individual needs, a sound library different from the ordinary sound library, such as a children's sound library, or a user-defined sound library.

S12: Acquire a sound bank selected by the user according to the available sound library list, and download the user selected sound library from the server.

After the SDK obtains the list of available sound banks from the server, the list of available sound banks can be displayed to the user. When displaying, the information of each available sound library can be specifically displayed, for example, the generator information of the available sound library, the generation time, and the suitable The version of the offline language synthesis engine, the field to which the text to be synthesized belongs, the male or female voice or other characteristic sounds, sound quality, etc., thereby facilitating the user's choice.

The user can select information for one or more available sound banks based on the information presented.

After the user selects the information of the available sound bank, the SDK can determine the available sound bank selected by the corresponding user, and download the available sound bank selected by the user from the server. For example, the information of the available sound library further includes link information. After the user selects the information of the available sound library, the corresponding available sound bank can be downloaded according to the link information in the information.

S13: Synthesize the text into a voice using the downloaded sound bank.

After the SDK downloads the sound library from the server, the sound library can be used to implement speech synthesis.

In this embodiment, by downloading the sound bank from the server during speech synthesis, instead of directly including the sound bank in the APP, the volume of the APP can be reduced. In addition, the server can be included in the manner of including the sound bank in the APP. By storing more sound banks, you can provide users with more choices by downloading the sound library on the server. By including the featured sound library in the available sound library, you can meet the individual needs of users and enhance the user experience.

2 is a schematic flowchart of a method for synthesizing speech according to another embodiment of the present invention. This embodiment provides an example for providing a user to select a special sound library, and the method includes:

S21: The server creates a featured sound library and corresponding characteristic sound library information, and stores the featured sound library and featured sounds. Library information.

Among them, create a special sound library, which can include:

Establishing a characteristic acoustic model and acquiring an acoustic segment, and the characteristic acoustic model and the acoustic segment constitute a characteristic sound bank; or

Establishing a characteristic acoustic model, and the characteristic acoustic model is composed of the characteristic acoustic library; or

Obtaining sound data corresponding to a specific text, and the specific text and the sound data constitute a characteristic sound bank; or

Establishing a characteristic acoustic model, acquiring an acoustic segment, and acquiring sound data corresponding to the specific text, the characteristic acoustic model, the acoustic segment, and the specific text and the sound data composing a characteristic sound bank; or

A characteristic acoustic model is established to acquire sound data corresponding to a specific text, and the characteristic acoustic model, and the specific text and the sound data constitute a characteristic sound bank.

In some embodiments, creating a characteristic acoustic model can include:

Obtaining characteristic sound data, and training the characteristic sound data to establish a characteristic acoustic model; or

Obtaining an existing acoustic model and characteristic sound data, and adaptively training the existing acoustic model according to the characteristic sound data to establish a characteristic acoustic model.

Among them, the sample size required to directly train the characteristic sound data to obtain the characteristic acoustic model is larger than the sample size required for the adaptive training of the existing acoustic model.

For example, recording/collecting sound data of a certain size and specific timbre, and performing artificial or automatic prosody labeling and boundary labeling, and training to obtain a characteristic acoustic model. Alternatively, using existing acoustic models, recording/collecting a small amount of sound data of a specific timbre, and updating the existing acoustic model to a characteristic acoustic model through adaptive model training techniques.

In some embodiments, obtaining an acoustic segment can include:

The training samples are segmented to obtain an acoustic segment.

For example, recording/collecting sound data of a certain size and specific timbre, and performing manual or automatic rhythm annotation and boundary labeling, and segmenting the acoustic segments.

In some embodiments, acquiring sound data corresponding to a specific text may include:

Select the specific text you want to recite;

Obtaining a recited voice of a particular speaker to the particular text;

The recited voice or the voice compressed by the recited voice is used as sound data corresponding to the specific text.

For example, ask a specific speaker to recite specific texts emotionally, obtain corresponding sound data, and customize the sound data.

Optionally, in order to save space, the acquired voice can be compressed, and the compressed voice is compressed. As the sound data that is finally saved in the featured sound bank.

In addition, different sound data can be obtained by reading different or different specific texts by different speakers, and then a plurality of specific texts can be stored correspondingly with the sound data to form a customized library.

The featured sound library information refers to relevant information generated for the featured sound library, for example, generator information, generation time, a version of a suitable offline language synthesis engine, a suitable field to which the text to be synthesized belongs, a male or female voice, or other distinctive sounds, Sound quality and so on.

After creating the featured library and featured library information, it can be stored. For example, referring to FIG. 3, after the creation module (represented by the manage console in FIG. 3) 31 creates a special sound library, the characteristic sound library can be stored in a storage module for storing the featured sound library (indicated by BOS cloud storage in FIG. 3). In the 32, after the characteristic sound library information is created, the characteristic sound library information is stored in a storage module (represented by mysql cluster in FIG. 3) 33 for storing the characteristic sound library information. In addition, the featured sound library information can be provided to the user as a query result in the subsequent process, and each query result can be used as one of the available sound library information in the available sound bank list.

S22: The SDK sends a query request to the server.

The SDK may send the query request when voice synthesis is required. For example, after the user opens the SDK and clicks a button for triggering voice synthesis, the SDK sends a query request to the server.

Referring to FIG. 3, the query request sent by the SDK 34 can be sent to the ingress node of the server (represented by the physical room in FIG. 3) 35.

S23: The server obtains the query result according to the query request.

The query request may include a query condition, for example, a version, a domain, a featured voice, and the like of the voice synthesis engine. After receiving the query request, the server obtains the query result that satisfies the query condition.

In order to cope with the explosive query request that may appear on the SDK side, the query result can be cached. See Figure 3 for an example of storing query results to the memcached cluster 36.

Therefore, when the server receives the query request, it can first query in the memcached cluster. If the query result that satisfies the query condition can be found, the query result can be obtained directly from the memcached cluster. Or, if you can't find the query result that meets the query condition in the memcached cluster, you can query it in the mysql cluster. When there are query results in the mysql cluster that satisfy the query condition, the query result is obtained from the mysql cluster, and The query results obtained from the mysql cluster are cached into the memcached cluster so that the query results can be obtained directly from the memcached cluster.

S24: The server obtains a list of available sound banks according to the query result.

For example, referring to FIG. 3, the physical equipment room obtains the query result from the memcached cluster. In addition, the storage address of the available sound library as the link information can be obtained from the BOS cloud storage, and the available sound library information can be used for each available sound library. Including: query results (such as version, domain, featured sounds, etc. for speech synthesis engine) and link information, The list of available sound banks can then be composed of information from multiple available sound banks.

S25: The server sends a list of available sound banks to the SDK.

S26: The SDK acquires a sound bank selected by the user according to the available sound bank list, and downloads the sound bank selected by the user from the server.

For example, after the SDK obtains the list of available sound banks, the list is displayed to the user, and the user can select an available sound bank according to the displayed information.

In addition, referring to FIG. 3, after storing the featured sound library, the storage module storing the featured sound library can send the storage address of the featured sound bank as the link information to the portal node of the server, and then the entry node will obtain the characteristic sound from mysql. The library information and the link information obtained from the storage module are used as information of the available sound bank, and are composed of a plurality of available sound library information to be sent to the SDK.

Wherein, when the storage module sends the link information to the ingress node, the identifier and the link information of the featured sound library may be correspondingly sent. When the characteristic sound bank information is stored in the mysql, the identifier of the featured sound bank is stored correspondingly, and the entry node is in the slave node. When obtaining the characteristic sound bank information in mysql, the corresponding identifier and information of the characteristic sound library are obtained, and then the information obtained from mysql can be associated with the link information obtained from the storage module according to the identifier of the featured sound library.

After the user selects the information of the available sound bank, the SDK downloads the selected sound bank from the server according to the link information included in the information selected by the user.

S27: The SDK uses the downloaded sound library to synthesize the text into speech.

After the SDK obtains the sound library, the sound library can be used to synthesize the text into speech to realize speech synthesis.

In speech synthesis, speech synthesis can be performed according to the information in the downloaded featured sound library and different speech synthesis methods.

Optionally, the using the downloaded sound library to synthesize the text into a voice includes:

When an acoustic model and an acoustic segment are included in the sound library, the text is processed, acoustic parameters are acquired according to the processed text and the acoustic model, and corresponding acoustic segments are acquired according to the acoustic parameters, and Acoustic segments are spliced and combined to obtain synthesized speech; or,

When the acoustic model is included in the sound library, the text is processed, the acoustic parameters are acquired according to the processed text and the acoustic model, and the vocoder parameters are synthesized according to the acoustic parameters to obtain the synthesized speech; or

When the sound library includes an acoustic model, specific text, and corresponding sound data, the text is preprocessed, and when the specific text corresponding to the preprocessed text exists in the sound library, the specific text is acquired Corresponding sound data, the sound data or the sound data obtained by decompressing the sound data is used as a synthesized voice.

Take the sound library as a special sound library as an example. The specific content can be as follows:

In some embodiments, referring to FIG. 4, the flow of speech synthesis may include:

S41: Text preprocessing of the text to be synthesized.

S42: Perform text analysis on the preprocessed text.

S43: Prediction of text prosody after text analysis.

For details of S41-S43, refer to related processes of existing speech synthesis.

S44: Acquire acoustic parameters according to the prosperously predicted text and the characteristic acoustic model in the featured sound library to generate acoustic parameters.

Different from the prior art, the characteristic acoustic model adopted in this embodiment is not an existing acoustic model, and after determining the acoustic model, the flow of generating the acoustic parameters can be referred to the existing manner.

S45: Acquire corresponding acoustic segments in the featured sound library according to the acoustic parameters, perform stitching and synthesis on the acquired acoustic segments, and obtain synthesized speech corresponding to the text to be synthesized.

Wherein, when the acoustic segment is created, corresponding acoustic parameters can also be created, and then the acoustic parameters are stored corresponding to the acoustic segments in the featured sound bank, so that the corresponding acoustic segments can be found according to the acoustic parameters during speech synthesis.

After the acoustic segments are acquired, the segments can be spliced to obtain a text-corresponding speech for speech synthesis.

In some embodiments, referring to FIG. 5, the flow of speech synthesis may include:

S51: Perform text preprocessing on the synthesized text.

S52: Perform text analysis on the preprocessed text.

S53: Prediction of text prosody after text analysis.

For details of S51-S53, refer to related processes of existing speech synthesis.

S54: Acquire acoustic parameters according to the prosperous predicted text and the characteristic acoustic model in the featured sound library to generate acoustic parameters.

S55: Synthesizing the vocoder parameters according to the acoustic parameters, and acquiring the synthesized speech corresponding to the text to be synthesized.

Among them, the vocoder is a device capable of generating sound according to acoustic parameters, so that the device can output synthesized speech.

In some embodiments, referring to FIG. 6, the process of speech synthesis may include:

S61: Text preprocessing of the text to be synthesized.

S62: Determine whether there is sound data corresponding to the text to be synthesized in the featured sound bank, and if yes, execute S63; otherwise, execute S64.

When the specific text and the corresponding sound data are saved in the featured sound library, it is possible to determine whether the specific text in the characteristic sound library is consistent with the text to be synthesized is determined by the search method.

It can be understood that, since different speakers may adopt different recitation methods for the same text content, the specific text and the corresponding sound data may be completely consistent or consistent within the error range. For example, corresponding to the specific text "Front traffic lights, please pay attention to the traffic rules", different people may have different play, the sound recorded in a certain sound library may correspond to "Be careful, immediately traffic lights, red light will be fine !" The sound of such content.

S63: Acquire corresponding sound data.

For example, when there is a specific text in the featured sound library that is consistent with the text to be synthesized, the sound data corresponding to the specific text can be acquired.

The sound data can be used as the synthesized speech to be synthesized later after the sound data is acquired. Alternatively, if the sound data corresponding to the specific text is stored after the compression processing in the featured sound library, after the corresponding sound data is acquired in the featured sound library, the obtained sound data may be decompressed and decompressed. The processed sound data is used as a synthesized voice.

S64: Perform text analysis on the preprocessed text.

S65: Text prosody prediction after text analysis.

For details of S61, S64, and S65, refer to related processes of existing speech synthesis.

S66: Acquire acoustic parameters according to the prosperous predicted text and the characteristic acoustic model in the featured sound library to generate acoustic parameters.

S67: Perform vocoder parameter synthesis according to the acoustic parameters, and obtain synthesized speech corresponding to the text to be synthesized.

In some embodiments, referring to FIG. 7, the flow of speech synthesis may include:

S71: Text preprocessing of the text to be synthesized.

S72: Determine whether there is sound data corresponding to the text to be synthesized in the featured sound library, and if yes, execute S73; otherwise, execute S74.

It can be understood that, since different speakers may adopt different recitation methods for the same text content, the specific text and the corresponding sound data may be completely consistent or consistent within the error range. For example, corresponding to the specific text of “Front traffic lights, please pay attention to the traffic rules”, different people may have different play, the sound recorded in a certain sound library may correspond to “Be careful, immediately traffic lights, red light will be fined Money! "The sound of such content.

S73: Acquire corresponding sound data.

S74: Perform text analysis on the preprocessed text.

S75: Text prosody prediction after text analysis.

For details of S71, S74, and S75, refer to related processes of existing speech synthesis.

S76: Acquire acoustic parameters according to the prosperous predicted text and the characteristic acoustic model in the featured sound library to generate acoustic parameters.

S77: Acquire corresponding acoustic segments in the featured sound library according to the acoustic parameters, perform stitching and combining on the acquired acoustic segments, and obtain synthesized speech corresponding to the text to be synthesized.

In this embodiment, by downloading the sound bank from the server during speech synthesis, instead of directly including the sound bank in the APP, the volume of the APP can be reduced. In addition, the server can be included in the manner of including the sound bank in the APP. By storing more sound banks, you can provide users with more choices by downloading the sound library on the server. By including the featured sound library in the available sound library, you can meet the individual needs of users and enhance the user experience. By creating distinctive sound banks in different ways and synthesizing speech based on featured sound banks in different ways, it can meet the needs of different scenes and achieve diversification.

FIG. 8 is a schematic structural diagram of a voice synthesizing system according to another embodiment of the present invention. The system includes: a client device 81, and the client device 81 includes:

The query module 811 is configured to query, from the server, a list of available sound banks when the voice synthesis is required, where the available sound library list includes information of a plurality of available sound banks, and the available sound library includes a featured sound library;

Different from the prior art that the sound library is included in the APP, in this embodiment, it is not necessary to include the sound in the APP. Library, but download from the server when a library is needed.

The obtaining module 812 is configured to obtain a sound bank selected by the user according to the available sound library list, and download a sound bank selected by the user from the server;

After the user selects the information of the available sound bank, the SDK can determine the available sound bank selected by the corresponding user, and download the available sound bank selected by the user from the server according to the selected information. For example, the information of the available sound library further includes link information. After the user selects the information of the available sound library, the corresponding available sound library can be downloaded according to the link information in the selected information.

The synthesizing module 813 is configured to synthesize the text into a voice by using the downloaded sound bank.

In some embodiments, referring to FIG. 9, the system further includes: a server device 82, the server device includes: a creating module 821 for creating a featured sound library, and the creating module 821 is specifically configured to:

In some embodiments, the creation module 821 is used to create a characteristic acoustic model, including:

In some embodiments, obtaining an acoustic segment can include:

The training samples are segmented to obtain an acoustic segment.

In some embodiments, the creating module 821 is configured to acquire sound data corresponding to a specific text, including:

Select the specific text you want to recite;

Obtaining a recited voice of a particular speaker to the particular text;

Optionally, in order to save space, the acquired recited voice may be compressed, and the compressed voice is used as the sound data finally stored in the featured sound bank.

In some embodiments, referring to FIG. 9, the system further includes: a first cluster system 822 and a second cluster system 823 at the server end, where the query module 811 is specifically configured to:

Sending a query request to the server, where the query request includes a query condition, so that the server obtains the query result according to the query condition, wherein when the query result exists in the first cluster system, the first Obtaining the query result in the cluster system, or acquiring the query result from the second cluster system when the query result does not exist in the first cluster system, and buffering the obtained query result to the first In a cluster system;

Receiving a list of available sound banks sent by the server, where the available sound library list is obtained by the server according to the query result.

The query request may include a query condition, for example, a version, a domain, and a feature of the speech synthesis engine. After the server receives the query request, it obtains the query result that satisfies the query condition.

The first cluster system here corresponds to the memcached cluster in the method embodiment.

For example, referring to FIG. 3, the physical equipment room obtains the query result from the memcached cluster. In addition, the storage address of the available sound library as the link information can be obtained from the BOS cloud storage, and the available sound library information can be used for each available sound library. These include: query results (such as versions, fields, featured sounds, etc. for speech synthesis engines) and link information, after which the list of available sound banks can be composed of information from multiple available sound banks.

In some embodiments, referring to FIG. 9, the information of the available sound library includes: corresponding generated information after creating an available sound library, the system further comprising: a second cluster system 823 located at the server, the second cluster system 823 is configured to store information corresponding to the generated generated sound library. After the available sound library is created, the corresponding generated information corresponds to the characteristic sound library information in the method embodiment, and the featured sound database information can be provided to the user as a query result in the subsequent process, and each query result can be used as each available sound in the available sound library list. A kind of information in the library information.

In some embodiments, referring to FIG. 9, the information of the available sound library includes: link information of the available sound library, the system further includes: a storage module 824 located at the server, the storage module 824 is configured to store the generated available sound Library, and the storage address of the available sound bank as the link information.

The information generated correspondingly after the creation of the available sound bank corresponds to the characteristic sound library information in the above embodiment, and the featured sound library information refers to related information generated for the featured sound library, for example, the generator information, the generation time, and the suitable The version of the offline language synthesis engine, the field to which the text to be synthesized belongs, the male or female voice or other distinctive sounds, sound quality, and the like.

Therefore, the second cluster system here corresponds to the mysql cluster in the method embodiment. Storage module pair here The BOS cloud storage in the method embodiment.

After the user selects the information of the available sound bank, the SDK downloads the selected sound bank from the server according to the link information in the information selected by the user.

In some embodiments, the synthesizing module 813 is specifically configured to:

For details of the content of the speech synthesis, refer to FIG. 4-7, and details are not described herein again.

An embodiment of the present invention further provides an electronic device, including: one or more processors; a memory; one or more programs, the one or more programs being stored in the memory when the one or more When the processor executes:

When voice synthesis is required, the available sound library list is queried from the server, and the available sound library list includes information of a plurality of available sound banks, and the available sound library includes a featured sound library;

Obtaining a sound bank selected by the user according to the available sound library list, and downloading a sound bank selected by the user from the server;

The text is synthesized into speech using the downloaded sound bank.

Embodiments of the present invention also provide a non-volatile computer storage medium storing one or more modules when the one or more modules are executed:

The text is synthesized into speech using the downloaded sound bank.

It should be noted that in the description of the present invention, the terms "first", "second" and the like are used for descriptive purposes only, and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise stated.

Any process or method description in the flowcharts or otherwise described herein may be understood to represent a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a particular logical function or process. And the scope of the preferred embodiments of the invention includes additional implementations, in which the functions may be performed in a substantially simultaneous manner or in an opposite order depending on the functions involved, in the order shown or discussed. It will be understood by those skilled in the art to which the embodiments of the present invention pertain.

It should be understood that portions of the invention may be implemented in hardware, software, firmware or a combination thereof. In the above-described embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented by any one or combination of the following techniques well known in the art: having logic gates for implementing logic functions on data signals. Discrete logic circuits, application specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), etc.

One of ordinary skill in the art can understand that all or part of the steps carried by the method of implementing the above embodiments can be completed by a program to instruct related hardware, and the program can be stored in a computer readable storage medium. When executed, one or a combination of the steps of the method embodiments is included.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module. The above integrated modules can be implemented in the form of hardware or in the form of software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.

The above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

In the description of the present specification, the description with reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" and the like means a specific feature described in connection with the embodiment or example. , structure, material or characteristics It is included in at least one embodiment or example of the invention. In the present specification, the schematic representation of the above terms does not necessarily mean the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although the embodiments of the present invention have been shown and described, it is understood that the above-described embodiments are illustrative and are not to be construed as limiting the scope of the invention. The embodiments are subject to variations, modifications, substitutions and variations.

Claims

A speech synthesis method, comprising:

When voice synthesis is required, the available sound library list is queried from the server, and the available sound library list includes information of a plurality of available sound banks, and the available sound library includes a featured sound library;

Obtaining a sound bank selected by the user according to the available sound library list, and downloading a sound bank selected by the user from the server;

The text is synthesized into speech using the downloaded sound bank.
The method of claim 1 further comprising: creating a featured sound bank, said creating a featured sound bank comprising:

Establishing a characteristic acoustic model and acquiring an acoustic segment, and the characteristic acoustic model and the acoustic segment constitute a characteristic sound bank; or

Establishing a characteristic acoustic model, and the characteristic acoustic model is composed of the characteristic acoustic library; or

Obtaining sound data corresponding to a specific text, and the specific text and the sound data constitute a characteristic sound bank; or

Establishing a characteristic acoustic model, acquiring an acoustic segment, and acquiring sound data corresponding to the specific text, the characteristic acoustic model, the acoustic segment, and the specific text and the sound data composing a characteristic sound bank; or

A characteristic acoustic model is established to acquire sound data corresponding to a specific text, and the characteristic acoustic model, and the specific text and the sound data constitute a characteristic sound bank.
The method of claim 2 wherein said establishing a characteristic acoustic model comprises:

Obtaining characteristic sound data, and training the characteristic sound data to establish a characteristic acoustic model; or

Obtaining an existing acoustic model and characteristic sound data, and adaptively training the existing acoustic model according to the characteristic sound data to establish a characteristic acoustic model.
The method according to claim 2 or 3, wherein the acquiring the sound data corresponding to the specific text comprises:

Select the specific text you want to recite;

Obtaining a recited voice of a particular speaker to the particular text;

The recited voice or the voice compressed by the recited voice is used as sound data corresponding to the specific text.
The method according to any one of claims 1-4, wherein the querying the list of available sound banks from the server comprises:

Sending a query request to the server, where the query request includes a query condition, so that the server obtains the query result according to the query condition, wherein when the query result exists in the first cluster system, the first Cluster system Obtaining the query result in the system, or, when the query result does not exist in the first cluster system, obtaining the query result from the second cluster system, and buffering the obtained query result to the first In a cluster system;

Receiving a list of available sound banks sent by the server, where the available sound library list is obtained by the server according to the query result.
The method according to any one of claims 1 to 5, wherein the information of the available sound library comprises: correspondingly generated information after creating an available sound bank, the information being stored in a second cluster system of the server. in.
The method according to any one of claims 1-6, wherein the information of the available sound library comprises: link information of the available sound library, and the downloading the sound bank selected by the user from the server comprises:

And downloading, according to the link information, a corresponding sound bank from the server, wherein the link information is a storage address after storing the available sound bank.
The method according to any one of claims 1 to 7, wherein the synthesizing the text into a voice using the downloaded sound library comprises:

When an acoustic model and an acoustic segment are included in the sound library, the text is processed, acoustic parameters are acquired according to the processed text and the acoustic model, and corresponding acoustic segments are acquired according to the acoustic parameters, and Acoustic segments are spliced and combined to obtain synthesized speech; or,

When the acoustic model is included in the sound library, the text is processed, the acoustic parameters are acquired according to the processed text and the acoustic model, and the vocoder parameters are synthesized according to the acoustic parameters to obtain the synthesized speech; or

When the sound library includes an acoustic model, specific text, and corresponding sound data, the text is preprocessed, and when the specific text corresponding to the preprocessed text exists in the sound library, the specific text is acquired Corresponding sound data, the sound data or the sound data obtained by decompressing the sound data is used as a synthesized voice.
A speech synthesis system, comprising: a client device, the client device comprising:

a query module, configured to query, from the server, a list of available sound banks, where the available sound library list includes information of a plurality of available sound banks, where the available sound library includes a featured sound library;

An obtaining module, configured to acquire a sound bank selected by the user according to the available sound library list, and download a sound bank selected by the user from the server;

A synthesis module that synthesizes text into speech using a downloaded sound bank.
The system of claim 9, further comprising: a server device, the server device comprising a creation module for creating a featured sound library, the creation module being specifically configured to:

Establishing a characteristic acoustic model and acquiring an acoustic segment, and the characteristic acoustic model and the acoustic segment constitute a characteristic sound bank; or

Establishing a characteristic acoustic model, and the characteristic acoustic model is composed of the characteristic acoustic library; or

Acquiring sound data corresponding to a specific text, and the specific text and the sound data constitute a characteristic sound bank; or By,

Establishing a characteristic acoustic model, acquiring an acoustic segment, and acquiring sound data corresponding to the specific text, the characteristic acoustic model, the acoustic segment, and the specific text and the sound data composing a characteristic sound bank; or

A characteristic acoustic model is established to acquire sound data corresponding to a specific text, and the characteristic acoustic model, and the specific text and the sound data constitute a characteristic sound bank.
The system of claim 10 wherein said creating module is operative to create a characteristic acoustic model comprising:

Obtaining characteristic sound data, and training the characteristic sound data to establish a characteristic acoustic model; or

Obtaining an existing acoustic model and characteristic sound data, and adaptively training the existing acoustic model according to the characteristic sound data to establish a characteristic acoustic model.
The system according to claim 10 or 11, wherein the creating module is configured to acquire sound data corresponding to the specific text, including:

Select the specific text you want to recite;

Obtaining a recited voice of a particular speaker to the particular text;

The recited voice or the voice compressed by the recited voice is used as sound data corresponding to the specific text.
The system according to any one of claims 9 to 12, further comprising: a first cluster system and a second cluster system at the server end, wherein the query module is specifically configured to:

Sending a query request to the server, where the query request includes a query condition, so that the server obtains the query result according to the query condition, wherein when the query result exists in the first cluster system, the first Obtaining the query result in the cluster system, or acquiring the query result from the second cluster system when the query result does not exist in the first cluster system, and buffering the obtained query result to the first In a cluster system;

Receiving a list of available sound banks sent by the server, where the available sound library list is obtained by the server according to the query result.
The system according to any one of claims 9 to 13, wherein the information of the available sound library comprises: corresponding information generated after creating an available sound library, the system further comprising: a second cluster located at the server end The second cluster system is configured to store the information corresponding to the generated sound library.
The system according to any one of claims 9 to 14, wherein the information of the available sound library comprises: link information of the available sound library, the system further comprising: a storage module located at the server, the storage module Used to store the generated available sound bank and use the storage address of the available sound bank as the link information.
The system according to any one of claims 9 to 15, wherein the synthesis module is specifically configured to:

When the acoustic library and the acoustic segment are included in the sound library, the text is processed according to the processed text and the Obtaining an acoustic parameter, acquiring a corresponding acoustic segment according to the acoustic parameter, and splicing and synthesizing the acquired acoustic segment to obtain a synthesized speech; or

When the acoustic model is included in the sound library, the text is processed, the acoustic parameters are acquired according to the processed text and the acoustic model, and the vocoder parameters are synthesized according to the acoustic parameters to obtain the synthesized speech; or

When the sound library includes an acoustic model, specific text, and corresponding sound data, the text is preprocessed, and when the specific text corresponding to the preprocessed text exists in the sound library, the specific text is acquired Corresponding sound data, the sound data or the sound data obtained by decompressing the sound data is used as a synthesized voice.
An electronic device, comprising:

One or more processors;

Memory

One or more programs, the one or more programs being stored in the memory, when executed by the one or more processors:

Performing the method of any of claims 1-8.
A non-volatile computer storage medium characterized in that the computer storage medium stores one or more modules when the one or more modules are executed:

Performing the method of any of claims 1-8.