CN104992703A - Speech synthesis method and system

Info

Publication number
CN104992703A
CN104992703A
Authority
CN
China
Prior art keywords
sound
characteristic
library
text
sound library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510441079.6A
Other languages
Chinese (zh)
Other versions
CN104992703B (en)
Inventor
李秀林
白洁
李维高
唐海员
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510441079.6A
Publication of CN104992703A
Priority to PCT/CN2015/097162 (published as WO2017016135A1)
Application granted
Publication of CN104992703B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a speech synthesis method and system. The speech synthesis method comprises the following steps: when speech synthesis is needed, querying an available sound library list from a server, where the available sound library list comprises information of a plurality of available sound libraries, and the available sound libraries include a characteristic sound library; acquiring the sound library selected by a user according to the available sound library list, and downloading the selected sound library from the server; and synthesizing text into speech using the downloaded sound library. The method reduces the size of an offline speech synthesis APP, provides more choices for the user, and enables personalized speech synthesis.

Description

Speech synthesis method and system
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech synthesis method and system.
Background
In the prior art, when a user downloads an offline speech synthesis application (APP), the APP includes one or two sound libraries; when the user uses the APP, the user selects one sound library, and the APP then uses the selected sound library to perform Text-To-Speech (TTS) synthesis on the text to be played.
However, in the prior-art scheme, on the one hand, the APP bundles the sound library, and since sound library files are generally large, the APP itself is large; on the other hand, the types of sound libraries included in the APP are limited, so the user's choice is limited.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a speech synthesis method, which can reduce the size of an offline speech synthesis APP and can provide more choices for the user, enabling personalized speech synthesis.
Another object of the present invention is to provide a speech synthesis system.
In order to achieve the above object, an embodiment of the first aspect of the present invention provides a speech synthesis method, including: when speech synthesis is needed, querying an available sound library list from a server, wherein the available sound library list comprises information of a plurality of available sound libraries, and the available sound libraries comprise a characteristic sound library; acquiring the sound library selected by a user according to the available sound library list, and downloading the selected sound library from the server; and synthesizing the text into speech by adopting the downloaded sound library.
In the speech synthesis method provided in the embodiment of the first aspect of the present invention, the size of the APP can be reduced by downloading the sound library from the server during speech synthesis instead of bundling it directly in the APP; in addition, compared with bundling the sound library in the APP, more sound libraries can be stored on the server.
In order to achieve the above object, a speech synthesis system according to an embodiment of the second aspect of the present invention includes a client device, the client device comprising: a query module, used for querying an available sound library list from a server when speech synthesis is needed, wherein the available sound library list comprises information of a plurality of available sound libraries, and the available sound libraries comprise a characteristic sound library; an acquisition module, used for acquiring the sound library selected by the user according to the available sound library list and downloading the selected sound library from the server; and a synthesis module, used for synthesizing the text into speech by adopting the downloaded sound library.
In the speech synthesis system provided in the embodiment of the second aspect of the present invention, the size of the APP can be reduced by downloading the sound library from the server during speech synthesis instead of bundling it directly in the APP; in addition, compared with bundling the sound library in the APP, more sound libraries can be stored on the server.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for speech synthesis according to another embodiment of the present invention;
FIG. 3 is a diagram illustrating an exemplary embodiment of a speech synthesis system according to the present invention;
FIG. 4 is a flow chart of speech synthesis according to a specific example in the embodiment of the present invention;
FIG. 5 is a flow diagram illustrating speech synthesis according to another specific example of an embodiment of the present invention;
FIG. 6 is a flow chart illustrating speech synthesis according to another specific example of the embodiment of the present invention;
FIG. 7 is a flow chart illustrating speech synthesis according to another specific example of the embodiment of the present invention;
FIG. 8 is a schematic diagram of a speech synthesis system according to another embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar modules, or to modules having the same or similar functionality, throughout. The embodiments described below with reference to the accompanying drawings are illustrative only, for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications, and equivalents coming within the spirit and scope of the appended claims.
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention, where the method includes:
S11: When speech synthesis is needed, inquiring an available sound library list from a server, wherein the available sound library list comprises information of a plurality of available sound libraries, and the available sound libraries comprise characteristic sound libraries.
Unlike the prior art in which the sound library is directly contained in the APP, in the present embodiment, the sound library does not need to be contained in the APP, but is downloaded from the server when the sound library is needed.
For example, a Software Development Kit (SDK) corresponding to an APP on a client sends a query request to a server, where the query request is used to request an available sound library list, and the server obtains the available sound library list after receiving the query request and sends the obtained available sound library list to the SDK.
The available sound libraries in this embodiment include a characteristic sound library; it is understood that the available sound libraries may also include existing general sound libraries.
The characteristic sound library is generated in advance and is used to meet personalized requirements; it differs from a common sound library and may be, for example, a children's sound library or a user-customized sound library.
S12: and acquiring the sound library selected by the user according to the available sound library list, and downloading the sound library selected by the user from the server.
After the SDK obtains the available sound library list from the server, it can display the list to the user. During display, the information of each available sound library can be shown in detail, for example, the creator of the sound library, the generation time, the version of the offline speech synthesis engine it suits, the domain of the texts it suits, whether it offers a male voice, a female voice, or another characteristic voice, the sound quality, and so on, so that the user can conveniently make a selection.
The user can select information of one or more available sound libraries according to the displayed information.
After the user selects the information of an available sound library, the SDK can determine the corresponding available sound library and download it from the server. For example, the information of each available sound library also includes link information, so after the user makes a selection, the corresponding sound library can be downloaded according to the link information it contains.
S13: And synthesizing the text into speech by adopting the downloaded sound library.
After the SDK downloads the sound library from the server, the sound library can be adopted to realize voice synthesis.
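As a rough illustration of this client-side flow (S11-S13), the Python sketch below mimics the query-select-download-synthesize sequence. The server URL, endpoint, and response fields are hypothetical stand-ins; the patent does not specify a concrete API.

    import json
    import urllib.request

    SERVER = "https://tts.example.com"  # hypothetical server address

    def query_available_libraries():
        """S11: ask the server for the list of available sound libraries."""
        with urllib.request.urlopen(f"{SERVER}/sound-libraries") as resp:
            # e.g. [{"name": "...", "link": "...", "engine_version": "..."}, ...]
            return json.load(resp)

    def download_library(entry, dest="library.bin"):
        """S12: download the user-selected library via its link information."""
        urllib.request.urlretrieve(entry["link"], dest)
        return dest

    # S13 would then hand the downloaded file to the offline engine, e.g.:
    # chosen = query_available_libraries()[0]   # the user picks from the displayed list
    # path = download_library(chosen)
    # audio = engine.synthesize("hello", path)  # engine is engine-specific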
In this embodiment, the sound library is downloaded from the server during speech synthesis instead of being bundled directly in the APP, so the size of the APP can be reduced. In addition, compared with bundling sound libraries in the APP, more sound libraries can be stored on the server, so downloading from the server gives the user more choices; and by including characteristic sound libraries among the available sound libraries, the user's personalized requirements can be met and the user experience improved.
Fig. 2 is a flowchart illustrating a speech synthesis method according to another embodiment of the present invention, in which the method includes:
S21: And the server creates a characteristic sound library and corresponding characteristic sound library information, and stores them.
Wherein, creating the characteristic sound library may include:
establishing a characteristic acoustic model and obtaining acoustic fragments, and forming a characteristic sound library by the characteristic acoustic model and the acoustic fragments; or,
establishing a characteristic acoustic model, and forming a characteristic sound library by the characteristic acoustic model; or,
acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the specific text and the sound data; or,
establishing a characteristic acoustic model, acquiring acoustic fragments, acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the characteristic acoustic model, the acoustic fragments and the specific text and the sound data; or,
establishing a characteristic acoustic model, acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the specific text and the sound data.
In some embodiments, establishing the characteristic acoustic model may include:
acquiring characteristic sound data, training the characteristic sound data, and establishing a characteristic acoustic model; or,
the method comprises the steps of obtaining an existing acoustic model and characteristic sound data, conducting self-adaptive training on the existing acoustic model according to the characteristic sound data, and building the characteristic acoustic model.
The sample size required when the characteristic acoustic model is obtained by directly training the characteristic sound data is larger than that required when the existing acoustic model is subjected to self-adaptive training.
For example, sound data of a certain scale in a specific timbre is recorded or collected, prosody labeling and boundary labeling are performed manually or automatically, and the characteristic acoustic model is obtained by training. Alternatively, starting from an existing acoustic model, a small amount of sound data of the specific timbre is recorded or collected, and the existing acoustic model is updated into the characteristic acoustic model using an adaptive model training technique.
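The two routes above can be summarized as a branch on whether an existing model is available. In the sketch below the training functions are mere placeholders (a real system would call an acoustic-model toolkit); only the branching logic reflects the text.

    from typing import Optional

    def train_from_scratch(labeled_voice_data: list) -> dict:
        # Placeholder for real training on prosody- and boundary-labeled
        # recordings of the target voice.
        return {"kind": "characteristic", "trained_on": len(labeled_voice_data)}

    def adapt_existing(base_model: dict, small_voice_data: list) -> dict:
        # Placeholder for adaptive training that nudges an existing model
        # toward the target timbre using a small sample.
        return {**base_model, "kind": "characteristic", "adapted_with": len(small_voice_data)}

    def build_characteristic_model(voice_data: list, base_model: Optional[dict] = None) -> dict:
        if base_model is None:
            return train_from_scratch(voice_data)      # direct route: larger corpus needed
        return adapt_existing(base_model, voice_data)  # adaptive route: small corpus suffices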
In some embodiments, acquiring the acoustic segments may include:
and segmenting the training sample to obtain the acoustic fragments.
For example, sound data of a certain scale in a specific timbre is recorded or collected, prosody labeling and boundary labeling are performed manually or automatically, and the data is segmented to obtain the acoustic segments.
In some embodiments, obtaining sound data corresponding to a particular text may include:
selecting a specific text to be recited;
acquiring recited voice of a specific speaker to the specific text;
and taking the recited voice or the voice obtained by compressing the recited voice as the voice data corresponding to the specific text.
For example, a specific speaker is asked to read a specific text with emotion to acquire corresponding voice data, thereby customizing the voice data.
Optionally, in order to save space, the acquired recited speech may be compressed, and the compressed speech is used as the sound data finally stored in the characteristic sound library.
In addition, different speakers can recite the same or different specific texts to acquire different sound data; a plurality of specific texts and their corresponding sound data can then be stored to form a customized library.
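A minimal sketch of such a customized library, assuming each recording is available as a raw byte string; zlib merely stands in for whatever audio codec an implementation would actually use.

    import zlib

    def build_custom_library(recordings, compress=True):
        """Map each specific text to its (optionally compressed) recited speech."""
        return {
            text: zlib.compress(audio) if compress else audio
            for text, audio in recordings.items()
        }

    # Entries from different speakers, for the same or different texts,
    # are simply further text -> audio pairs.
    lib = build_custom_library(
        {"Traffic light ahead, please observe the traffic regulations": b"<pcm bytes>"}
    )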
The characteristic sound library information refers to the relevant information generated for the characteristic sound library, such as creator information, generation time, the version of the offline speech synthesis engine it suits, the domain of the texts it suits, whether it is a male voice, a female voice, or another characteristic voice, the sound quality, and the like.
After the characteristic sound library and the characteristic sound library information are created, they can be stored. For example, referring to fig. 3, after the creation module (denoted by a manager cluster in fig. 3) 31 creates the characteristic sound library, the library may be stored in a storage module (denoted by BOS cloud storage in fig. 3) 32 for storing characteristic sound libraries; after the characteristic sound library information is created, it may be stored in a storage module (denoted by a mysql cluster in fig. 3) 33 for storing characteristic sound library information. In addition, the characteristic sound library information can be provided to the user as query results in the subsequent process, and each query result can serve as one item of information in the available sound library list.
S22: and the SDK sends a query request to the server.
The SDK may send the query request when speech synthesis is needed, for example, after the user opens the SDK and clicks a button for triggering speech synthesis, the SDK sends the query request to the server.
Referring to fig. 3, the query request sent by the SDK 34 may be sent to an entry node (represented by a physical machine room in fig. 3) 35 of the server.
S23: and the server side acquires the query result according to the query request.
The query request may include query conditions, such as the speech synthesis engine version, the domain, and the characteristic voice; after receiving the query request, the server obtains a query result satisfying the query conditions.
In order to handle bursts of query requests that may arrive from the SDK side, the query results can be cached. Referring to fig. 3, the query results are stored in the memcached cluster 36 as an example.
Therefore, after receiving the query request, the server may first query in the memcached cluster, and if the query result meeting the query condition can be found, the server may directly obtain the query result from the memcached cluster. Or if the query result meeting the query condition cannot be found in the memcached cluster, the query can be performed in the mysql cluster, when the query result meeting the query condition exists in the mysql cluster, the query result is obtained from the mysql cluster, and the query result obtained from the mysql cluster can be cached in the memcached cluster, so that the query result can be directly obtained from the memcached cluster in the following process.
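This lookup order is the classic cache-aside pattern. In the sketch below, plain dictionaries stand in for the memcached and mysql clusters; only the control flow (hit; miss with backfill) comes from the text.

    memcached = {}  # stand-in for the memcached cluster (cache)
    mysql = {       # stand-in for the mysql cluster (authoritative store)
        "engine=v2,domain=navigation": ["library A info", "library B info"],
    }

    def query_results(condition):
        if condition in memcached:           # cache hit: serve directly
            return memcached[condition]
        results = mysql.get(condition)       # cache miss: fall back to mysql
        if results is not None:
            memcached[condition] = results   # backfill so later queries hit the cache
        return results

    print(query_results("engine=v2,domain=navigation"))  # miss, then cached
    print(query_results("engine=v2,domain=navigation"))  # hit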
S24: and the server side acquires the available sound library list according to the query result.
For example, referring to fig. 3, the physical machine room obtains the query result from the memcached cluster and, in addition, may obtain the storage address of each available sound library from the BOS cloud storage as its link information, so that the information of each available sound library may include: the query result (e.g., the suitable speech synthesis engine version, domain, characteristic voice, etc.) and the link information; a list of available sound libraries can then be composed from the information of the plurality of available sound libraries.
S25: and the server side sends the available sound library list to the SDK.
S26: and the SDK acquires the sound library selected by the user according to the available sound library list and downloads the sound library selected by the user from the server.
For example, after the SDK acquires the list of available sound libraries, the list is displayed to the user, and the user may select an available sound library according to the displayed information.
In addition, referring to fig. 3, after the characteristic sound library is stored, the storage module storing it may send the storage address of the characteristic sound library, as link information, to the entry node of the server. The entry node then takes the characteristic sound library information acquired from mysql together with the link information acquired from the storage module as the information of an available sound library, composes an available sound library list from the information of a plurality of available sound libraries, and sends the list to the SDK.
When sending the link information to the entry node, the storage module may send the identifier of the characteristic sound library along with it; when the characteristic sound library information is stored in mysql, the identifier is stored along with the information; when the entry node obtains the characteristic sound library information from mysql, it obtains the identifier along with the information, and then associates the information obtained from mysql with the link information obtained from the storage module according to the identifier of the characteristic sound library.
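Associating the two sources by the library identifier amounts to a small join, sketched here with hypothetical record shapes.

    # Hypothetical record shapes: mysql holds id -> info, the storage
    # module holds id -> link (the BOS storage address).
    info_by_id = {"lib-7": {"creator": "studio", "voice": "child"}}
    link_by_id = {"lib-7": "bos://sound-libraries/lib-7"}

    available_list = [
        {**info, "link": link_by_id[lib_id]}   # join the two sources on the id
        for lib_id, info in info_by_id.items()
        if lib_id in link_by_id
    ]
    # available_list is what the entry node sends to the SDK.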
After the user selects the information of an available sound library, the SDK downloads the selected sound library from the server according to the link information in the selected entry.
S27: the SDK synthesizes the text into speech using the downloaded library.
After the SDK obtains the sound library, the sound library can be adopted to synthesize the text into voice, so that voice synthesis is realized.
During synthesis, different synthesis modes can be used according to the content of the downloaded characteristic sound library.
Optionally, the synthesizing the text into a voice by using the downloaded sound library includes:
when the sound library comprises the acoustic model and the acoustic fragments, processing the text, acquiring acoustic parameters according to the processed text and the acoustic model, acquiring corresponding acoustic fragments according to the acoustic parameters, and splicing and synthesizing the acquired acoustic fragments to acquire synthesized voice; or,
when the acoustic model is included in the sound library, processing the text, acquiring acoustic parameters according to the processed text and the acoustic model, and synthesizing vocoder parameters according to the acoustic parameters to acquire synthesized voice; or,
when the sound library comprises an acoustic model, a specific text and corresponding sound data, preprocessing the text, and when the specific text consistent with the preprocessed text exists in the sound library, acquiring the sound data corresponding to the specific text, and taking the sound data or the sound data subjected to decompression processing on the sound data as synthesized voice.
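The three branches can be read as a dispatch on what the downloaded library contains. A minimal sketch, with every synthesis stage stubbed out (the placeholder bodies do no real signal processing):

    import zlib

    def preprocess(text):
        return text.strip()  # stand-in for real text preprocessing

    def acoustic_params(text, model):
        return [hash(c) % 97 for c in text]  # placeholder "acoustic parameters"

    def synthesize(text, library):
        pre = preprocess(text)
        specific = library.get("specific_texts", {})
        if pre in specific:                        # customized-playback branch
            data = specific[pre]
            return zlib.decompress(data) if library.get("compressed") else data
        params = acoustic_params(pre, library["model"])
        if "segments" in library:                  # concatenative branch: splice segments
            return b"".join(library["segments"].get(p, b"") for p in params)
        return bytes(p % 256 for p in params)      # vocoder branch (placeholder output)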
Taking the sound library as a characteristic sound library as an example, the specific contents can be as follows:
in some embodiments, referring to fig. 4, the process of speech synthesis may include:
S41: And performing text preprocessing on the text to be synthesized.
S42: and performing text analysis on the preprocessed text.
S43: and predicting the text prosody after the text analysis.
The details of S41-S43 can be found in the related procedures of existing speech synthesis.
S44: and generating acoustic parameters according to the text subjected to prosody prediction and the characteristic acoustic model in the characteristic sound library.
Unlike the prior art, this embodiment uses the characteristic acoustic model instead of an existing acoustic model; once the acoustic model is determined, the acoustic parameters can be generated in an existing manner.
S45: and acquiring corresponding acoustic fragments in the characteristic sound library according to the acoustic parameters, splicing and synthesizing the acquired acoustic fragments, and acquiring synthesized voice corresponding to the text to be synthesized.
Corresponding acoustic parameters can be created when the acoustic segments are created, and then the acoustic parameters and the acoustic segments are correspondingly stored in the characteristic sound library, so that the corresponding acoustic segments can be found according to the acoustic parameters when the voice is synthesized.
After the acoustic fragments are obtained, the fragments can be spliced, so that the voice corresponding to the text is obtained, and the voice synthesis is realized.
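Because each segment is stored alongside the acoustic parameters it was created from, lookup at synthesis time reduces to indexing, as in this toy sketch (keys and byte values are made up):

    # At creation time each acoustic segment is stored under the
    # acoustic parameters it corresponds to.
    segment_index = {
        ("param-a",): b"\x01\x02",
        ("param-b",): b"\x03\x04",
    }

    def splice(param_sequence):
        """S45: fetch the segment for each parameter tuple and concatenate."""
        return b"".join(segment_index[p] for p in param_sequence)

    audio = splice([("param-a",), ("param-b",)])  # b"\x01\x02\x03\x04"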
In some embodiments, referring to fig. 5, the process of speech synthesis may include:
S51: And performing text preprocessing on the text to be synthesized.
S52: and performing text analysis on the preprocessed text.
S53: and predicting the text prosody after the text analysis.
The details of S51-S53 can be found in the related procedures of existing speech synthesis.
S54: and generating acoustic parameters according to the text subjected to prosody prediction and the characteristic acoustic model in the characteristic sound library.
Unlike the prior art, this embodiment uses the characteristic acoustic model instead of an existing acoustic model; once the acoustic model is determined, the acoustic parameters can be generated in an existing manner.
S55: and synthesizing parameters of the vocoder according to the acoustic parameters to obtain synthesized voice corresponding to the text to be synthesized.
Here, a vocoder is a component capable of generating sound from acoustic parameters, so the synthesized speech can be output through it.
In some embodiments, referring to fig. 6, the process of speech synthesis may include:
S61: And performing text preprocessing on the text to be synthesized.
S62: and judging whether the sound data corresponding to the text to be synthesized exists in the characteristic sound library, if so, executing S63, otherwise, executing S64.
When the specific text and the corresponding sound data are stored in the characteristic sound library, whether the specific text consistent with the text to be synthesized is stored in the characteristic sound library or not can be judged in a searching mode.
It will be appreciated that, since different speakers may recite different content for the same text, the specific text and the corresponding sound data may be identical or merely consistent within an allowed margin. For example, for the specific text "Traffic light ahead, please observe the traffic regulations", different speakers may express it differently, and the sound recorded in a certain sound library may correspond to content such as "Careful, traffic light coming up; running a red light gets you fined!".
S63: corresponding sound data is acquired.
For example, when a specific text corresponding to the text to be synthesized exists in the characteristic sound library, the sound data corresponding to that specific text may be acquired.
After the sound data is acquired, it may be used directly as the synthesized speech. Alternatively, if the sound data corresponding to the specific text was stored in compressed form in the characteristic sound library, the acquired sound data may be decompressed, and the decompressed sound data used as the synthesized speech.
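Reading the customized library back out mirrors the builder sketch given earlier: look up the preprocessed text, then decompress if the library stored compressed audio. A minimal sketch (zlib again standing in for a real codec):

    import zlib

    def customized_playback(text, library, compressed=True):
        """Return the stored speech for the text, or None to fall through to S64."""
        data = library.get(text)
        if data is None:
            return None
        return zlib.decompress(data) if compressed else data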
S64: and performing text analysis on the preprocessed text.
S65: and predicting the text prosody after the text analysis.
The specific contents of S61, S64 and S65 can be found in the related procedures of existing speech synthesis.
S66: and generating acoustic parameters according to the text subjected to prosody prediction and the characteristic acoustic model in the characteristic sound library.
Unlike the prior art, this embodiment uses the characteristic acoustic model instead of an existing acoustic model; once the acoustic model is determined, the acoustic parameters can be generated in an existing manner.
S67: and synthesizing parameters of the vocoder according to the acoustic parameters to obtain synthesized voice corresponding to the text to be synthesized.
Here, a vocoder is a component capable of generating sound from acoustic parameters, so the synthesized speech can be output through it.
In some embodiments, referring to fig. 7, the process of speech synthesis may include:
S71: And performing text preprocessing on the text to be synthesized.
S72: and judging whether the sound data corresponding to the text to be synthesized exists in the characteristic sound library, if so, executing S73, otherwise, executing S74.
When the specific text and the corresponding sound data are stored in the characteristic sound library, whether the specific text consistent with the text to be synthesized is stored in the characteristic sound library or not can be judged in a searching mode.
It will be appreciated that, since different speakers may recite different content for the same text, the specific text and the corresponding sound data may be identical or merely consistent within an allowed margin. For example, for the specific text "Traffic light ahead, please observe the traffic regulations", different speakers may express it differently, and the sound recorded in a certain sound library may correspond to content such as "Careful, traffic light coming up; running a red light gets you fined!".
S73: corresponding sound data is acquired.
For example, when a specific text corresponding to the text to be synthesized exists in the characteristic sound library, the sound data corresponding to that specific text may be acquired.
After the sound data is acquired, it may be used directly as the synthesized speech. Alternatively, if the sound data corresponding to the specific text was stored in compressed form in the characteristic sound library, the acquired sound data may be decompressed, and the decompressed sound data used as the synthesized speech.
S74: and performing text analysis on the preprocessed text.
S75: and predicting the text prosody after the text analysis.
The specific contents of S71, S74 and S75 can be found in the related procedures of existing speech synthesis.
S76: and generating acoustic parameters according to the text subjected to prosody prediction and the characteristic acoustic model in the characteristic sound library.
Unlike the prior art, this embodiment uses the characteristic acoustic model instead of an existing acoustic model; once the acoustic model is determined, the acoustic parameters can be generated in an existing manner.
S77: and acquiring corresponding acoustic fragments in the characteristic sound library according to the acoustic parameters, splicing and synthesizing the acquired acoustic fragments, and acquiring synthesized voice corresponding to the text to be synthesized.
Corresponding acoustic parameters can be created when the acoustic segments are created, and then the acoustic parameters and the acoustic segments are correspondingly stored in the characteristic sound library, so that the corresponding acoustic segments can be found according to the acoustic parameters when the voice is synthesized.
After the acoustic fragments are obtained, the fragments can be spliced, so that the voice corresponding to the text is obtained, and the voice synthesis is realized.
In this embodiment, the sound library is downloaded from the server during speech synthesis instead of being bundled directly in the APP, so the size of the APP can be reduced. In addition, compared with bundling sound libraries in the APP, more sound libraries can be stored on the server, so downloading from the server gives the user more choices; and by including characteristic sound libraries among the available sound libraries, the user's personalized requirements can be met and the user experience improved. By creating the characteristic sound library in different ways and performing speech synthesis in different modes according to its content, different scenario requirements can be met, achieving diversity.
Fig. 8 is a schematic structural diagram of a speech synthesis system according to another embodiment of the present invention. The system includes a client device 81, the client device 81 including:
the query module 811 is configured to query, when speech synthesis is required, an available sound library list from a server, where the available sound library list includes information of a plurality of available sound libraries, and the available sound libraries include a feature sound library;
unlike the prior art in which the sound library is directly contained in the APP, in the present embodiment, the sound library does not need to be contained in the APP, but is downloaded from the server when the sound library is needed.
For example, a Software Development Kit (SDK) corresponding to an APP on a client sends a query request to a server, where the query request is used to request an available sound library list, and the server obtains the available sound library list after receiving the query request and sends the obtained available sound library list to the SDK.
The available sound libraries in this embodiment include a characteristic sound library; it is understood that the available sound libraries may also include existing general sound libraries.
The characteristic sound library is generated in advance and is used to meet personalized requirements; it differs from a common sound library and may be, for example, a children's sound library or a user-customized sound library.
An obtaining module 812, configured to obtain a sound library selected by the user according to the available sound library list, and download the sound library selected by the user from the server;
After the SDK obtains the available sound library list from the server, it can display the list to the user. During display, the information of each available sound library can be shown in detail, for example, the creator of the sound library, the generation time, the version of the offline speech synthesis engine it suits, the domain of the texts it suits, whether it offers a male voice, a female voice, or another characteristic voice, the sound quality, and so on, so that the user can conveniently make a selection.
The user can select information of one or more available sound libraries according to the displayed information.
After the user selects the information of an available sound library, the SDK can determine the corresponding available sound library and download it from the server according to the selected information. For example, the information of each available sound library also includes link information, so after the user makes a selection, the corresponding sound library can be downloaded according to the link information in the selected entry.
And a synthesis module 813, configured to synthesize the text into speech using the downloaded sound library.
After the SDK downloads the sound library from the server, the sound library can be adopted to realize voice synthesis.
In some embodiments, referring to fig. 9, the system further comprises: a server-side device 82, the server-side device comprising: a creation module 821 for creating a library of feature sounds, the creation module 821 being specifically configured to:
establishing a characteristic acoustic model and obtaining acoustic fragments, and forming a characteristic sound library by the characteristic acoustic model and the acoustic fragments; or,
establishing a characteristic acoustic model, and forming a characteristic sound library by the characteristic acoustic model; or,
acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the specific text and the sound data; or,
establishing a characteristic acoustic model, acquiring acoustic fragments, acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the characteristic acoustic model, the acoustic fragments and the specific text and the sound data; or,
establishing a characteristic acoustic model, acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the specific text and the sound data.
In some embodiments, the creating module 821 is configured to establish a characteristic acoustic model, including:
acquiring characteristic sound data, training the characteristic sound data, and establishing a characteristic acoustic model; or,
the method comprises the steps of obtaining an existing acoustic model and characteristic sound data, conducting self-adaptive training on the existing acoustic model according to the characteristic sound data, and building the characteristic acoustic model.
The sample size required when the characteristic acoustic model is obtained by directly training the characteristic sound data is larger than that required when the existing acoustic model is subjected to self-adaptive training.
For example, sound data of a certain scale in a specific timbre is recorded or collected, prosody labeling and boundary labeling are performed manually or automatically, and the characteristic acoustic model is obtained by training. Alternatively, starting from an existing acoustic model, a small amount of sound data of the specific timbre is recorded or collected, and the existing acoustic model is updated into the characteristic acoustic model using an adaptive model training technique.
In some embodiments, acquiring the acoustic segments may include:
and segmenting the training sample to obtain the acoustic fragments.
For example, sound data of a certain scale in a specific timbre is recorded or collected, prosody labeling and boundary labeling are performed manually or automatically, and the data is segmented to obtain the acoustic segments.
In some embodiments, the creating module 821 is configured to obtain sound data corresponding to a specific text, and includes:
selecting a specific text to be recited;
acquiring recited voice of a specific speaker to the specific text;
and taking the recited voice or the voice obtained by compressing the recited voice as the voice data corresponding to the specific text.
For example, a specific speaker is asked to read a specific text with emotion to acquire corresponding voice data, thereby customizing the voice data.
Optionally, in order to save space, the acquired recited speech may be compressed, and the compressed speech is used as the sound data finally stored in the characteristic sound library.
In addition, different speakers can recite the same or different specific texts to acquire different sound data; a plurality of specific texts and their corresponding sound data can then be stored to form a customized library.
In some embodiments, referring to fig. 9, the query module 811 is specifically configured to:
sending a query request to a server, wherein the query request comprises a query condition, so that the server obtains a query result according to the query condition, wherein when the query result exists in a first cluster system, the query result is obtained from the first cluster system, or when the query result does not exist in the first cluster system, the query result is obtained from a second cluster system, and the obtained query result is cached in the first cluster system;
and receiving an available sound library list sent by the server, wherein the available sound library list is obtained by the server according to the query result.
The query request may include query conditions, such as the speech synthesis engine version, the domain, and the characteristic voice; after receiving the query request, the server obtains a query result satisfying the query conditions.
In order to handle bursts of query requests that may arrive from the SDK side, the query results can be cached. Referring to fig. 3, the query results are stored in the memcached cluster 36 as an example.
Therefore, after receiving the query request, the server may first query in the memcached cluster, and if the query result meeting the query condition can be found, the server may directly obtain the query result from the memcached cluster. Or if the query result meeting the query condition cannot be found in the memcached cluster, the query can be performed in the mysql cluster, when the query result meeting the query condition exists in the mysql cluster, the query result is obtained from the mysql cluster, and the query result obtained from the mysql cluster can be cached in the memcached cluster, so that the query result can be directly obtained from the memcached cluster in the following process.
The first cluster system herein corresponds to the memcached cluster in the method embodiment.
For example, referring to fig. 3, the physical machine room obtains the query result from the memcached cluster and, in addition, may obtain the storage address of each available sound library from the BOS cloud storage as its link information, so that the information of each available sound library may include: the query result (e.g., the suitable speech synthesis engine version, domain, characteristic voice, etc.) and the link information; a list of available sound libraries can then be composed from the information of the plurality of available sound libraries.
In some embodiments, referring to fig. 9, the information of an available sound library includes information correspondingly generated after the available sound library is created, and the system further comprises: a second cluster system 823 located at the server, where the second cluster system 823 is configured to store this information. The information correspondingly generated after the available sound library is created corresponds to the characteristic sound library information in the method embodiment; it can be provided to the user as query results in the subsequent process, and each query result can serve as one item of information in the available sound library list.
In some embodiments, referring to fig. 9, the information of an available sound library includes the link information of the sound library, and the system further comprises: a storage module 824 at the server, where the storage module 824 is configured to store the generated available sound libraries and to use the storage address of each available sound library as its link information.
The information correspondingly generated after the available sound library is created corresponds to the characteristic sound library information in the above embodiment; the characteristic sound library information refers to the relevant information generated for the characteristic sound library, such as creator information, generation time, the version of the offline speech synthesis engine it suits, the domain of the texts it suits, whether it is a male voice, a female voice, or another characteristic voice, the sound quality, and the like.
After the characteristic sound library and the characteristic sound library information are created, they can be stored. For example, referring to fig. 3, after the creation module (denoted by a manager cluster in fig. 3) 31 creates the characteristic sound library, the library may be stored in a storage module (denoted by BOS cloud storage in fig. 3) 32 for storing characteristic sound libraries; after the characteristic sound library information is created, it may be stored in a storage module (denoted by a mysql cluster in fig. 3) 33 for storing characteristic sound library information. In addition, the characteristic sound library information can be provided to the user as query results in the subsequent process, and each query result can serve as one item of information in the available sound library list.
Therefore, the second cluster system herein corresponds to the mysql cluster in the method embodiment. The storage module herein corresponds to BOS cloud storage in the method embodiment.
In addition, referring to fig. 3, after the characteristic sound library is stored, the storage module storing it may send the storage address of the characteristic sound library, as link information, to the entry node of the server. The entry node then takes the characteristic sound library information acquired from mysql together with the link information acquired from the storage module as the information of an available sound library, composes an available sound library list from the information of a plurality of available sound libraries, and sends the list to the SDK.
When sending the link information to the entry node, the storage module may send the identifier of the characteristic sound library along with it; when the characteristic sound library information is stored in mysql, the identifier is stored along with the information; when the entry node obtains the characteristic sound library information from mysql, it obtains the identifier along with the information, and then associates the information obtained from mysql with the link information obtained from the storage module according to the identifier of the characteristic sound library.
After the user selects the information of an available sound library, the SDK downloads the selected sound library from the server according to the link information in the selected entry.
In some embodiments, the synthesis module 813 is specifically configured to:
when the sound library comprises the acoustic model and the acoustic fragments, processing the text, acquiring acoustic parameters according to the processed text and the acoustic model, acquiring corresponding acoustic fragments according to the acoustic parameters, and splicing and synthesizing the acquired acoustic fragments to acquire synthesized voice; or,
when the acoustic model is included in the sound library, processing the text, acquiring acoustic parameters according to the processed text and the acoustic model, and synthesizing vocoder parameters according to the acoustic parameters to acquire synthesized voice; or,
when the sound library comprises an acoustic model, a specific text and corresponding sound data, preprocessing the text, and when the specific text consistent with the preprocessed text exists in the sound library, acquiring the sound data corresponding to the specific text, and taking the sound data or the sound data subjected to decompression processing on the sound data as synthesized voice.
For details of the speech synthesis, reference may be made to fig. 4-7, which are not described herein again.
In this embodiment, the sound library is downloaded from the server during speech synthesis instead of being bundled directly in the APP, so the size of the APP can be reduced. In addition, compared with bundling sound libraries in the APP, more sound libraries can be stored on the server, so downloading from the server gives the user more choices; and by including characteristic sound libraries among the available sound libraries, the user's personalized requirements can be met and the user experience improved. By creating the characteristic sound library in different ways and performing speech synthesis in different modes according to its content, different scenario requirements can be met, achieving diversity.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (16)

1. A method of speech synthesis, comprising:
when speech synthesis is needed, inquiring an available sound library list from a server, wherein the available sound library list comprises information of a plurality of available sound libraries, and the available sound libraries comprise characteristic sound libraries;
acquiring a sound library selected by a user according to the available sound library list, and downloading the sound library selected by the user from a server;
and synthesizing the text into speech by adopting the downloaded sound library.
2. The method of claim 1, further comprising: creating a library of distinctive sounds, the creating a library of distinctive sounds comprising:
establishing a characteristic acoustic model and obtaining acoustic fragments, and forming a characteristic sound library by the characteristic acoustic model and the acoustic fragments; or,
establishing a characteristic acoustic model, and forming a characteristic sound library by the characteristic acoustic model; or,
acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the specific text and the sound data; or,
establishing a characteristic acoustic model, acquiring acoustic fragments, acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the characteristic acoustic model, the acoustic fragments and the specific text and the sound data; or,
establishing a characteristic acoustic model, acquiring sound data corresponding to a specific text, and forming a characteristic sound library by the specific text and the sound data.
3. The method of claim 2, wherein the establishing a characteristic acoustic model comprises:
acquiring characteristic sound data, training the characteristic sound data, and establishing a characteristic acoustic model; or,
the method comprises the steps of obtaining an existing acoustic model and characteristic sound data, conducting self-adaptive training on the existing acoustic model according to the characteristic sound data, and building the characteristic acoustic model.
4. The method of claim 2, wherein the obtaining sound data corresponding to a specific text comprises:
selecting a specific text to be recited;
acquiring recited voice of a specific speaker to the specific text;
and taking the recited voice or the voice obtained by compressing the recited voice as the voice data corresponding to the specific text.
5. The method according to any one of claims 1-4, wherein said querying the list of available sound libraries from the server comprises:
sending a query request to a server, wherein the query request comprises a query condition, so that the server obtains a query result according to the query condition, wherein when the query result exists in a first cluster system, the query result is obtained from the first cluster system, or when the query result does not exist in the first cluster system, the query result is obtained from a second cluster system, and the obtained query result is cached in the first cluster system;
and receiving an available sound library list sent by the server, wherein the available sound library list is obtained by the server according to the query result.
6. The method according to any one of claims 1-4, wherein the information of the available sound library comprises: information correspondingly generated after the available sound library is created, wherein the information is stored in a second cluster system of the server.
7. The method according to any one of claims 1-4, wherein the information of an available sound library comprises link information of the available sound library, and the downloading the sound library selected by the user from the server comprises:
downloading the corresponding sound library from the server according to the link information, wherein the link information is the storage address at which the available sound library is stored.
8. The method according to any one of claims 1-4, wherein the synthesizing text into speech using the downloaded sound library comprises:
when the sound library comprises an acoustic model and acoustic segments, processing the text, obtaining acoustic parameters according to the processed text and the acoustic model, obtaining corresponding acoustic segments according to the acoustic parameters, and splicing the obtained acoustic segments to obtain synthesized speech; or,
when the sound library comprises an acoustic model, processing the text, obtaining acoustic parameters according to the processed text and the acoustic model, and synthesizing vocoder parameters according to the acoustic parameters to obtain synthesized speech; or,
when the sound library comprises an acoustic model, a specific text, and corresponding sound data, preprocessing the text, and, when a specific text consistent with the preprocessed text exists in the sound library, obtaining the sound data corresponding to that specific text and taking the sound data, or the sound data after decompression, as the synthesized speech.
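
A dispatch sketch over the three claimed branches, reusing the CharacteristicSoundLibrary record sketched under claim 2. The front-end and back-end helpers are stubs, since the claim does not fix any particular text processing, unit selection, or vocoder:

```python
import gzip

# Stub helpers; a real system supplies its own front-end and back-end.
def preprocess(text): return text.strip()                  # text processing
def predict_params(text, model): return [text]             # acoustic parameters
def select_segments(params, inventory):                    # unit lookup
    return [inventory[p] for p in params if p in inventory]
def vocoder(params): return b""                            # parametric synthesis

def synthesize_text(text, lib):
    processed = preprocess(text)
    # Branch 3: a matching specific text exists, so play back its sound data
    # (stored compressed in the claim-4 sketch, hence the decompression).
    if lib.specific_texts and processed in lib.specific_texts:
        return gzip.decompress(lib.specific_texts[processed])
    params = predict_params(processed, lib.acoustic_model)
    # Branch 1: model plus segments, so splice matching acoustic segments.
    if lib.acoustic_segments is not None:
        return b"".join(select_segments(params, lib.acoustic_segments))
    # Branch 2: model only, so synthesize through vocoder parameters.
    return vocoder(params)
```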
9. A speech synthesis system, comprising a client device, the client device comprising:
a query module, configured to query an available sound library list from a server when speech synthesis is needed, wherein the available sound library list comprises information on a plurality of available sound libraries, and the available sound libraries comprise characteristic sound libraries;
an acquisition module, configured to acquire a sound library selected by a user according to the available sound library list, and to download the sound library selected by the user from the server;
and a synthesis module, configured to synthesize text into speech using the downloaded sound library.
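
One possible decomposition of the claimed client device into the three modules; the constructor arguments and method names are assumptions, not taken from the patent:

```python
class ClientDevice:
    """Illustrative wiring of the query, acquisition, and synthesis modules."""
    def __init__(self, query_module, acquisition_module, synthesis_module):
        self.query_module = query_module              # lists available libraries
        self.acquisition_module = acquisition_module  # resolves choice, downloads
        self.synthesis_module = synthesis_module      # turns text into speech

    def speak(self, text, choose):
        libraries = self.query_module.available_libraries()
        library = self.acquisition_module.download(choose(libraries))
        return self.synthesis_module.synthesize(text, library)
```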
10. The system of claim 9, further comprising a server device, the server device comprising a creation module configured to create a characteristic sound library, the creation module being specifically configured for:
establishing a characteristic acoustic model and acquiring acoustic segments, and forming a characteristic sound library from the characteristic acoustic model and the acoustic segments; or,
establishing a characteristic acoustic model, and forming a characteristic sound library from the characteristic acoustic model; or,
acquiring sound data corresponding to a specific text, and forming a characteristic sound library from the specific text and the sound data; or,
establishing a characteristic acoustic model, acquiring acoustic segments, and acquiring sound data corresponding to a specific text, and forming a characteristic sound library from the characteristic acoustic model, the acoustic segments, the specific text, and the sound data; or,
establishing a characteristic acoustic model, acquiring sound data corresponding to a specific text, and forming a characteristic sound library from the characteristic acoustic model, the specific text, and the sound data.
11. The system of claim 10, wherein the creation module is configured to establish a characteristic acoustic model by:
acquiring characteristic sound data, and training on the characteristic sound data to establish the characteristic acoustic model; or,
acquiring an existing acoustic model and characteristic sound data, and adaptively training the existing acoustic model according to the characteristic sound data to establish the characteristic acoustic model.
12. The system of claim 10, wherein the creation module is configured to acquire sound data corresponding to a specific text by:
selecting a specific text to be recited;
acquiring the recitation speech of a specific speaker for the specific text;
and taking the recitation speech, or the speech obtained by compressing the recitation speech, as the sound data corresponding to the specific text.
13. The system according to any one of claims 9-12, wherein the query module is specifically configured to:
send a query request to the server, wherein the query request comprises a query condition, so that the server obtains a query result according to the query condition; wherein, when the query result exists in a first cluster system, the query result is obtained from the first cluster system, or, when the query result does not exist in the first cluster system, the query result is obtained from a second cluster system and the obtained query result is cached in the first cluster system;
and receive the available sound library list sent by the server, wherein the available sound library list is obtained by the server according to the query result.
14. The system according to any one of claims 9-12, wherein the information of an available sound library comprises information generated when the available sound library is created, and the system further comprises: a second cluster system, located at the server, configured to store the information generated when the available sound library is created.
15. The system according to any one of claims 9-12, wherein the information of an available sound library comprises link information of the available sound library, and the system further comprises: a storage module, located at the server, configured to store the generated available sound library and to take the storage address of the available sound library as the link information.
16. The system according to any one of claims 9-12, wherein the synthesis module is specifically configured to:
when the sound library comprises an acoustic model and acoustic segments, process the text, obtain acoustic parameters according to the processed text and the acoustic model, obtain corresponding acoustic segments according to the acoustic parameters, and splice the obtained acoustic segments to obtain synthesized speech; or,
when the sound library comprises an acoustic model, process the text, obtain acoustic parameters according to the processed text and the acoustic model, and synthesize vocoder parameters according to the acoustic parameters to obtain synthesized speech; or,
when the sound library comprises an acoustic model, a specific text, and corresponding sound data, preprocess the text, and, when a specific text consistent with the preprocessed text exists in the sound library, obtain the sound data corresponding to that specific text and take the sound data, or the sound data after decompression, as the synthesized speech.
Application CN201510441079.6A, filed 2015-07-24 (priority date 2015-07-24): Speech synthesis method and system. Status: Active. Granted as CN104992703B.

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
CN201510441079.6A | 2015-07-24 | 2015-07-24 | Speech synthesis method and system (granted as CN104992703B)
PCT/CN2015/097162 | 2015-07-24 | 2015-12-11 | Voice synthesis method and system (WO2017016135A1)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201510441079.6A | 2015-07-24 | 2015-07-24 | Speech synthesis method and system

Publications (2)

Publication Number | Publication Date
CN104992703A (application) | 2015-10-21
CN104992703B (grant) | 2017-10-03

Family

Family ID: 54304506

Family Applications (1)

Application Number | Status | Title | Priority Date | Filing Date
CN201510441079.6A | Active | Speech synthesis method and system | 2015-07-24 | 2015-07-24

Country Status (2)

Country | Publications
CN (1) | CN104992703B
WO (1) | WO2017016135A1


Families Citing this family (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN111950016B * | 2019-05-14 | 2023-11-21 | Beijing Tengyun Tianxia Technology Co., Ltd. (北京腾云天下科技有限公司) | Method and device for generating a data open output model, and computing device
CN110364139B * | 2019-06-27 | 2023-04-18 | Shanghai Maikefeng Culture Media Co., Ltd. (上海麦克风文化传媒有限公司) | Text-to-speech working method with intelligent role matching


Family Cites Families (8)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
JP2001272992A * | 2000-03-27 | 2001-10-05 | Ricoh Co Ltd | Voice processing system, text reading system, voice recognition system, dictionary acquiring method, dictionary registering method, terminal device, dictionary server, and recording medium
JP2002156988A * | 2000-11-21 | 2002-05-31 | Matsushita Electric Ind Co Ltd | Information providing system and voice synthesizer
JP3748064B2 * | 2002-02-08 | 2006-02-22 | Nippon Telegraph and Telephone Corp (日本電信電話株式会社) | Speech synthesis method, speech synthesizer, and speech synthesis program
CN101246687A * | 2008-03-20 | 2008-08-20 | Beihang University (北京航空航天大学) | Intelligent voice interaction system and method thereof
CN102117614B * | 2010-01-05 | 2013-01-02 | Sony Ericsson Mobile Communications (索尼爱立信移动通讯有限公司) | Personalized text-to-speech synthesis and personalized speech feature extraction
JP2014021136A * | 2012-07-12 | 2014-02-03 | Yahoo Japan Corp | Speech synthesis system
CN103581857A * | 2013-11-05 | 2014-02-12 | Huawei Device Co., Ltd. (华为终端有限公司) | Method for giving voice prompt, text-to-speech server and terminals
CN104992703B * | 2015-07-24 | 2017-10-03 | Baidu Online Network Technology (Beijing) Co., Ltd. (百度在线网络技术(北京)有限公司) | Speech synthesis method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN1675681A * | 2002-09-13 | 2005-09-28 | Matsushita Electric Industrial Co., Ltd. (松下电器产业株式会社) | Client-server voice customization
CN101669166A * | 2007-03-21 | 2010-03-10 | TomTom International B.V. (通腾科技股份有限公司) | Apparatus for text-to-speech delivery and method therefor
CN102137140A * | 2010-10-08 | 2011-07-27 | Huawei Software Technologies Co., Ltd. (华为软件技术有限公司) | Method, device and system for processing streaming services
WO2014130177A1 * | 2013-02-20 | 2014-08-28 | Google Inc. | Methods and systems for sharing of adapted voice profiles

Cited By (12)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
WO2017016135A1 * | 2015-07-24 | 2017-02-02 | Baidu Online Network Technology (Beijing) Co., Ltd. (百度在线网络技术(北京)有限公司) | Voice synthesis method and system
CN109036374A * | 2018-07-03 | 2018-12-18 | Baidu Online Network Technology (Beijing) Co., Ltd. | Data processing method and device
CN109036374B * | 2018-07-03 | 2019-12-03 | Baidu Online Network Technology (Beijing) Co., Ltd. | Data processing method and device
CN110021291A * | 2018-12-26 | 2019-07-16 | Alibaba Group Holding Ltd. (阿里巴巴集团控股有限公司) | Method and device for invoking a speech synthesis file
WO2020134896A1 * | 2018-12-26 | 2020-07-02 | Alibaba Group Holding Ltd. | Method and device for invoking speech synthesis file
CN110021291B * | 2018-12-26 | 2021-01-29 | Advanced New Technologies Co., Ltd. (创新先进技术有限公司) | Method and device for calling voice synthesis file
CN109903748A * | 2019-02-14 | 2019-06-18 | Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司) | Speech synthesis method and device based on a customized sound library
CN112750423A * | 2019-10-29 | 2021-05-04 | Alibaba Group Holding Ltd. | Method, device and system for constructing a personalized speech synthesis model, and electronic device
CN112750423B * | 2019-10-29 | 2023-11-17 | Alibaba Group Holding Ltd. | Personalized speech synthesis model construction method, device and system, and electronic device
CN110782869A * | 2019-10-30 | 2020-02-11 | Biaobei (Beijing) Technology Co., Ltd. (标贝(北京)科技有限公司) | Speech synthesis method, apparatus, system and storage medium
CN110856023A * | 2019-11-15 | 2020-02-28 | Sichuan Changhong Electric Co., Ltd. (四川长虹电器股份有限公司) | System and method for customized smart-television broadcasting based on TTS
CN111986648A * | 2020-06-29 | 2020-11-24 | Lenovo (Beijing) Ltd. (联想(北京)有限公司) | Information processing method, device and equipment

Also Published As

Publication number | Publication date
CN104992703B | 2017-10-03
WO2017016135A1 | 2017-02-02

Similar Documents

Publication Title
CN104992703B Speech synthesis method and system
CN109949783B (en) Song synthesis method and system
CN108573693B (en) Text-to-speech system and method, and storage medium therefor
CN106898340B (en) Song synthesis method and terminal
CN106537496B (en) Terminal device, information providing system, information presenting method, and information providing method
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
JP6665446B2 (en) Information processing apparatus, program, and speech synthesis method
WO2007141993A1 (en) Audio combining device
KR20110068869A (en) Rating speech naturalness of speech utterances based on a plurality of human testers
EP3505146A1 (en) Auditory training device, auditory training method, and program
KR20200045852A (en) Speech and image service platform and method for providing advertisement service
CN111105779A (en) Text playing method and device for mobile client
GB2516942A (en) Text to Speech Conversion
CN113704390A (en) Interaction method and device of virtual objects, computer readable medium and electronic equipment
CN113691909A (en) Digital audio workstation with audio processing recommendations
CN113724683A (en) Audio generation method, computer device, and computer-readable storage medium
KR102161237B1 (en) Method for outputting sound and apparatus for the same
CN110797004B (en) Data transmission method and device
US8781835B2 (en) Methods and apparatuses for facilitating speech synthesis
KR20230075998A (en) Method and system for generating avatar based on text
JP4564416B2 (en) Speech synthesis apparatus and speech synthesis program
CN113299271B (en) Speech synthesis method, speech interaction method, device and equipment
CN114783408A (en) Audio data processing method and device, computer equipment and medium
US20140067398A1 (en) Method, system and processor-readable media for automatically vocalizing user pre-selected sporting event scores
US20120330666A1 (en) Method, system and processor-readable media for automatically vocalizing user pre-selected sporting event scores

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant