CN109147760A

CN109147760A - Method, apparatus, system and device for synthesizing speech

Info

Publication number: CN109147760A
Application number: CN201710508321.6A
Authority: CN
Inventors: 王玉平
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2017-06-28
Filing date: 2017-06-28
Publication date: 2019-01-04

Abstract

The invention discloses a method, apparatus, system and device for synthesizing speech. The method comprises: receiving text to be converted and index information; obtaining a corresponding phonetic dictionary according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments; and calling a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, thereby generating synthesized speech. The invention solves the technical problem that the speech generated by existing speech synthesis systems is not sufficiently accurate.

Description

Method, apparatus, system and device for synthesizing speech

Technical field

The present invention relates to the technical field of speech synthesis, and in particular to a method, apparatus, system and device for synthesizing speech.

Background art

Speech synthesis is the technology of generating artificial speech by mechanical or electronic means. Text-to-speech (TTS) technology, a branch of speech synthesis, converts text information generated inside a computer or input from outside into speech and plays it back; automatic telephone customer service and audiobooks, for example, are implemented with speech synthesis technology. As user demand for speech synthesis grows, the requirements placed on synthesized speech are becoming increasingly diverse. How to improve the accuracy of a speech synthesis system and convert text information into speech that is rich in emotion and closer to human language is therefore one of the important topics for future speech synthesis systems.

Existing speech synthesis systems can generally provide only one pronunciation for a given word, phrase or sentence. In practice, however, a particular word, phrase or sentence may be pronounced differently in different application scenarios. For example, when used as a personal name or place name, some characters or words are pronounced quite differently from their usual reading. If an existing speech synthesis system is used for such special-purpose characters or words, the synthesized pronunciation may be wrong and may even cause ambiguity. In addition, the intonation of the same character, word, phrase or sentence often differs between application scenarios. Traditional speech synthesis systems therefore cannot solve this problem, whether it is viewed in terms of application scenarios or in terms of specialized needs.

No effective solution to the above problem has yet been proposed.

Summary of the invention

Embodiments of the present invention provide a method, apparatus, system and device for synthesizing speech, so as to at least solve the technical problem that the speech generated by existing speech synthesis systems is not sufficiently accurate.

According to one aspect of the embodiments of the present invention, a method for synthesizing speech is provided, comprising: receiving text to be converted and index information; obtaining a corresponding phonetic dictionary according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments; and calling a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, thereby generating synthesized speech.

According to another aspect of the embodiments of the present invention, a device for synthesizing speech is further provided, comprising: an input unit, configured to receive text to be converted and index information; a processor, configured to obtain a corresponding phonetic dictionary according to the index information and to call a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, thereby generating synthesized speech, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments; and a pronunciation device, configured to output the synthesized speech.

According to another aspect of the embodiments of the present invention, a system for synthesizing speech is further provided, comprising: a front-end device, configured to receive input text to be converted and index information; and a server, connected to the front-end device and configured to receive the text to be converted and the index information and to return the phonetic dictionary obtained according to the index information to the front-end device, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments. The front-end device is further configured to call a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, thereby generating synthesized speech.

According to another aspect of the embodiments of the present invention, an apparatus for synthesizing speech is further provided, comprising: a receiving module, configured to receive text to be converted and index information; an obtaining module, configured to obtain a corresponding phonetic dictionary according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments; and a generation module, configured to call a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, thereby generating synthesized speech.

According to another aspect of the embodiments of the present invention, a method for synthesizing speech is provided, comprising: receiving text to be converted and index information; obtaining a corresponding phonetic dictionary according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments; and processing the text to be converted and the corresponding phonetic dictionary to generate synthesized speech.

According to another aspect of the embodiments of the present invention, an apparatus for synthesizing speech is further provided, comprising: a receiving unit, configured to receive text to be converted and index information; an obtaining unit, configured to obtain a corresponding phonetic dictionary according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments; and a generation unit, configured to process the text to be converted and the corresponding phonetic dictionary to generate synthesized speech.

According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium comprises a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the above method for synthesizing speech.

According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, wherein the program, when running, executes the above method for synthesizing speech.

In the embodiments of the present invention, text to be converted and index information are received; a corresponding phonetic dictionary is obtained according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments; and a speech synthesis service is called to process the text to be converted and the corresponding phonetic dictionary, thereby generating synthesized speech. This achieves the purpose of converting the same text content into speech for different application scenarios within a single speech synthesis system, realizes the technical effect of a more intelligent and diversified speech synthesis service, and in turn solves the technical problem that the speech generated by existing speech synthesis systems is not sufficiently accurate.

Brief description of the drawings

The drawings described herein are provided for a further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of a device for synthesizing speech according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of an optional speech synthesis principle according to an embodiment of the present invention;

Fig. 3 is a flowchart of a method for synthesizing speech according to an embodiment of the present invention;

Fig. 4 is a flowchart of an optional method for synthesizing speech according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a system for synthesizing speech according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of an apparatus for synthesizing speech according to an embodiment of the present invention;

Fig. 7 is a hardware block diagram of a terminal according to an embodiment of the present invention;

Fig. 8 is a flowchart of a method for synthesizing speech according to an embodiment of the present invention; and

Fig. 9 is a schematic diagram of an apparatus for synthesizing speech according to an embodiment of the present invention.

Detailed description of the embodiments

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.

Embodiment 1

According to an embodiment of the present invention, an embodiment of a device for synthesizing speech is provided. It should be noted that the device for synthesizing speech provided in this embodiment may be an intelligent electronic device that provides speech services, such as a computer, mobile phone, e-book reader, MP3 player or in-vehicle navigation unit, or an intelligent robot in the field of artificial intelligence. The device can convert text information that it generates or receives into a speech signal and output it through a sound output device (for example, a player).

The device for synthesizing speech provided in this embodiment allows the user to have the text information to be converted output as speech matching the text's application scenario (for example, audiobook, storytelling, stage play, speech or recitation), voice style (for example, male voice, female voice, child's voice, humorous style or serious style) or special purpose (for example, personal name or place name), so as to meet users' diverse speech synthesis needs.

As an optional implementation, if the device is used to convert text information generated inside the device into a speech signal, the user can directly select an application scenario, voice style or special purpose, and the device outputs speech for the selected application scenario, voice style or special purpose. If the device is used to convert text information input by the user into a speech signal, the user can, while inputting the text content, input or select the index information of the phonetic dictionary to be used for conversion into the target speech (the index information specifies the application scenario, voice style or special purpose of the text content), and the device outputs the input text content as the target speech.

Fig. 1 is a schematic diagram of a device for synthesizing speech according to an embodiment of the present invention. As shown in Fig. 1, the device 10 comprises: an input unit 101, a processor 103 and a pronunciation device 105.

The input unit 101 is configured to receive text to be converted and index information.

Specifically, the text to be converted may be text information obtained through the input unit. The text is not limited to Chinese or English and may be in the language of any country. The index information may be preset identification information used to index at least one phonetic dictionary, for example, the number of the phonetic dictionary. The phonetic dictionary is used to convert the text to be converted into speech for a given application scenario, voice style or special purpose.

Optionally, the input unit may be a hardware input device such as a keyboard, scanning device, handwriting pad or microphone. If the input unit is a keyboard, the device 10 can directly receive, through the keyboard, the text information that the user wants converted into the target speech. If the input unit is a scanning device (for example, a scanner or camera), the device 10 first recognizes the text in the image captured by the scanning device and converts it into corresponding text information, which serves as the text information to be converted into the target speech. If the input unit is a handwriting pad, the device 10 derives the corresponding text from the trajectory of the user's strokes on the pad and uses it as the text information to be converted. If the input unit is a microphone, the device 10 receives the speech input by the user, converts its content into corresponding text information and uses that as the text information to be converted. As an optional implementation, the text to be converted may also be data obtained from a network. For example, when a user of an online translation dictionary enters text in one language, the server returns the corresponding text in another language, and the returned text is output as speech; in that case the input unit may also be a communication device that receives the information returned by the server.

In an optional implementation, the index information may specify at least one phonetic dictionary to be used for converting the text to be converted into speech. The index information may be the identification information of the phonetic dictionary or the number of the phonetic dictionary.

As an optional implementation, the index information may be defined and input by the user, or selected by the user from the identification information of at least one phonetic dictionary suggested by the system. The index information may be input together with the text, or before the text is input.

As another optional implementation, the system may automatically recognize, from the contextual semantics, which phonetic dictionary the text information corresponds to; the index information is then the identification information or number of the phonetic dictionary that the system selected automatically.

Optionally, in both of the above modes, a general-purpose speech synthesis dictionary is used to generate speech by default. When the user inputs the index information of a special-purpose phonetic dictionary, or the system recognizes special-purpose text information, the corresponding special-purpose phonetic dictionary is obtained and used for speech synthesis.

It should be readily noted that the embodiments protected by this application include but are not limited to the above implementations. Any scheme that adds label information (that is, index information) to text information in order to synthesize speech for different scenarios and different purposes falls within the scope of protection of the present invention.

It should be noted here that priorities can be assigned to index information, and the corresponding phonetic dictionaries are called in priority order. For example, the priority order may be special purpose, then application scenario, then voice style. The phonetic dictionary for the special purpose (for example, personal names or place names) is called first, which ensures that words are pronounced correctly for each purpose. The phonetic dictionary for the application scenario (for example, audiobook, storytelling, stage play, speech or recitation) is called next, which determines the approximate intonation. Finally, the phonetic dictionary for the voice style (for example, male voice, female voice, child's voice, humorous style or serious style) is applied, which makes the speech more diverse.
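The priority ordering described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the category names, the numeric ranks and the function `order_by_priority` are hypothetical and do not appear in the original text, which gives the order special purpose, application scenario, voice style only as an example.

```python
from typing import Dict, List

# Hypothetical category names and ranks; the text only gives the example order
# special purpose -> application scenario -> voice style.
PRIORITY = {"special_purpose": 0, "application_scenario": 1, "voice_style": 2}


def order_by_priority(indexes: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return index entries sorted so higher-priority dictionaries come first."""
    # sorted() is stable, so entries in the same category keep their input order.
    return sorted(indexes, key=lambda entry: PRIORITY[entry["category"]])


requested = [
    {"id": "male_voice", "category": "voice_style"},
    {"id": "person_names", "category": "special_purpose"},
    {"id": "audiobook", "category": "application_scenario"},
]
call_order = [entry["id"] for entry in order_by_priority(requested)]
```

Here `call_order` lists the dictionary identifiers in the order in which they would be called, with the special-purpose dictionary first.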

The processor 103 is configured to obtain a corresponding phonetic dictionary according to the index information and to call a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, thereby generating synthesized speech, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns in different application environments.

Specifically, the phonetic dictionary may be a speech library used to convert the text to be converted into the pronunciation pattern of a given application environment, voice style or special purpose. The speech library contains text content to be converted and the speech information corresponding to that text content. After the device 10 receives, through the input unit 101, the text information to be converted into the target speech together with the index information, it obtains the phonetic dictionary corresponding to the target speech according to the index information, calls the speech synthesis (TTS) service and, on the basis of that phonetic dictionary, synthesizes the text information into the corresponding target speech.

In an optional implementation, assume that the text to be converted is 我的名字叫单于 ("My name is 单于"), where the word 单于 (literally "the chief of the Xiongnu in ancient China") is pronounced differently from its usual reading when used as a personal name. With an existing speech synthesis system, the synthesized result is "wo de ming zi jiao dan yu". Based on the above embodiments of this application, a phonetic dictionary (user dictionary) dedicated to personal names or place names is established; its format may be as shown in Table 1. During speech synthesis, if the user inputs the index information "1" of the user dictionary together with the text to be converted, the synthesized result is "wo3de0ming2zi4jiao4shan4yu2", which avoids misreading "shan4yu2" as "dan1yu2". Here "0", "1", "2", "3" and "4" denote the neutral tone, first tone, second tone, third tone and fourth tone respectively.

Table 1 Phonetic dictionary format

Number	Word	Annotation
1	单于 (the chief of the Xiongnu)	shan4yu2
2	亳州 (Bozhou)	bo2zhou1

In another optional implementation, assume that the text to be converted is 我家来自亳州 ("My family comes from Bozhou"), where the word 亳州 (Bozhou) is pronounced differently from the usual reading when used as a place name. With an existing speech synthesis system, the synthesized result is "wo jia lai zi hao zhou". With the user dictionary shown in Table 1, based on the above embodiments of this application, the synthesized result is "wo3jia1lai2zi1bo2zhou1". Here "0", "1", "2", "3" and "4" denote the neutral tone, first tone, second tone, third tone and fourth tone respectively.
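The two examples above amount to a lookup in which the dictionary selected by the index information overrides a general-purpose reading. A minimal Python sketch follows. It is not the patented implementation: the toy `DEFAULT_LEXICON`, the pre-segmented word list and the function `to_pinyin` are assumptions made for illustration, and treating "1" as the identifier of the whole user dictionary is one possible reading of the text.

```python
from typing import Dict, List, Optional

# Toy general-purpose lexicon (an assumption); it holds the everyday readings
# that the text says a conventional system would produce.
DEFAULT_LEXICON: Dict[str, str] = {
    "我": "wo3", "的": "de0", "名字": "ming2zi4", "叫": "jiao4",
    "单于": "dan1yu2",
}

# User dictionaries keyed by index information, with entries taken from Table 1.
USER_DICTS: Dict[str, Dict[str, str]] = {
    "1": {"单于": "shan4yu2", "亳州": "bo2zhou1"},
}


def to_pinyin(words: List[str], index: Optional[str] = None) -> str:
    """Annotate pre-segmented words, preferring the dictionary chosen by index."""
    user_dict = USER_DICTS.get(index, {}) if index is not None else {}
    # Fall back to the general-purpose reading when the user dictionary has no entry.
    return "".join(user_dict.get(w, DEFAULT_LEXICON[w]) for w in words)


sentence = ["我", "的", "名字", "叫", "单于"]
default_reading = to_pinyin(sentence)             # no index information supplied
name_reading = to_pinyin(sentence, index="1")     # user dictionary "1" selected
```

A real system would pass the resulting annotation to the synthesis engine; the sketch stops at the annotated pinyin string because that is the level at which the text describes the result.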

The pronunciation device 105 is configured to output the synthesized speech.

Specifically, the pronunciation device may be a player for outputting speech. After the processor has synthesized the speech for the text to be converted, the pronunciation device outputs the synthesized speech.

As can be seen from the above, in the above embodiments of this application, phonetic dictionaries are established for the pronunciation patterns of different application environments. During speech synthesis, the text to be converted into the target speech is received together with the index information of the phonetic dictionary corresponding to the target speech; the phonetic dictionary corresponding to the target speech is obtained according to the index information; the speech synthesis service is called to synthesize the target speech for the text on the basis of that phonetic dictionary; and the synthesized target speech is output. This achieves the purpose of converting the same text content into speech for different application scenarios within a single speech synthesis system, realizes the technical effect of a more intelligent and diversified speech synthesis service, and in turn solves the technical problem that the speech generated by existing speech synthesis systems is not sufficiently accurate.

In an optional embodiment, the phonetic dictionary records the different pronunciations of the same pronunciation object in different application environments, wherein the pronunciation object includes at least one of the following: a character, a word, a phrase and a sentence.

Specifically, in the above embodiment, the pronunciation object may be a character, word, phrase or sentence that makes up the text content to be converted. For the same pronunciation object, pronunciation patterns can be established for different application environments, voice styles or special purposes, which together form the phonetic dictionaries for those application environments, voice styles or special purposes. When text content is synthesized into speech, the pronunciation of each character, word, phrase or sentence in the text content can be selected according to the applicable application scenario, voice style or special purpose.

Take audiobook synthesis as an example. A single novel often involves different roles (for example, the male lead and the female lead) and dialogue spoken in different situations (for example, anger, sadness or happiness). With the device for synthesizing speech provided in this embodiment, when a text novel is converted into an audiobook, the speech in the phonetic dictionaries for the different roles or situations can be called from the phonetic dictionary library. For example, when the text content to be converted is dialogue spoken by the male lead in a happy situation, the label "male voice, happy" can be input together with the male lead's dialogue, so that the pronunciations corresponding to the dialogue content (characters, words, phrases or sentences) are called from the "male voice, happy" dictionary and the final speech for the dialogue is synthesized.

In an optional implementation, each role and the gender of each role can be set in advance. During audiobook synthesis, the system automatically recognizes the dialogue of each role and automatically retrieves the corresponding phonetic dictionary for synthesis; when special-purpose text information is recognized, the special-purpose phonetic dictionary is retrieved automatically. Optionally, natural language processing technology can be used to identify special-purpose text from the contextual semantics. For example, a general-purpose speech synthesis dictionary is used to generate speech by default; when special-purpose text information is recognized (for example, 我的名字叫单于, "My name is 单于"), the system can synthesize the phrase adjacent to the word "name" using the personal-name phonetic dictionary.

With the above embodiment, different pronunciations can be synthesized for the same character, word, phrase or sentence according to the application scenario, voice style or special purpose, which provides diversified pronunciation and improves the user experience.

In an optional embodiment, as shown in Fig. 1, the device 10 further comprises: a communication device 107, configured to upload the dictionary information of a phonetic dictionary to a server, wherein the server stores at least one phonetic dictionary, each phonetic dictionary contains uploaded dictionary information, the server generates matching index information after receiving the uploaded dictionary information, and different phonetic dictionaries correspond to different index information.

Specifically, in the above embodiment, before the speech synthesis service is called, the user can use the device for synthesizing speech to generate phonetic dictionaries for different application environments, voice styles or special purposes and upload them to the server. After receiving the uploaded dictionary information, the server generates matching index information. Optionally, the index information may be a randomly generated number or the identity information (ID) of the user. Because different phonetic dictionaries correspond to different index information, the user can input the corresponding index information together with the text content to be converted and thereby obtain the corresponding phonetic dictionary.
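The upload step can be sketched as a small in-memory server. This is a hedged illustration only: the class `DictionaryServer`, its methods and the use of `uuid` for the randomly generated number are assumptions, not details given in the original text.

```python
import uuid
from typing import Dict, Optional


class DictionaryServer:
    """Hypothetical in-memory stand-in for the server that stores user dictionaries."""

    def __init__(self) -> None:
        self._library: Dict[str, Dict[str, str]] = {}

    def upload(self, dict_info: Dict[str, str], user_id: Optional[str] = None) -> str:
        """Store the uploaded dictionary and return its matching index information."""
        # Either reuse the user's ID or generate a random identifier, the two
        # options the text mentions. With a user ID, a later upload by the same
        # user replaces the earlier dictionary in this sketch.
        index = user_id if user_id is not None else uuid.uuid4().hex
        self._library[index] = dict(dict_info)
        return index

    def lookup(self, index: str) -> Optional[Dict[str, str]]:
        """Return the dictionary for the given index information, if any."""
        return self._library.get(index)


server = DictionaryServer()
names_index = server.upload({"单于": "shan4yu2"})
places_index = server.upload({"亳州": "bo2zhou1"})
by_user = server.upload({"亳州": "bo2zhou1"}, user_id="user-42")
```

The caller keeps the returned index and supplies it later together with the text to be converted.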

The above embodiment makes it possible to create a library of phonetic dictionaries for different application environments, voice styles or special purposes. Moreover, because the corresponding phonetic dictionary is obtained from the server according to the index information, less local storage space is occupied.

In an optional embodiment, after the server receives the uploaded dictionary information, it checks whether the format and/or pronunciation of the pronunciation objects contained in the dictionary information satisfies a predetermined condition and, if so, writes the dictionary information into the corresponding index library.

Specifically, in the above embodiment, after the server receives a phonetic dictionary (user dictionary, UserDict), the validity check module on the server checks whether the format and pronunciations of the dictionary are valid. Once the validity check passes, the dictionary information is written to the server, and all user dictionaries are compiled together according to their index information (for example, an identifier, id) to form the phonetic dictionary library (user dictionary library, UserDicts).
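A validity check of this kind might look like the following Python sketch. The text does not specify the predetermined condition, so the annotation format used here (lowercase pinyin syllables each followed by a tone digit 0 to 4, inferred from Table 1) and the names `find_invalid_entries` and `accept_upload` are assumptions for illustration.

```python
import re
from typing import Dict, List

# Assumed annotation format, inferred from Table 1: one or more syllables,
# each made of letters followed by a tone digit 0-4.
ANNOTATION = re.compile(r"(?:[a-z]+[0-4])+")


def find_invalid_entries(user_dict: Dict[str, str]) -> List[str]:
    """Return the words that are empty or whose annotation breaks the assumed format."""
    return [
        word
        for word, mark in user_dict.items()
        if not word or ANNOTATION.fullmatch(mark.lower()) is None
    ]


def accept_upload(index: str, user_dict: Dict[str, str],
                  library: Dict[str, Dict[str, str]]) -> bool:
    """Write the dictionary into the library only if every entry passes the check."""
    if find_invalid_entries(user_dict):
        return False
    library[index] = dict(user_dict)
    return True


library: Dict[str, Dict[str, str]] = {}
ok = accept_upload("1", {"单于": "Shan4yu2", "亳州": "Bo2zhou1"}, library)
rejected = accept_upload("2", {"亳州": "bo5zhou"}, library)  # tone 5 and missing tone
```

Rejecting the whole upload when any entry fails is a design choice of this sketch; a server could equally report the offending entries and accept the rest.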

The above embodiment adds a verification step, which improves the security of the system.

In an optional embodiment, the processor 103 is further configured to: query the server for the corresponding phonetic dictionary according to the index information; check whether the attributes of the phonetic dictionary are valid; if they are valid, determine that phonetic dictionary to be the one corresponding to the index information; and if they are not valid, treat the query as failed and return to the server to query again, wherein, if the query still fails after a predetermined time or the number of failed queries exceeds a predetermined number, the current query request is abandoned and prompt information is output.

Specifically, in the above embodiment, when obtaining the corresponding phonetic dictionary according to the index information, the processor 103 first queries the server for the phonetic dictionary corresponding to the index information of the text content to be converted. After a phonetic dictionary corresponding to the index information is found, the processor checks whether its attributes are valid. If they are valid, that phonetic dictionary is used as the one corresponding to the index information. If they are not valid, the query has failed and the processor returns to the server to query again. If the query still fails after the predetermined time, or the number of failed queries exceeds the predetermined number, the current query request is abandoned and prompt information is output. In an optional implementation, if no corresponding phonetic dictionary is found, the default pronunciation can be used.
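The query, check and retry flow described above can be sketched as follows. The function `fetch_dictionary`, its parameters and the default limits are hypothetical; the text only says that a predetermined time and a predetermined number of failures exist, not what they are or how validity is decided.

```python
import time
from typing import Callable, Dict, Optional

PhoneticDict = Dict[str, str]


def fetch_dictionary(
    index: str,
    query_server: Callable[[str], Optional[PhoneticDict]],
    is_legal: Callable[[PhoneticDict], bool],
    max_attempts: int = 3,     # assumed "predetermined number"
    timeout_s: float = 2.0,    # assumed "predetermined time"
) -> Optional[PhoneticDict]:
    """Query until a valid dictionary arrives, or give up and return None."""
    deadline = time.monotonic() + timeout_s
    for _ in range(max_attempts):
        if time.monotonic() > deadline:
            break  # predetermined time exceeded
        result = query_server(index)
        if result is not None and is_legal(result):
            return result
        # A missing or invalid result counts as a failed query; query again.
    # Abandon the request and output prompt information; the caller can then
    # fall back to the default pronunciation.
    print(f"no valid phonetic dictionary for index {index!r}; using default pronunciation")
    return None


# The first reply has an empty annotation and is rejected; the second is accepted.
replies = iter([{"亳州": ""}, {"亳州": "bo2zhou1"}])
found = fetch_dictionary("1", lambda _idx: next(replies), lambda d: all(d.values()))
missing = fetch_dictionary("9", lambda _idx: None, lambda d: True)
```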

As an optional implementation, Fig. 2 is a schematic diagram of an optional speech synthesis principle according to an embodiment of the present invention. As shown in Fig. 2, before calling the speech synthesis (TTS) service, users first upload their user dictionaries (user dictionary 1, user dictionary 2, ..., user dictionary N) to the server. After receiving a user dictionary, the server uses the validity check module to check whether the format and pronunciations of the dictionary are valid. Once the validity check passes, all user dictionaries are compiled together according to their index information to form the user dictionary library. In the synthesis phase, the user inputs the dictionary index information together with the text to be synthesized into the TTS (Text to Speech) engine, which then synthesizes speech in the corresponding style.

In an optional embodiment, the communication device 107 is further configured to download phonetic dictionaries from the server to the local device at regular intervals and cache the downloaded phonetic dictionaries, so that, when the corresponding phonetic dictionary is being obtained according to the index information, if it cannot be found in the local cache, the query request is forwarded to the server to obtain the corresponding phonetic dictionary.

Specifically, in the above embodiment, the communication device 107 of the device 10 downloads phonetic dictionaries from the server to the local device at regular intervals and caches them. When the processor 103 obtains the corresponding phonetic dictionary according to the index information, if the corresponding phonetic dictionary cannot be found in the local cache, the query request is forwarded to the server through the communication device 107 to obtain it.
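The cache-first lookup can be sketched in Python as below. The class `DictionaryCache` and the `query_server` callable are hypothetical names; the timer that would trigger the periodic download is omitted, and storing a dictionary in the cache after a miss is an assumption of this sketch rather than something the text states.

```python
from typing import Callable, Dict, Iterable, List, Optional

PhoneticDict = Dict[str, str]


class DictionaryCache:
    """Local cache that forwards the query to the server on a miss."""

    def __init__(self, query_server: Callable[[str], Optional[PhoneticDict]]) -> None:
        self._query_server = query_server
        self._cache: Dict[str, PhoneticDict] = {}

    def refresh(self, indexes: Iterable[str]) -> None:
        """Periodic download step: pull the listed dictionaries into the cache."""
        for index in indexes:
            fetched = self._query_server(index)
            if fetched is not None:
                self._cache[index] = fetched

    def get(self, index: str) -> Optional[PhoneticDict]:
        """Serve from the cache; forward the query to the server on a miss."""
        if index in self._cache:
            return self._cache[index]
        fetched = self._query_server(index)
        if fetched is not None:
            self._cache[index] = fetched  # assumption: keep it for later lookups
        return fetched


SERVER = {"1": {"单于": "shan4yu2"}, "2": {"亳州": "bo2zhou1"}}
server_calls: List[str] = []


def query_server(index: str) -> Optional[PhoneticDict]:
    server_calls.append(index)  # record every round trip to the server
    return SERVER.get(index)


cache = DictionaryCache(query_server)
cache.refresh(["1"])     # timed download, triggered by hand here
hit = cache.get("1")     # answered from the local cache
miss = cache.get("2")    # not cached, so forwarded to the server
```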

Through the foregoing embodiment, using the form of caching, the rate of speech synthesis is improved, and is utilized a large amount of on server Phonetic dictionary carry out speech synthesis, ensure that the validity of speech synthesis.

Embodiment 2

According to an embodiment of the present invention, a method embodiment for synthesizing speech is further provided. It can be applied to various speech synthesis scenarios in which text information is converted into speech, such as the speech synthesis services of Baidu, iFlytek, Jietong Huasheng (SinoVoice), and AISpeech.

Speech synthesis technology, also known as text-to-speech conversion technology and abbreviated as TTS (Text to Speech), mainly converts text information that is generated by the computer itself or input externally (for example, the content of a text file or a Word document) into a speech signal for output according to speech processing rules. TTS technology can convert any text information into fluent spoken output in real time, and draws on multiple disciplines such as acoustics, linguistics, digital signal processing, and computer science. A text-to-speech system can in fact be regarded as an artificial intelligence system. To synthesize high-quality speech, the system must not only rely on various rules, including semantic, lexical, and phonetic rules, but also understand the content of the text well.

As demand for speech synthesis grows, people's requirements for synthesized speech are becoming increasingly diverse. Prosodic parameters differ across application environments, and as expectations for the naturalness and sound quality of synthesized speech keep rising, a speech synthesis system should generate personalized and diverse speech to suit different application scenarios.

On the other hand, the same text can have different pronunciations under special uses. For example, the character "一" (one) is read in the first tone when it is read alone and when it is the last character of a word. When it comes before other characters in a word, its tone changes: it is read in the second tone before a fourth-tone syllable, and in the fourth tone before a first-, second-, or third-tone syllable. As another example, some characters have one pronunciation in everyday use and a different pronunciation when used in a person's name.
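The tone-change rule for "一" described above can be written as a small function. This is a minimal sketch for illustration only: the function name `yi_tone` and the convention of passing the tone number (1 to 4) of the following syllable, or `None` when "一" stands alone or ends a word, are assumptions and not part of the described method. Cases outside the rule as stated (for example, a following neutral-tone syllable) are rejected rather than guessed.

```python
from typing import Optional


def yi_tone(next_tone: Optional[int]) -> int:
    """Return the tone (1-4) used for "一" given the tone of the next syllable.

    next_tone is None when "一" is read alone or is the last syllable of a word.
    """
    if next_tone is None:
        return 1  # read alone or at the end of a word: first tone
    if next_tone == 4:
        return 2  # before a fourth-tone syllable: second tone
    if next_tone in (1, 2, 3):
        return 4  # before a first-, second-, or third-tone syllable: fourth tone
    raise ValueError(f"unsupported tone: {next_tone}")
```

A single fixed pronunciation per character cannot express this kind of context dependence, which is the limitation the following paragraphs address.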

In existing speech synthesis systems, however, a character, word, phrase, or sentence usually has only one pronunciation. This cannot satisfy different application scenarios, and in some special uses the system may even produce a wrong pronunciation.

In the above application environment, this application provides a method for synthesizing speech as shown in Fig. 3. A speech synthesis system based on this method embodiment allows the text information to be converted to be output as speech that matches the application scenario of the text (for example, audiobooks, storytelling, stage plays, speeches, or recitation), the voice style (for example, male voice, female voice, child's voice, humorous style, or serious style), or a special use (for example, personal names or place names), so as to meet users' diverse speech synthesis needs.

Fig. 3 is a flowchart of a method for synthesizing speech according to an embodiment of the present invention. It should be noted that the steps shown in the flowchart can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in a different order. As shown in Fig. 3, the method includes the following steps:

Step S302: receive the text to be converted and the index information.

Specifically, in the above step, the text to be converted can be text information obtained through an input device. The text is not limited to Chinese or English and can be in the language of any country. The index information can be preset identification information used to index at least one phonetic dictionary, and each phonetic dictionary is used to convert the text to be converted into speech for a particular application scenario, voice style, or special use.

It should be noted here that, during speech synthesis, the index information is used to direct the text to be converted to different phonetic dictionaries. As an optional embodiment, if the text to be converted is text entered by the user, the index information can be entered by the user while entering the text. For example, suppose the user inputs "My name is 单于" (单于 being the term for a chief of the Xiongnu in ancient China). Because the pronunciation of "单于" as a name differs from its pronunciation in ordinary use, the user can enter the identification information "1" of the phonetic dictionary dedicated to names immediately before or after "单于". Optionally, the input interface can show the user at least one currently available phonetic dictionary together with the identification information of each, for the user to choose from.

Because the text used for speech synthesis is not necessarily entered by the user and may also be text generated inside the computer, in another optional embodiment the index information can be set automatically by the system based on semantic analysis of the context. Optionally, the text information can be recognized automatically using natural language analysis technology. For example, when a user uses an online translation dictionary and inputs text in one language, the server returns the corresponding text in another language. If the returned text is to be output as speech, the system can automatically identify the use or application scenario and the voice style of the internally generated text, automatically select the corresponding index information, and input the internally generated text together with the index information into the speech synthesis system for synthesis.

It should also be noted that, for the first optional embodiment above, in which the user needs to input the index information, speech can be generated with a general-purpose phonetic dictionary by default, and the user only needs to specify the corresponding phonetic dictionary when entering text that has a special use or application scenario. For example, when a user needs to synthesize speech for a self-introduction, the user first selects a default phonetic dictionary (for example, a general "female voice" phonetic dictionary), which is used for synthesis throughout text input. When text with a special use is encountered, such as a personal name or place name, the user can enter the identification information of the corresponding name or place-name phonetic dictionary while entering the name or place name. After that text has been synthesized, synthesis continues with the default general "female voice" phonetic dictionary, which spares the user from repeatedly entering index information. For the second optional embodiment above, speech can likewise be generated with a general-purpose phonetic dictionary by default, and when text with a special use is recognized, the phonetic dictionary for that use is obtained for speech synthesis.

In an optional embodiment, the text can be text information that the user types directly on a keyboard for conversion into target speech; text obtained by applying recognition processing to an image captured by a scanning device; text obtained from the user's handwriting on a handwriting tablet; or text converted from speech that the user inputs through a microphone or other audio input device. As an optional embodiment, the text to be converted can also be data obtained from a network. For example, when a user uses an online translation dictionary and inputs text in one language, the server returns the corresponding text in another language, and the returned text is output as speech.

Step S304: obtain the corresponding phonetic dictionary according to the index information, where the phonetic dictionaries corresponding to different index information characterize the pronunciation patterns in different application environments.

Specifically, in the above step, the phonetic dictionary can be a speech library used to convert the text to be converted into the pronunciation pattern of a particular application environment, voice style, or special use. The library contains text content and the voice information corresponding to that text content. After the text information to be converted into target speech and the index information are received, the phonetic dictionary corresponding to the target speech is obtained according to the index information, where the phonetic dictionary can be stored locally or on a server.

Step S306: call the speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, and generate the synthesized speech.

Specifically, in the above step, after the phonetic dictionary corresponding to the target speech is obtained according to the index information, the speech synthesis (TTS) service is called to synthesize, based on that phonetic dictionary, the text information into the corresponding target speech.
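Steps S302 to S306 can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation of the invention: the names `PhoneticDictionary`, `DICTIONARIES`, `get_dictionary`, `synthesize`, and `text_to_speech` are hypothetical, the index values "0" and "1" are assumed, and the "synthesis" step merely returns pinyin labels instead of audio. The sample entries use the character "单", which is read dān in ordinary words and Shàn as a surname.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhoneticDictionary:
    """Maps a pronunciation object (here, a single character) to a pronunciation."""
    name: str
    entries: Dict[str, str] = field(default_factory=dict)


# Hypothetical dictionary library keyed by index information.
DICTIONARIES: Dict[str, PhoneticDictionary] = {
    "0": PhoneticDictionary("general", {"单": "dan1"}),
    "1": PhoneticDictionary("names", {"单": "shan4"}),
}


def get_dictionary(index_info: str) -> PhoneticDictionary:
    """Step S304: obtain the phonetic dictionary matching the index information."""
    return DICTIONARIES[index_info]


def synthesize(text: str, dictionary: PhoneticDictionary) -> List[str]:
    """Step S306 stand-in: return one pronunciation label per character.

    Characters missing from the dictionary are passed through unchanged.
    """
    return [dictionary.entries.get(ch, ch) for ch in text]


def text_to_speech(text: str, index_info: str) -> List[str]:
    """Steps S302 to S306: receive the input, look up the dictionary, synthesize."""
    return synthesize(text, get_dictionary(index_info))
```

Calling `text_to_speech("单", "0")` and `text_to_speech("单", "1")` yields different labels for the same character, which mirrors the idea that different index information selects different pronunciation patterns.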

Table 1: Phonetic dictionary format

It should be noted here that the index information can correspond to a single character, word, phrase, or sentence in the text to be converted or, optionally, to the entire text; different indexing functions can be implemented depending on the application scenario or preset conditions. When the index information corresponds only to one character, word, phrase, or sentence, it is added immediately before or after that character, word, phrase, or sentence as it is input. The phonetic dictionary indexed by the index information is then used only to synthesize that character, word, phrase, or sentence, and the rest of the text is synthesized with the default phonetic dictionary. To further identify which characters, words, phrases, or sentences the index information applies to, in one optional embodiment the index information can include the number of characters, or the start and end characters, of the text to be synthesized with the indexed phonetic dictionary. In another optional embodiment, special characters (for example, brackets or quotation marks) can be used to mark off the text to be synthesized with the indexed phonetic dictionary.

For example, suppose the text to be converted is "My name is 单于, and I am reading a 单行本 (single-volume edition)", and speech synthesis uses the general-purpose phonetic dictionary by default. When "单于" is entered, index information pointing to the name phonetic dictionary can be added before it, and "单于" can be enclosed in brackets or quotation marks. When the system then converts the sentence, it uses the name phonetic dictionary only for "单于" and the general-purpose phonetic dictionary for the rest of the text, so that the two occurrences of the character "单" in the sentence are synthesized with different phonetic dictionaries and produce different pronunciations.
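One way to picture the bracket-based marking is a small parser that splits the input into segments, each tagged with the index of the phonetic dictionary to use. The markup below (an index number immediately followed by a bracketed span, such as `1[...]`) is a hypothetical syntax chosen for this sketch; the source only says that index information is placed before or after the text and that brackets or quotation marks can delimit it. The function name `split_by_index` and the default index "0" are likewise assumptions.

```python
import re
from typing import List, Tuple

# Hypothetical markup: an index number immediately followed by a bracketed span,
# e.g. "1[NAME]" means "synthesize NAME with the dictionary whose index is 1".
MARKED_SPAN = re.compile(r"(\d+)\[([^\]]+)\]")


def split_by_index(text: str, default_index: str = "0") -> List[Tuple[str, str]]:
    """Split text into (index, segment) pairs; unmarked text gets default_index."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in MARKED_SPAN.finditer(text):
        if match.start() > pos:
            # Unmarked text before this span uses the default dictionary.
            segments.append((default_index, text[pos:match.start()]))
        segments.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        segments.append((default_index, text[pos:]))
    return segments
```

Each resulting segment can then be synthesized with its own dictionary and the audio concatenated in order. A real system would also need an escape mechanism for literal digits followed by brackets, which this sketch omits.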

Optionally, the system can also automatically identify the characters, words, phrases, or sentences in the text to be converted that need to be synthesized with a specific phonetic dictionary, synthesize a preset number of characters following the identified item with the specific phonetic dictionary indexed by the index information, and synthesize the rest of the text with the default phonetic dictionary. For example, after "name" is recognized in "My name is 单于, and I am reading a 单行本 (single-volume edition)", as an optional embodiment the whole sentence containing "name" can automatically be synthesized with the name phonetic dictionary while the rest of the text is still synthesized with the general-purpose phonetic dictionary, so that the two occurrences of "单" are synthesized with different phonetic dictionaries and produce different pronunciations.

In an optional embodiment, before the corresponding phonetic dictionary is obtained according to the index information, the method can further include the following step. Step S303: upload the dictionary information of a phonetic dictionary to the server, where the server stores at least one phonetic dictionary, each phonetic dictionary includes the uploaded dictionary information, the server generates matching index information after receiving the uploaded dictionary information, and different phonetic dictionaries correspond to different index information.

In an optional embodiment, as shown in Fig. 4, obtaining the corresponding phonetic dictionary according to the index information includes the following steps:

Step S402: query the server according to the index information to obtain the corresponding phonetic dictionary;

Step S404: check whether the attributes of the phonetic dictionary are legal;

Step S406: if they are legal, determine the phonetic dictionary as the one corresponding to the index information;

Step S408: if they are illegal, determine that the query has failed and return to the server to query again, where, if the query still fails within a predetermined time or the number of failed queries exceeds a predetermined number, the current query request is abandoned and prompt information is output.

Specifically, in the above embodiment, when the corresponding phonetic dictionary is obtained according to the index information, the server is first queried for the phonetic dictionary corresponding to the index information of the text to be converted. After that phonetic dictionary is found, its attributes are checked for legality. If they are legal, the phonetic dictionary is taken as the one corresponding to the index information. If they are illegal, the query is deemed to have failed and the server is queried again. If the query still fails after a predetermined time, or the number of failed queries exceeds a predetermined number, the current query request is abandoned and prompt information is output. In an optional embodiment, if no corresponding phonetic dictionary is found, the default pronunciation can be used.
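The query, validation, and retry logic of steps S402 to S408 can be sketched as follows. This is a hedged illustration under stated assumptions: `fetch_dictionary`, `query_server`, `is_legal`, `max_attempts`, and `timeout_s` are hypothetical names, the server query and the legality check are passed in as callables because the source does not define them, and the default limits (3 attempts, 5 seconds) are arbitrary placeholders for the "predetermined number" and "predetermined time".

```python
import time
from typing import Callable, Optional


def fetch_dictionary(
    index_info: str,
    query_server: Callable[[str], Optional[dict]],
    is_legal: Callable[[dict], bool],
    max_attempts: int = 3,
    timeout_s: float = 5.0,
) -> Optional[dict]:
    """Steps S402 to S408: query, validate, and retry within the given limits.

    Returns the dictionary on success, or None once the attempt count or the
    time budget is exhausted. The caller can then output prompt information
    or fall back to the default pronunciation.
    """
    deadline = time.monotonic() + timeout_s
    for _ in range(max_attempts):
        if time.monotonic() > deadline:
            break  # predetermined time exceeded
        candidate = query_server(index_info)  # S402: query the server
        if candidate is not None and is_legal(candidate):  # S404: check attributes
            return candidate  # S406: legal, use this dictionary
        # S408: query failed; loop back and query the server again
    return None
```

Returning `None` rather than raising keeps the choice between prompting the user and using the default pronunciation in the caller's hands, which matches the two outcomes the text describes.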

As an optional embodiment, as shown in Fig. 2, before calling the speech synthesis (TTS) service, the user first uploads user dictionaries (user dictionary 1, user dictionary 2, ..., user dictionary N) to the server. After receiving a user dictionary, the server uses a legitimacy detection module to check whether the format and pronunciations of the dictionary are legal. After the legitimacy check, all user dictionaries can be compiled together according to their index information to form a user dictionary library. In the synthesis phase, the user inputs the text to be synthesized into the TTS (Text to Speech) engine together with the index information of a dictionary, and speech in the corresponding style is synthesized.
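The upload path (legitimacy check, index generation, and compilation into a dictionary library) might look like the sketch below. Everything here is assumed for illustration: the class name `DictionaryLibrary`, the representation of a dictionary as a mapping from text to a pronunciation string, the legality rule (non-empty, with non-empty string keys and values), and the use of sequential numbers as index information. The source states only that format and pronunciation are checked and that the server generates matching index information.

```python
from typing import Dict, Optional


class DictionaryLibrary:
    """Server-side store that validates uploads and assigns index information."""

    def __init__(self) -> None:
        self._by_index: Dict[str, Dict[str, str]] = {}

    @staticmethod
    def is_legal(entries: Dict[str, str]) -> bool:
        # Assumed rule: the dictionary is non-empty and every key and
        # pronunciation is a non-empty string.
        return bool(entries) and all(
            isinstance(k, str) and k and isinstance(v, str) and v
            for k, v in entries.items()
        )

    def upload(self, entries: Dict[str, str]) -> Optional[str]:
        """Return the generated index for a legal dictionary, else None."""
        if not self.is_legal(entries):
            return None
        index_info = str(len(self._by_index) + 1)  # assumed: sequential indexes
        self._by_index[index_info] = dict(entries)
        return index_info

    def get(self, index_info: str) -> Optional[Dict[str, str]]:
        """Look up a stored dictionary by its index information."""
        return self._by_index.get(index_info)
```

Because rejected uploads never receive an index, every index handed back to a user refers to a dictionary that has already passed the legitimacy check.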

In an optional embodiment, the method further includes step S502: periodically download phonetic dictionaries from the server to the local device and cache them, so that, when the corresponding phonetic dictionary is obtained according to the index information, the query request is forwarded to the server if the corresponding phonetic dictionary cannot be found in the local cache.

Specifically, in the above embodiment, phonetic dictionaries are periodically downloaded from the server and cached locally. When the corresponding phonetic dictionary is obtained according to the index information and cannot be found in the local cache, the query request is forwarded to the server to obtain it.

The solutions disclosed in the above embodiments of this application can achieve the following technical effects: first, by mapping different scenarios to different user dictionaries, different characters, words, phrases, and sentences can be given specific pronunciations; second, when the speech synthesis service is called, the corresponding pronunciation index information is input together with the text to be synthesized, which enables diverse pronunciations.
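The cache-then-server lookup can be sketched as a small class. This is an assumption-laden illustration: `DictionaryCache`, `download_all`, and `query_server` are hypothetical names, the periodic timer is left to the caller (who would invoke `refresh()` on a schedule), and storing a server result in the cache after a miss is a design choice of this sketch that the source does not mention.

```python
from typing import Callable, Dict, Optional


class DictionaryCache:
    """Local cache refreshed from the server, with server fallback on a miss."""

    def __init__(
        self,
        download_all: Callable[[], Dict[str, dict]],
        query_server: Callable[[str], Optional[dict]],
    ) -> None:
        self._download_all = download_all
        self._query_server = query_server
        self._cache: Dict[str, dict] = {}

    def refresh(self) -> None:
        """Step S502: call this from a timer to pull dictionaries into the cache."""
        self._cache.update(self._download_all())

    def get(self, index_info: str) -> Optional[dict]:
        """Check the local cache first; forward the query to the server on a miss."""
        hit = self._cache.get(index_info)
        if hit is not None:
            return hit
        fetched = self._query_server(index_info)
        if fetched is not None:
            self._cache[index_info] = fetched  # sketch-only choice: keep for later
        return fetched
```

Serving repeated lookups locally is what gives the speed benefit the embodiment attributes to caching, while the fallback keeps the full server-side library reachable.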

Embodiment 3

According to an embodiment of the present invention, a system embodiment for synthesizing speech is further provided. Fig. 5 is a schematic diagram of a system for synthesizing speech according to an embodiment of the present invention. As shown in Fig. 5, the system includes a front-end device 501 and a server 503.

The front-end device 501 is configured to receive the input text to be converted and index information;

The server 503 is connected to the front-end device and is configured to receive the text to be converted and the index information, and to return the phonetic dictionary obtained according to the index information to the front-end device, where the phonetic dictionaries corresponding to different index information characterize the pronunciation patterns in different application environments;

The front-end device 501 is further configured to call the speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, and to generate the synthesized speech.

Specifically, the front-end device can be a computer, notebook, tablet computer, mobile phone, or other smart mobile device capable of providing voice services. Through the front-end device, the user can input the text to be converted and index information specifying the application scenario, voice style, or special use, and the front-end device sends them to the server. After receiving the text to be converted and the index information, the server obtains the corresponding phonetic dictionary according to the index information and returns it to the front-end device, which calls the TTS service to synthesize the text to be converted into speech using the returned phonetic dictionary.

In an optional embodiment, the front-end device 501 is further configured to upload the dictionary information of a phonetic dictionary to the server, where the server stores at least one phonetic dictionary, each phonetic dictionary includes the uploaded dictionary information, the server generates matching index information after receiving the uploaded dictionary information, and different phonetic dictionaries correspond to different index information.

In an optional embodiment, after receiving the uploaded dictionary information, the server 503 checks whether the format and/or pronunciation of the pronunciation objects contained in the dictionary information meets a predetermined condition, and if so, writes the dictionary information into the corresponding index library.

In an optional embodiment, the front-end device 501 is further configured to: query the server according to the index information to obtain the corresponding phonetic dictionary; check whether the attributes of the phonetic dictionary are legal; if they are legal, determine the phonetic dictionary as the one corresponding to the index information; and if they are illegal, determine that the query has failed and return to the server to query again, where, if the query still fails within a predetermined time or the number of failed queries exceeds a predetermined number, the current query request is abandoned and prompt information is output.

In an optional embodiment, the front-end device 501 is further configured to periodically download phonetic dictionaries from the server to the local device and cache them, so that, when the corresponding phonetic dictionary is obtained according to the index information, the query request is forwarded to the server if the corresponding phonetic dictionary cannot be found in the local cache.

Embodiment 4

According to an embodiment of the present invention, an apparatus embodiment for implementing the above method for synthesizing speech is further provided. Fig. 6 is a schematic diagram of an apparatus for synthesizing speech according to an embodiment of the present invention. As shown in Fig. 6, the apparatus includes a receiving module 601, an obtaining module 603, and a generation module 605.

The receiving module 601 is configured to receive the text to be converted and index information;

The obtaining module 603 is configured to obtain the corresponding phonetic dictionary according to the index information, where the phonetic dictionaries corresponding to different index information characterize the pronunciation patterns in different application environments;

The generation module 605 is configured to call the speech synthesis service to process the text to be converted and the corresponding phonetic dictionary, and to generate the synthesized speech.

It should be noted here that the receiving module 601, the obtaining module 603, and the generation module 605 correspond to steps S302 to S306 in Embodiment 2. The three modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 2.

In an optional embodiment, the apparatus further includes an uploading module configured to upload the dictionary information of a phonetic dictionary to the server, where the server stores at least one phonetic dictionary, each phonetic dictionary includes the uploaded dictionary information, the server generates matching index information after receiving the uploaded dictionary information, and different phonetic dictionaries correspond to different index information.

It should be noted here that the uploading module corresponds to step S303 in Embodiment 2. The module implements the same examples and application scenarios as the corresponding step, but is not limited to the content disclosed in Embodiment 2.

In an optional embodiment, the obtaining module further includes: a query module configured to query the server according to the index information to obtain the corresponding phonetic dictionary; a detection module configured to check whether the attributes of the phonetic dictionary are legal; a first execution module configured to, if they are legal, determine the phonetic dictionary as the one corresponding to the index information; and a second execution module configured to, if they are illegal, determine that the query has failed and return to the server to query again, where, if the query still fails within a predetermined time or the number of failed queries exceeds a predetermined number, the current query request is abandoned and prompt information is output.

It should be noted here that the query module, the detection module, the first execution module, and the second execution module correspond to steps S402 to S408 in Embodiment 2. The four modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 2.

In an optional embodiment, the apparatus is further configured to periodically download phonetic dictionaries from the server to the local device and cache them, so that, when the corresponding phonetic dictionary is obtained according to the index information, the query request is forwarded to the server if the corresponding phonetic dictionary cannot be found in the local cache.

It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

From the above description of the embodiments, those skilled in the art can clearly understand that the method for synthesizing speech according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which can be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

Embodiment 5

An embodiment of the present invention can provide a computer terminal, which can be any computer terminal device in a group of computer terminals. Optionally, in this embodiment, the computer terminal can also be replaced with a terminal device such as a mobile terminal.

Optionally, in this embodiment, the computer terminal can be located in at least one of multiple network devices of a computer network.

Fig. 7 shows a hardware block diagram of a computer terminal. As shown in Fig. 7, the computer terminal 70 can include one or more processors 702 (shown in the figure as 702a, 702b, ..., 702n; a processor 702 can include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 704 for storing data, and a transmission device 706 for communication. In addition, it can include a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which can be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. Those of ordinary skill in the art will understand that the structure shown in Fig. 7 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 70 can include more or fewer components than shown in Fig. 7, or have a configuration different from that shown in Fig. 7.

It should be noted that the one or more processors 702 and/or other data processing circuits can generally be referred to herein as "data processing circuits". A data processing circuit can be embodied wholly or partly as software, hardware, firmware, or any combination thereof. In addition, a data processing circuit can be a single independent processing module, or be wholly or partly integrated into any of the other elements of the computer terminal 70. As involved in the embodiments of this application, the data processing circuit acts as a kind of processor control (for example, selecting a variable-resistance terminal path connected to an interface).

The processor 702 can call, through the transmission device, the information and application programs stored in the memory to execute the following steps: obtain a sliding window sequence of a key, where the sliding window sequence includes multiple sliding windows obtained by applying sliding-window processing to the key; scramble at least one sliding window in the sliding window sequence to obtain a scrambled sliding window sequence; and traverse the scrambled sliding window sequence and post-process it using a Montgomery modular multiplier.

Memory 704 can be used for storing the software program and module of application software, such as the key in the embodiment of the present invention The corresponding program instruction/data storage device of processing method, processor 702 by operation be stored in it is soft in memory 704 Part program and module realize the key of above-mentioned application program thereby executing various function application and data processing Processing method.Memory 704 may include high speed random access memory, may also include nonvolatile memory, such as one or more Magnetic storage device, flash memory or other non-volatile solid state memories.In some instances, memory 704 can be wrapped further The memory remotely located relative to processor 702 is included, these remote memories can pass through network connection to terminal 70.The example of above-mentioned network includes but is not limited to internet, intranet, local area network, mobile radio communication and combinations thereof.

Transmitting device 706 is used to that data to be received or sent via a network.Above-mentioned network specific example may include The wireless network that the communication providers of terminal 70 provide.In an example, transmitting device 706 includes that a network is suitable Orchestration (Network Interface Controller, NIC), can be connected by base station with other network equipments so as to Internet is communicated.In an example, transmitting device 706 can be radio frequency (Radio Frequency, RF) module, For wirelessly being communicated with internet.

Display can such as touch-screen type liquid crystal display (LCD), the liquid crystal display aloow user with The user interface of terminal 70 interacts.

It should be noted here that, in some optional embodiments, the terminal 70 shown in Fig. 7 may include hardware elements (including circuits), software elements (including computer code stored on a computer-readable medium), or a combination of hardware and software elements. It should be pointed out that Fig. 7 is only one example of a particular embodiment and is intended to show the types of components that may be present in the terminal 70.

In this embodiment, the terminal 70 can execute program code for the following steps in the method for synthesizing voice of the application program: receiving a currently input question; obtaining at least one candidate answer to the question based on a retrieval model, and obtaining a first answer to the question based on a generative model, where the retrieval model is a model that obtains results based on search technology and the generative model is a model that obtains results based on a trained model; and performing evaluation at least according to the first answer and the at least one candidate answer to generate an output answer to the question.

Optionally, the processor can call, through the transmission device, the information and application programs stored in the memory to execute the following steps: receiving the text to be converted and index information; obtaining the corresponding phonetic dictionary according to the index information, where the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments; and calling a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

Optionally, the phonetic dictionary is used to record the different pronunciations of the same pronunciation object under different application environments, where the pronunciation object includes at least one of the following: a character, a word, a phrase, and a sentence.
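As an illustration only, the idea of one pronunciation object having different pronunciations in different application environments could be modeled as a nested mapping. The environment names, the entries, and the `pronounce` helper below are hypothetical and do not come from the embodiments.

```python
# Hypothetical layout: application environment -> {pronunciation object -> pronunciation}.
# The environments and entries are illustrative assumptions only.
PHONETIC_DICTIONARIES = {
    "navigation": {"St.": "street", "Dr.": "drive"},
    "medical": {"St.": "saint", "Dr.": "doctor"},
}


def pronounce(environment: str, obj: str) -> str:
    """Return the pronunciation of obj in the given environment.

    Falls back to the object itself when the environment or the object
    is not listed (a simplifying assumption of this sketch).
    """
    return PHONETIC_DICTIONARIES.get(environment, {}).get(obj, obj)
```

In this sketch the same object `"Dr."` resolves to a different pronunciation depending on which dictionary is selected.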

Optionally, the processor can also execute program code for the following steps: uploading the dictionary information of a phonetic dictionary to a server, where the server stores at least one phonetic dictionary, each phonetic dictionary includes the uploaded dictionary information, the server generates matching index information after receiving the uploaded dictionary information, and different phonetic dictionaries correspond to different index information.
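A minimal sketch of the upload step, assuming an in-memory server and assuming that the index information is a random identifier. The embodiments do not specify how the index information is generated, so the `DictionaryServer` class and its use of `uuid` are illustrative choices.

```python
import uuid
from typing import Dict, Optional


class DictionaryServer:
    """Hypothetical in-memory stand-in for the server that stores phonetic dictionaries."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[str, str]] = {}

    def upload(self, dictionary_info: Dict[str, str]) -> str:
        """Store the uploaded dictionary information and return its index information."""
        index = uuid.uuid4().hex  # assumption: a random identifier serves as the index
        self._store[index] = dict(dictionary_info)
        return index

    def query(self, index: str) -> Optional[Dict[str, str]]:
        """Return the phonetic dictionary for the index, or None if it is unknown."""
        return self._store.get(index)
```

Each upload yields its own index, so different phonetic dictionaries correspond to different index information.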

Optionally, after the server receives the uploaded dictionary information, the processor can also execute program code for the following steps: detecting whether the format and/or pronunciation of the pronunciation objects included in the dictionary information meets a predetermined condition, and, if so, determining to write the dictionary information into the corresponding index database.
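The check before writing could look like the sketch below. The predetermined condition is not defined in the text, so the rule used here (every pronunciation object and every pronunciation is a non-empty string) is purely an assumption, as are the function names.

```python
from typing import Dict


def meets_condition(dictionary_info: Dict[str, str]) -> bool:
    """Assumed condition: at least one entry, all keys and values non-empty strings."""
    if not dictionary_info:
        return False
    for obj, pronunciation in dictionary_info.items():
        if not isinstance(obj, str) or not obj.strip():
            return False
        if not isinstance(pronunciation, str) or not pronunciation.strip():
            return False
    return True


def write_if_valid(index_store: Dict[str, Dict[str, str]], index: str,
                   dictionary_info: Dict[str, str]) -> bool:
    """Write the dictionary information into the index store only if it passes the check."""
    if not meets_condition(dictionary_info):
        return False
    index_store[index] = dict(dictionary_info)
    return True
```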

Optionally, the processor can also execute program code for the following steps: querying the server according to the index information to obtain the corresponding phonetic dictionary; detecting whether the attributes of the phonetic dictionary are legal; if they are legal, determining the phonetic dictionary corresponding to the index information; and if they are not legal, determining that the query has failed and returning to the server to perform the query operation again, where, if the query fails within a predetermined time or the number of failed queries exceeds a predetermined number, the current query request is discarded and prompt information is output.
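The query, legality check, and bounded retry can be sketched as follows. This sketch only counts failed attempts and omits the predetermined time window for brevity; the `query` and `is_legal` callables, the `QueryAbandoned` exception, and the default of three attempts are all hypothetical.

```python
from typing import Callable, Dict, Optional


class QueryAbandoned(Exception):
    """Raised when the query request is discarded; its message is the prompt information."""


def fetch_dictionary(index: str,
                     query: Callable[[str], Optional[Dict[str, str]]],
                     is_legal: Callable[[Dict[str, str]], bool],
                     max_attempts: int = 3) -> Dict[str, str]:
    """Query the server until a legal dictionary is returned or attempts run out."""
    for _ in range(max_attempts):
        dictionary = query(index)
        if dictionary is not None and is_legal(dictionary):
            return dictionary
        # Illegal or missing result: treat as a failed query and ask the server again.
    raise QueryAbandoned(
        f"no legal phonetic dictionary for index {index!r} after {max_attempts} attempts"
    )
```

A caller would catch `QueryAbandoned` and show its message to the user as the prompt information.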

Optionally, the processor can also execute program code for the following steps: periodically downloading phonetic dictionaries from the server to the local device and caching the downloaded phonetic dictionaries, so that, in the process of obtaining the corresponding phonetic dictionary according to the index information, if the corresponding phonetic dictionary cannot be found in the local cache, the query request is forwarded to the server to obtain the corresponding phonetic dictionary.
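The cache-then-server lookup might be sketched as below. The scheduling of the periodic download is left to the caller (for example a timer that calls `refresh`), and the `download_all` and `query_server` callables are hypothetical stand-ins for the server interface.

```python
from typing import Callable, Dict, Optional

Dictionary = Dict[str, str]


class CachedDictionaryClient:
    """Local cache of phonetic dictionaries with fallback to the server on a miss."""

    def __init__(self,
                 download_all: Callable[[], Dict[str, Dictionary]],
                 query_server: Callable[[str], Optional[Dictionary]]) -> None:
        self._download_all = download_all
        self._query_server = query_server
        self._cache: Dict[str, Dictionary] = {}

    def refresh(self) -> None:
        """Replace the local cache with a fresh download; meant to be called on a timer."""
        self._cache = dict(self._download_all())

    def get(self, index: str) -> Optional[Dictionary]:
        """Look in the local cache first; forward the query to the server on a miss."""
        if index in self._cache:
            return self._cache[index]
        return self._query_server(index)
```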

Those of ordinary skill in the art will understand that the structure shown in Fig. 7 is only illustrative. The terminal may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID), or a PAD. Fig. 7 does not limit the structure of the above electronic device. For example, the terminal 70 may include more or fewer components than shown in Fig. 7 (such as a network interface or a display device), or have a configuration different from that shown in Fig. 7.

Those of ordinary skill in the art will appreciate that all or some of the steps in the methods of the above embodiments can be completed by a program instructing the hardware of a terminal device. The program can be stored in a computer-readable storage medium, and the storage medium may include a flash disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, and the like.

Embodiment 6

An embodiment of the present invention also provides a storage medium. Optionally, in this embodiment, the storage medium can be used to store the program code executed by the method for synthesizing voice provided in Embodiment 1 above.

Optionally, in this embodiment, the storage medium can be located in any computer terminal of a computer terminal group in a computer network, or in any mobile terminal of a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: receiving the text to be converted and index information; obtaining the corresponding phonetic dictionary according to the index information, where the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments; and calling a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: uploading the dictionary information of a phonetic dictionary to a server, where the server stores at least one phonetic dictionary, each phonetic dictionary includes the uploaded dictionary information, the server generates matching index information after receiving the uploaded dictionary information, and different phonetic dictionaries correspond to different index information.

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: detecting whether the format and/or pronunciation of the pronunciation objects included in the dictionary information meets a predetermined condition, and, if so, determining to write the dictionary information into the corresponding index database.

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: querying the server according to the index information to obtain the corresponding phonetic dictionary; detecting whether the attributes of the phonetic dictionary are legal; if they are legal, determining the phonetic dictionary corresponding to the index information; and if they are not legal, determining that the query has failed and returning to the server to perform the query operation again, where, if the query fails within a predetermined time or the number of failed queries exceeds a predetermined number, the current query request is discarded and prompt information is output.

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: periodically downloading phonetic dictionaries from the server to the local device and caching the downloaded phonetic dictionaries, so that, in the process of obtaining the corresponding phonetic dictionary according to the index information, if the corresponding phonetic dictionary cannot be found in the local cache, the query request is forwarded to the server to obtain the corresponding phonetic dictionary.

Embodiment 7

An embodiment of the present invention also provides an embodiment of a method for synthesizing voice, which can be applied to various speech synthesis systems or devices that convert text information into voice, including but not limited to the application scenario of Embodiment 1. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in an order different from the one given here.

Fig. 8 is a flowchart of a method for synthesizing voice according to an embodiment of the present invention. As shown in Fig. 8, the method includes the following steps:

Step S802: receive the text to be converted and index information;

Step S804: obtain the corresponding phonetic dictionary according to the index information, where the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments;

Step S806: process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

Specifically, in the above steps, the text to be converted can be text information entered through an input device such as a keyboard, text information generated inside a computer, or text information returned by an application or Web-based service (for example, Baidu Translate). The text is not limited to Chinese or English and can be in the language of any country. The index information can be preset identification information used to index one or more phonetic dictionaries. The phonetic dictionary can be a voice library used to convert the text to be converted into the pronunciation pattern of a particular application environment, voice style, or special purpose; the voice library contains text content and the voice information corresponding to that text content. After the text information to be converted into the target voice and the index information are received, the phonetic dictionary corresponding to the target voice is obtained according to the index information of the target voice, and the text to be converted and the corresponding phonetic dictionary are then processed to generate the voice corresponding to the text to be converted.
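Steps S802 to S806 can be sketched end to end as follows. This is only an illustration under strong assumptions: the text is split on whitespace, the phonetic dictionary is applied by substituting listed pronunciation objects, and `tts` is a hypothetical callable standing in for whatever speech synthesis service is used.

```python
from typing import Callable, Dict


def synthesize(text: str,
               index: str,
               dictionaries: Dict[str, Dict[str, str]],
               tts: Callable[[str], bytes]) -> bytes:
    """Sketch of S802 to S806: receive text and index, look up the dictionary, synthesize."""
    # S804: obtain the phonetic dictionary that corresponds to the index information.
    if index not in dictionaries:
        raise KeyError(f"no phonetic dictionary for index {index!r}")
    dictionary = dictionaries[index]

    # Apply the dictionary: replace each listed pronunciation object with its pronunciation.
    # Whitespace tokenization is a simplification assumed by this sketch.
    rewritten = " ".join(dictionary.get(token, token) for token in text.split())

    # S806: hand the rewritten text to the (hypothetical) speech synthesis service.
    return tts(rewritten)
```

With a stand-in `tts` that simply encodes its input, the same text yields different output depending on which index is supplied, which mirrors the behavior described above.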

It should be noted that the phonetic dictionary can be a phonetic dictionary stored locally or a phonetic dictionary on a server.

As can be seen from the above, in this embodiment of the present application, phonetic dictionaries of pronunciation patterns under different application environments are established. During speech synthesis, the text to be converted into the target voice and the index information of the phonetic dictionary corresponding to the target voice are received; the phonetic dictionary corresponding to the target voice is obtained according to the index information; the target voice of the text to be converted is synthesized based on the phonetic dictionary; and the synthesized target voice is output. This achieves the purpose of converting the same text content into voice for different application scenarios within one speech synthesis system, realizes the technical effect of a more intelligent and diversified speech synthesis service, and thereby solves the technical problem that the voice generated by existing speech synthesis systems has low accuracy.

Embodiment 8

According to an embodiment of the present invention, an apparatus embodiment for implementing the above method for synthesizing voice is also provided. Fig. 9 is a schematic diagram of an apparatus for synthesizing voice according to an embodiment of the present invention. As shown in Fig. 9, the apparatus includes a receiving unit 901, an acquiring unit 903, and a generation unit 905.

The receiving unit 901 is used to receive the text to be converted and index information;

the acquiring unit 903 is used to obtain the corresponding phonetic dictionary according to the index information, where the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments;

the generation unit 905 is used to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

It should be noted here that the receiving unit 901, the acquiring unit 903, and the generation unit 905 can correspond to steps S802 to S806 in Embodiment 7. The three modules share the examples and application scenarios realized by the corresponding steps, but are not limited to the content disclosed in Embodiment 7 above.

As can be seen from the above, in this embodiment of the present application, phonetic dictionaries of pronunciation patterns under different application environments are established. During speech synthesis, the receiving unit 901 receives the text to be converted into the target voice and the index information of the phonetic dictionary corresponding to the target voice; the acquiring unit 903 obtains the phonetic dictionary corresponding to the target voice according to the index information; and the generation unit 905 synthesizes the target voice of the text to be converted based on the phonetic dictionary. This achieves the purpose of converting the same text content into voice for different application scenarios within one speech synthesis system, realizes the technical effect of a more intelligent and diversified speech synthesis service, and thereby solves the technical problem that the voice generated by existing speech synthesis systems has low accuracy.

Embodiment 9

According to an embodiment of the present invention, a system embodiment is also provided. The system includes: a processor; and a memory, connected to the processor and used to provide the processor with instructions for the following processing steps:

receiving the text to be converted and index information;

obtaining the corresponding phonetic dictionary according to the index information, where the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments;

calling a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

The serial numbers of the above embodiments of the invention are for description only and do not represent the relative merits of the embodiments.

In the above embodiments of the invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed can be indirect coupling or communication connection through some interfaces, units, or modules, and can be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which can be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A device for synthesizing voice, characterized by comprising:

an input device, used to receive the text to be converted and index information;

a processor, used to obtain the corresponding phonetic dictionary according to the index information, and to call a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments;

a sound output device, used to output the synthesized voice.

2. A method for synthesizing voice, characterized by comprising:

receiving the text to be converted and index information;

obtaining the corresponding phonetic dictionary according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments;

calling a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

3. The method according to claim 2, characterized in that the phonetic dictionary is used to record the different pronunciations of the same pronunciation object under different application environments, wherein the pronunciation object includes at least one of the following: a character, a word, a phrase, and a sentence.

4. The method according to claim 2, characterized in that, before the corresponding phonetic dictionary is obtained according to the index information, the method further comprises:

uploading the dictionary information of a phonetic dictionary to a server, wherein the server stores at least one phonetic dictionary, each phonetic dictionary includes the uploaded dictionary information, the server generates matching index information after receiving the uploaded dictionary information, and different phonetic dictionaries correspond to different index information.

5. The method according to claim 4, characterized in that, after receiving the uploaded dictionary information, the server detects whether the format and/or pronunciation of the pronunciation objects included in the dictionary information meets a predetermined condition, and, if so, determines to write the dictionary information into the corresponding index database.

6. The method according to claim 4 or 5, characterized in that obtaining the corresponding phonetic dictionary according to the index information comprises:

querying the server according to the index information to obtain the corresponding phonetic dictionary;

detecting whether the attributes of the phonetic dictionary are legal;

if they are legal, determining the phonetic dictionary corresponding to the index information;

if they are not legal, determining that the query has failed and returning to the server to perform the query operation again, wherein, if the query fails within a predetermined time or the number of failed queries exceeds a predetermined number, the current query request is discarded and prompt information is output.

7. The method according to claim 4, characterized in that phonetic dictionaries are periodically downloaded from the server to the local device and the downloaded phonetic dictionaries are cached, so that, in the process of obtaining the corresponding phonetic dictionary according to the index information, if the corresponding phonetic dictionary cannot be found in the local cache, the query request is forwarded to the server to obtain the corresponding phonetic dictionary.

8. A system for synthesizing voice, characterized by comprising:

a front-end device, used to receive the input text to be converted and index information;

a server, connected to the front-end device, used to receive the text to be converted and the index information, and to return the phonetic dictionary obtained according to the index information to the front-end device, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments;

the front-end device being further used to call a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

9. An apparatus for synthesizing voice, characterized by comprising:

a receiving module, used to receive the text to be converted and index information;

an obtaining module, used to obtain the corresponding phonetic dictionary according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments;

a generation module, used to call a speech synthesis service to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

10. The apparatus according to claim 9, characterized in that the obtaining module comprises:

a query module, used to query a server according to the index information to obtain the corresponding phonetic dictionary;

a detection module, used to detect whether the attributes of the phonetic dictionary are legal;

a first execution module, used to determine, if the attributes are legal, the phonetic dictionary corresponding to the index information;

a second execution module, used to determine, if the attributes are not legal, that the query has failed and to return to the server to perform the query operation again, wherein, if the query fails within a predetermined time or the number of failed queries exceeds a predetermined number, the current query request is discarded and prompt information is output.

11. A method for synthesizing voice, characterized by comprising:

receiving the text to be converted and index information;

processing the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

12. An apparatus for synthesizing voice, characterized by comprising:

a receiving unit, used to receive the text to be converted and index information;

an acquiring unit, used to obtain the corresponding phonetic dictionary according to the index information, wherein the phonetic dictionaries corresponding to different index information characterize pronunciation patterns under different application environments;

a generation unit, used to process the text to be converted and the corresponding phonetic dictionary to generate the synthesized voice.

13. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium is located is controlled to execute the method for synthesizing voice according to any one of claims 2 to 7.

14. A processor, characterized in that the processor is used to run a program, wherein the method for synthesizing voice according to any one of claims 2 to 7 is executed when the program runs.

15. A system for synthesizing voice, characterized by comprising:

a processor; and

a memory, connected to the processor and used to provide the processor with instructions for the following processing steps:

receiving the text to be converted and index information;