CN102324231A

CN102324231A - Game dialogue voice synthesizing method and system

Info

Publication number: CN102324231A
Application number: CN201110251459A
Authority: CN
Inventors: 李健; 刘畅; 武卫东
Original assignee: JIETONG HUASHENG SPEECH TECHNOLOGY Co Ltd
Current assignee: JIETONG HUASHENG SPEECH TECHNOLOGY Co Ltd
Priority date: 2011-08-29
Filing date: 2011-08-29
Publication date: 2012-01-18

Abstract

The application provides a game dialogue speech synthesis method and system, belonging to the field of game speech synthesis. The method comprises: sending a dialogue text input by a user to a speech cloud through an interface by a game application side; performing natural language processing on the dialogue text received by the speech cloud to obtain prosodic structure information corresponding to the dialogue text input by the user; synthesizing mashup speech data corresponding to the dialogue text from the obtained prosodic structure information in combination with the mashup custom speech library of the speech cloud; sending the obtained mashup speech data to the game application side through the interface; and playing the received mashup speech data at the application side. With this method and system, the dialogue text that a user inputs at the game application side can be played in a variety of voice styles, which makes the game considerably friendlier.

Description

Game dialogue speech synthesis method and system

Technical field

The application belongs to the field of game speech synthesis, and in particular relates to a game dialogue speech synthesis method and system.

Background technology

With the rapid development of terminals such as smartphones and tablets, together with the existing popularity of PCs and netbooks, game services for all kinds of platforms have sprung up like mushrooms after rain. These games have exquisite graphics, attractive character designs and spectacular character movements. A player takes part in the game's various scenes by controlling a game character, and can communicate with others by typing dialogue text.

In the prior art, however, when a character in such a game needs to speak, it can either say only a few fixed lines or is simply mute. Players can therefore communicate only through simple fixed lines or plain text. In networked competitive games such as "Dou Dizhu" (Fight the Landlord), "Mahjong" and "Lianliankan", players can only exchange text, which makes the games considerably less friendly.

Summary of the invention

The technical problem to be solved by the application is to provide a dialogue speech synthesis method and device, so that a player's game character can vividly speak the text that the player inputs.

To solve the above problem, the application discloses a game dialogue speech synthesis method, comprising:

Step 110: the game application side sends the dialogue text input by the user to the speech cloud through an interface;

Step 120: natural language processing is performed on the dialogue text received by the speech cloud to obtain the prosodic structure information corresponding to the dialogue text input by the user;

Step 130: from the obtained prosodic structure information, mashup speech data corresponding to the dialogue text is synthesized in combination with the mashup custom speech library of the speech cloud;

Step 140: the obtained mashup speech data is sent to the game application side through the interface;

Step 150: the application side plays the received mashup speech data.

Preferably, step 130 specifically comprises:

Step 131: the obtained prosodic structure information is matched against the custom texts in the mashup custom speech library; if it matches, go to step 132; if it does not match, go to step 133;

Step 132: custom speech data is synthesized by using the matched custom text to call the custom speech data in the mashup custom speech library;

Step 133: general speech data is synthesized from the unmatched prosodic structure information using general speech synthesis technology;

Step 134: the custom speech data and the general speech data are adjusted and combined into the mashup speech data corresponding to the text.

Preferably, step 132 comprises:

Using the matched custom text to call the speech fragments stored in the mashup custom speech library, and then decoding the retrieved speech fragments to obtain the custom speech data.

Preferably, each custom text in the mashup custom speech library corresponds to a plurality of speech fragments that have the same meaning as the custom text but differ in style.

Preferably, step 110 further comprises: the game application side sends the dialogue voice style selected by the user to the speech cloud through the interface.

Preferably, the dialogue voice style comprises timbre, and/or dialect, and/or tone of voice.

Preferably, step 130 further comprises: according to the dialogue voice style selected by the user or the default style, synthesizing from the obtained prosodic structure information, in combination with the mashup custom speech library of the speech cloud, the mashup speech data corresponding to the dialogue text.

Correspondingly, the application also discloses a game dialogue speech synthesis system, comprising:

A game application side and a speech cloud;

The game application side is used to send the dialogue text input by the user, together with the voice style selected by the user or a default-style instruction, to the speech cloud, and to receive the mashup speech data synthesized by the speech cloud and play it;

The speech cloud is used to provide an interface, to receive the dialogue text and the user-selected voice style or default-style instruction sent by the game application side, and to synthesize the dialogue text into mashup speech data and send it to the game application side.

Preferably, the speech cloud comprises:

A natural language processing module, used to obtain the prosodic structure information corresponding to the text input by the user;

A mashup speech synthesis module, used to synthesize the obtained prosodic structure information into mashup speech data.

Preferably, the mashup speech synthesis module comprises:

A text matching submodule, a custom speech synthesis submodule, a general speech synthesis submodule, and a speech adjustment and synthesis submodule;

The text matching submodule is used to match the prosodic structure information against the custom texts in the mashup custom speech library, yielding the prosodic structure information that matches a custom text and the prosodic structure information that matches none;

The custom speech synthesis submodule is used to synthesize custom speech data for the prosodic structure information that matches a custom text;

The general speech synthesis submodule is used to synthesize general speech data for the prosodic structure information that matches no custom text;

The speech adjustment and synthesis submodule is used to combine the custom speech data and the general speech data into mashup speech data in the processing order of the user's input text.

Compared with the prior art, the application has the following advantages:

In the application, the speech cloud synthesizes the in-game dialogue text sent by the game application side into dialogue speech and sends it back to the application side. The dialogue text that a user inputs at the game application side can thus be played in a variety of voice styles, which makes the game considerably friendlier.

Description of drawings

Fig. 1 is a schematic flowchart of a game dialogue speech synthesis method of the application;

Fig. 2 is a schematic flowchart of a preferred game dialogue speech synthesis method of the application;

Fig. 3 is a schematic structural diagram of a game dialogue speech synthesis system of the application.

Embodiment

To make the above objects, features and advantages of the application more apparent and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 shows a schematic flowchart of a game dialogue speech synthesis method of the application.

The method flow comprises:

Step 110: the game application side sends the dialogue text input by the user to the speech cloud through an interface.

At the game application side, when a user wants to communicate with other users, the user can type dialogue text in the game, and the game application side then sends the dialogue text to the speech cloud in real time through the interface that the speech cloud provides. For example, at the game application side of a networked competitive game such as "Dou Dizhu" (Fight the Landlord), "Mahjong" or "Lianliankan", the dialogue text that the user types in the game can be sent by the game application side to the speech cloud through the interface.

In addition, the user can select the desired dialogue voice style at the game application side, which the game application side sends to the speech cloud through the interface. Alternatively, the default voice style can be used, in which case the speech cloud selects a voice style at random during synthesis.

In practice, the game application side interacts with the speech cloud through the speech cloud's API (Application Programming Interface). After the game application side sends information to the speech cloud, the speech cloud enters a response-initialized state with respect to that game application side.

Here, the dialogue text input by the user can consist of characters such as Chinese, English, Arabic numerals or Roman numerals.

Chinese is used below merely as an example.

In practice, the user can also select the dialogue voice style through the application side, and the selected style is then sent to the speech cloud through the interface so that the speech cloud synthesizes speech in the style it receives. Alternatively, the default voice style can be used, in which case the speech cloud selects a voice style at random during synthesis. The dialogue voice style comprises timbre, and/or dialect, and/or tone of voice. For example, the dialogue voice can be set to an adult male voice, an adult female voice, a boy's voice, a girl's voice, and so on. On top of this, a dialect can be chosen by the client as desired or selected at random, such as the Sichuan, Guangdong, Northeastern or Hunan dialect.
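The information sent in step 110 can be pictured as a small request record. The following is a minimal sketch only: the class names, field names, style values and the JSON encoding are assumptions made for illustration, since the application does not specify a wire format for the interface.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class VoiceStyle:
    """Hypothetical style record: timbre, and/or dialect, and/or tone of voice."""
    timbre: Optional[str] = None    # e.g. "adult_male", "girl"
    dialect: Optional[str] = None   # e.g. "sichuan", "northeastern"
    tone: Optional[str] = None      # tone of voice


@dataclass
class SynthesisRequest:
    """Hypothetical payload sent from the game application side."""
    text: str                           # dialogue text typed by the user
    style: Optional[VoiceStyle] = None  # None means "use the default style"


def encode_request(req: SynthesisRequest) -> str:
    # ensure_ascii=False keeps Chinese dialogue text readable in the payload.
    return json.dumps(asdict(req), ensure_ascii=False)
```

When `style` is left as `None`, the receiving side would be free to pick a style at random, which corresponds to the default-style case described above.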

Step 120: natural language processing is performed on the dialogue text received by the speech cloud to obtain the prosodic structure information corresponding to the dialogue text input by the user.

After the speech cloud receives the dialogue text sent by the game application side through the interface, it processes the text in steps that include word segmentation (with part-of-speech tagging and pinyin annotation), numeral and symbol processing, polyphonic character disambiguation, prosodic boundary prediction and tone sandhi processing.

The final result of natural language processing is prosodic structure information stored word by word, which includes phoneme information (such as pinyin and tone), prosodic phrases, prosodic boundaries, stress and similar information.

For example, when the input is "In 2009, the People's Republic of China marked the 60th anniversary of its founding.", processing in this step yields "in 2009 / People's Republic of China / 60th anniversary of the founding.". The result contains three prosodic phrases, along with the corresponding pinyin, tone and other information, and the system then processes the three prosodic phrases in order. When the input text is short, for example "middle", the only prosodic boundary falls at the end of the text, giving "middle/", and the speech cloud likewise processes it as a single prosodic phrase.
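The prosodic structure information produced by step 120 can be modelled as a list of prosodic phrases, each holding per-character pinyin and tone. This is a sketch under assumptions: the class and field names are hypothetical, and the Chinese example characters are illustrative rather than taken from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Syllable:
    char: str     # one Chinese character
    pinyin: str   # pinyin without the tone mark
    tone: int     # 1-4, with 5 used here for the neutral tone


@dataclass
class ProsodicPhrase:
    syllables: List[Syllable] = field(default_factory=list)
    stressed: bool = False

    @property
    def text(self) -> str:
        return "".join(s.char for s in self.syllables)


def mark_boundaries(phrases: List[ProsodicPhrase]) -> str:
    # Writes a "/" at each prosodic boundary, matching the notation in the text.
    return "".join(p.text + "/" for p in phrases)
```

A short input that forms a single phrase then renders with one trailing boundary, as in the "middle/" example above.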

Step 130: from the obtained prosodic structure information, mashup speech data corresponding to the dialogue text is synthesized in combination with the mashup custom speech library of the speech cloud.

Based on the obtained prosodic structure information, such as pinyin, tone, prosodic phrases, prosodic boundaries and stress, one of the several pre-customized speech fragments with the same meaning is called from the mashup custom speech library, either at random or according to the user's preference setting, and the mashup custom speech data is then synthesized.
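Choosing one fragment among several same-meaning recordings, at random or by user preference, can be sketched as follows. The function name is hypothetical, and falling back to a random style when the preferred style has not been recorded is an assumption; the application does not say what happens in that case.

```python
import random
from typing import Iterable, Optional


def choose_style(
    available: Iterable[str],
    preferred: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the style of the fragment to call for one custom text."""
    options = sorted(set(available))
    if not options:
        raise ValueError("no recorded styles for this custom text")
    if preferred is not None and preferred in options:
        return preferred                    # user preference setting
    return (rng or random).choice(options)  # default: random selection
```

Passing a seeded `random.Random` makes the random branch reproducible, which can be convenient when testing.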

In practice, for reasons such as cost, the coverage of the mashup custom speech library cannot be made large enough, so a general speech library is needed to supplement the parts of the language that have not been customized.

In practice, the mashup custom speech library of the speech cloud stores a large number of custom texts and custom speech fragments, where the index of each speech fragment is determined by its corresponding custom text and an additional number. Each custom speech segment is first recorded by a real person reading the custom text, and the recordings are then compressed with a coding method such as G.729 or G.723.

Taking Chinese as an example, the custom text "the weather is very good" can be recorded in different timbres, such as an adult male voice, an adult female voice, a boy's voice or a girl's voice, and in different dialects, such as the Sichuan, Guangdong, Northeastern or Hunan dialect, yielding custom speech fragments that have the same meaning but different styles.

The recorded speech fragments are then compressed with a coding method such as G.729 or G.723 and stored in the mashup custom speech library.

The styles of the speech fragments for a custom text include the various user-selectable styles offered at the application side.
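The index described above, in which a fragment is identified by its custom text plus an additional number, maps naturally onto a dictionary keyed by a pair. The sketch below is illustrative only: the class and method names are hypothetical, the stored bytes stand in for G.729- or G.723-compressed audio, and no real codec is invoked.

```python
from typing import Dict, List, Optional, Tuple


class MashupLibrary:
    """Toy in-memory stand-in for the mashup custom speech library."""

    def __init__(self) -> None:
        # Key: (custom text, additional number); value: compressed fragment.
        self._fragments: Dict[Tuple[str, int], bytes] = {}

    def add(self, custom_text: str, number: int, encoded: bytes) -> None:
        self._fragments[(custom_text, number)] = encoded

    def variants(self, custom_text: str) -> List[int]:
        # All additional numbers recorded for one custom text, i.e. its styles.
        return sorted(n for (t, n) in self._fragments if t == custom_text)

    def get(self, custom_text: str, number: int) -> Optional[bytes]:
        return self._fragments.get((custom_text, number))
```

In this sketch each additional number would correspond to one recorded style (for example one timbre and dialect combination); how numbers are assigned to styles is left open by the application.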

Step 140: the obtained mashup speech data is sent to the game application side through the interface.

The speech cloud then sends the synthesized mashup speech data, in the style the user selected at the game application side, back to the game application side through the interface.

Step 150: the application side plays the received speech data.

The game application side plays the received mashup speech data, so that the user who sent the dialogue text and the users playing together with that user can hear rich and varied voices. For example, when three users are playing "Dou Dizhu" in one room, the dialogue text input by one of them is processed through the preceding steps and then played by the game application side to all users in the room. In this way a user's plain-text dialogue can be expressed in a variety of spoken forms without exposing the user's own voice, which enriches the game environment.

Fig. 2 shows a schematic flowchart of a preferred game dialogue speech synthesis method of the application.

Step 210: the game application side sends the dialogue text input by the user to the speech cloud through an interface.

Step 220: natural language processing is performed on the dialogue text received by the speech cloud to obtain the prosodic structure information corresponding to the dialogue text input by the user.

In step 131, the prosodic phrases in the obtained prosodic structure information are first matched against the pre-customized custom texts in the mashup custom speech library. Matching takes the smallest prosodic phrase as the unit and looks for the longest match with a custom text.

For example, suppose the custom texts in the custom library include "the Chinese people" and "People's Republic of China" but not "in 2009" or "60th anniversary of the founding". When step 220 yields "in 2009 / People's Republic of China / 60th anniversary of the founding.", the system processes the prosodic phrases in text processing order: "in 2009", "People's Republic of China", "60th anniversary of the founding".

The system first performs string comparison matching on them in sequence.

First, a first round of matching is performed on "in 2009"; no custom text is found to match it, so the flow goes to step 133 and general speech is synthesized for it;

Next, a first round of matching is performed on "People's Republic of China", which matches "the Chinese people" with a matching length of 4 (characters in the Chinese text); a second round matches "People's Republic of China" with a matching length of 7; a third round finds no longer match, so matching stops and the final matching result is "People's Republic of China"; the flow goes to step 132 and custom speech is synthesized for it;

Finally, a first round of matching is performed on "60th anniversary of the founding"; no match is found, so the flow goes to step 133 and general speech is synthesized for it.

In practice, each custom text is at least one function word long, and the custom texts in the custom library are sorted in character-encoding order.
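The round-by-round longest matching described above can be sketched as a scan that keeps the longest custom text found at the start of a prosodic phrase. The function name is hypothetical, the Chinese strings are a back-translation of the example and therefore an assumption, and the handling of a phrase that is only partly covered by a custom text is not specified in the application, so the sketch simply returns the longest match it finds.

```python
from typing import Iterable, Optional


def longest_match(phrase: str, custom_texts: Iterable[str]) -> Optional[str]:
    """Return the longest custom text that the phrase starts with, or None."""
    best: Optional[str] = None
    for candidate in custom_texts:
        if not candidate or not phrase.startswith(candidate):
            continue
        if best is None or len(candidate) > len(best):
            best = candidate  # a later "round" found a longer match
    return best


# Assumed Chinese forms of the example: lengths 4 and 7 characters.
CUSTOM_TEXTS = ["中华人民", "中华人民共和国"]
```

With these two custom texts, a phrase equal to the seven-character text matches the shorter entry first and the longer entry second, and the longer one is kept; a phrase with no matching custom text yields `None` and would go to general synthesis in step 133. A production system could exploit the sorted order of the custom texts (for example with binary search or a trie) instead of this linear scan.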

Step 132: custom speech data corresponding to the custom text is synthesized by using the matched custom text in combination with the mashup custom speech library.

This step takes the matched custom text obtained in step 131 and synthesizes custom speech in combination with the mashup custom speech library.

For example, for the matched text "People's Republic of China" obtained in step 131, the custom speech fragment in the mashup custom speech library is called to synthesize the speech data.

The style of the custom speech fragment is selected according to the user-selected dialogue voice style sent by the game application side.

In addition, this step can use the matched custom text to call the speech fragments stored in the mashup custom speech library and then decode the retrieved fragments to obtain the custom speech data.

Step 133: general speech data is synthesized from the unmatched prosodic structure information using the general speech synthesis flow.

For example, general speech data is synthesized for "in 2009" and "60th anniversary of the founding" from step 131 using existing general speech synthesis technology. Any speech synthesis method in the prior art can be used for the general speech synthesis.

Step 134: the custom speech data and the general speech data are combined into mashup speech data in the processing order of the user's input text.

This step receives the synthesized custom speech data and general speech data in the text processing order of the preceding steps, and adjusts and combines them in that order into the complete mashup speech data.

For example, step 134 first receives the general speech data that step 133 synthesized after step 131 made the matching decision on "in 2009" in text processing order;

Next, step 134 receives the custom speech data that step 132 synthesized after step 131 made the matching decision on "People's Republic of China" in text processing order, and joins the speech data of "People's Republic of China" to the previously received speech data of "in 2009";

Then, step 134 receives the general speech data that step 133 synthesized after step 131 made the matching decision on "60th anniversary of the founding" in text processing order, and joins the speech data of "60th anniversary of the founding" to the previously received speech data of "in 2009, People's Republic of China";

Finally, the complete speech data for "In 2009, the People's Republic of China marked the 60th anniversary of its founding." is output, in which "People's Republic of China" is in a style selected at random or according to the user's preference. Of course, the range of custom texts in the mashup custom speech library can be made very broad; for instance, "in 2009" above could also be made a custom text with custom speech fragments recorded in different styles.
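Steps 131 to 134 taken together amount to a loop over the prosodic phrases that picks custom or general speech for each phrase and joins the results in text order. The sketch below makes several simplifying assumptions: the function name is hypothetical, matching is reduced to an exact dictionary lookup, the general synthesizer is passed in as a callable, and audio is represented as raw bytes that are joined directly, which in practice is only valid when all segments share one decoded sample format.

```python
from typing import Callable, Dict, List


def synthesize_mashup(
    phrases: List[str],
    custom_audio: Dict[str, bytes],
    general_tts: Callable[[str], bytes],
) -> bytes:
    parts: List[bytes] = []
    for phrase in phrases:                 # text processing order
        audio = custom_audio.get(phrase)   # step 131: match custom text
        if audio is None:
            audio = general_tts(phrase)    # step 133: general speech
        parts.append(audio)                # otherwise step 132: custom speech
    return b"".join(parts)                 # step 134: join in order
```

For the three-phrase example, the output is the general audio for the first phrase, followed by the custom fragment for the second, followed by the general audio for the third.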

Step 150: the application side plays the received speech data.

The game application side plays the received mashup speech data, so that the user who sent the dialogue text and the users playing together with that user can hear rich and varied voices.

The difference between Fig. 2 and Fig. 1 is that step 130 of Fig. 1 becomes steps 131, 132, 133 and 134 in Fig. 2; the other steps are essentially the same as the steps in the corresponding positions of Fig. 1.

Fig. 3 shows a schematic structural diagram of a game dialogue speech synthesis system of the application.

A game application side and a speech cloud;

The speech cloud is used to provide an interface, to receive the dialogue text and the user-selected voice style or default-style instruction sent by the game application side, and to synthesize the dialogue text into mashup speech data and send it to the game application side.

Preferably, the speech cloud comprises:

The mashup speech synthesis module comprises:

A text matching submodule, a custom speech synthesis submodule, a general speech synthesis submodule, and a speech adjustment and synthesis submodule.

The text matching submodule is used to match the prosodic structure information against the custom texts in the mashup custom speech library, yielding the prosodic structure information that matches a custom text and the prosodic structure information that matches none.

The custom speech synthesis submodule is used to synthesize custom speech data for the prosodic structure information that matches a custom text.

Suitable custom speech data can be selected for synthesis according to information such as a randomly obtained voice style or accent.

The general speech synthesis submodule is used to synthesize general speech data for the prosodic structure information that matches no custom text.

In practice, the game application side carries out the interaction by calling the speech cloud API (Application Programming Interface):

First, the game application side initializes a dialogue speech synthesis instance, putting the speech cloud into a response-initialized state for this game application side;

Then, under this instance, the game application side sets the dialogue synthesis parameters, such as the choice of dialogue style, and sends them to the speech cloud; the speech cloud is still in the response-initialized state at this point;

Next, the game application side sends the dialogue text input by the user to the speech cloud, and the speech cloud synthesizes dialogue speech according to the dialogue text and the parameters and sends the dialogue speech to the game application side;

Finally, the game application side terminates the instance; the speech cloud stops responding and enters a response-ended state for this game application side.
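The four-stage interaction above (initialize an instance, set parameters, send text, terminate) can be sketched as a small client-side object with two states. Everything here is hypothetical: the application names no concrete API calls, so the class, its methods and the injected `synthesize` callable are illustrative stand-ins for whatever the speech cloud actually exposes.

```python
from enum import Enum, auto
from typing import Any, Callable, Dict


class InstanceState(Enum):
    INITIALIZED = auto()  # "response-initialized" state
    ENDED = auto()        # "response-ended" state


class DialogueInstance:
    def __init__(self, synthesize: Callable[[str, Dict[str, Any]], bytes]) -> None:
        self._synthesize = synthesize          # stand-in for the cloud call
        self.params: Dict[str, Any] = {}
        self.state = InstanceState.INITIALIZED

    def set_params(self, **params: Any) -> None:
        self._require_open()
        self.params.update(params)             # e.g. the dialogue style

    def speak(self, text: str) -> bytes:
        self._require_open()
        return self._synthesize(text, dict(self.params))

    def close(self) -> None:
        self.state = InstanceState.ENDED       # the cloud stops responding

    def _require_open(self) -> None:
        if self.state is not InstanceState.INITIALIZED:
            raise RuntimeError("instance has ended")
```

Rejecting calls after `close()` mirrors the idea that the speech cloud no longer responds to this game application side once the instance is terminated.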

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments can be cross-referenced. Because the system embodiment is essentially similar to the method embodiment, it is described briefly; for the relevant parts, see the description of the method embodiment.

The game dialogue speech synthesis method and system provided by the application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is only intended to help in understanding the method of the application and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the application, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the application.

Claims

1. A game dialogue speech synthesis method, characterized in that:

2. The game dialogue speech synthesis method according to claim 1, characterized in that step 130 specifically comprises:

3. The mashup speech synthesis method according to claim 2, characterized in that step 132 comprises:

4. The mashup speech synthesis method according to claim 2, characterized in that:

Each custom text in the mashup custom speech library corresponds to a plurality of speech fragments that have the same meaning as the custom text but differ in style.

5. The game dialogue speech synthesis method according to claim 1, characterized in that:

Step 110 further comprises: the game application side sends the dialogue voice style selected by the user to the speech cloud through the interface.

6. The game dialogue speech synthesis method according to claim 4 or 5, characterized in that:

The dialogue voice style comprises timbre, and/or dialect, and/or tone of voice.

7. The game dialogue speech synthesis method according to claim 2, characterized in that:

Step 130 further comprises: according to the dialogue voice style selected by the user or the default style, synthesizing from the obtained prosodic structure information, in combination with the mashup custom speech library of the speech cloud, the mashup speech data corresponding to the dialogue text.

8. A game dialogue speech synthesis system, characterized by comprising:

A game application side and a speech cloud;

9. The game dialogue speech synthesis system according to claim 8, characterized in that the speech cloud comprises:

10. The game dialogue speech synthesis system according to claim 9, characterized in that the mashup speech synthesis module comprises: