CN1945692B

CN1945692B - Intelligent method for improving prompt tone matching in a speech synthesis system

Info

Publication number: CN1945692B
Application number: CN200610096676A
Authority: CN
Inventors: 王仁华; 刘庆峰; 吴晓如; 严峻; 赵志伟; 熊厚余; 李文兵; 于继栋
Original assignee: ZHONGKEDA XUNFEI INFORMATION SCIENCE & TECHNOLOGY Co Ltd ANHUI PROV
Current assignee: iFlytek Co Ltd
Priority date: 2006-10-16
Filing date: 2006-10-16
Publication date: 2010-05-12
Anticipated expiration: 2026-10-16
Also published as: CN1945692A

Abstract

The invention discloses an intelligent method for improving prompt tone matching in a speech synthesis system. The method includes building a prompt tone library resource and a resource index. While the speech synthesis system synthesizes text, the text to be synthesized is intelligently matched against the prompt tone resources through character-layer matching, pinyin-layer matching, and automatic construction. This completes the conversion from text to prompt tone and increases the usability of prompt tones.

Description

An intelligent method for improving prompt tone matching in a speech synthesis system

Technical field

The present invention relates to a speech synthesis method, and in particular to a method that, in the course of using a computer to convert text into natural speech, intelligently matches the text against prerecorded speech and outputs high-quality speech.

Background technology

At present, in telephone voice systems such as IVR and call centers, a business flow often needs to use both prerecorded prompt tones and synthesized speech. Prerecorded prompt tones are recorded by real speakers; they sound natural, can convey more emotion and style, and give the user a friendly experience. Synthesized speech is clear and accurate, but still falls short of human recordings in intonation and emotion. In practice, prerecorded speech is used to announce the relatively fixed content of a voice service system, typically the greeting on entering the system and the prompts that explain how to operate it. Synthesized speech is used to announce text whose content changes often, carries a large amount of information, and must be synthesized on the fly. Combining prerecorded speech with synthesized speech both meets the need for a human touch in telephone voice services and allows dynamic information to be announced immediately. The current strategy for combining the two in a speech synthesis system is as follows: for the text to be synthesized that the user inputs, the synthesis system first compares it at the character level with the text of each prompt tone in the prompt tone library. If the two match exactly, the audio data of that prompt tone is output; if they differ, the speech synthesis engine synthesizes the text and outputs synthesized speech.

In practice this synthesis strategy has certain defects, mainly the following:

1. Because the text to be synthesized and the prompt tone text must match exactly at the character level, the match fails if the user replaces a syllable in the prompt tone sentence with a different character that has the same pinyin.

For example, if the prompt tone "2, balance inquiry" has been recorded in the sound library and the user inputs the text "two, balance inquiry" (the numeral written as a character), the recorded prompt tone cannot be matched.

2. For information with the same meaning, separate recordings are needed whenever the text differs in full-width versus half-width characters, punctuation marks, or other symbols. In practice the number of prompt tones is huge. If every prompt message must be considered at the character level and recorded in many different character forms, the workload of producing the prompt tone library increases, the production cycle lengthens, and the library becomes highly redundant, wasting resources.

Summary of the invention

The purpose of the present invention is to provide a method for improving the utilization of prompt tones during speech synthesis. It remedies the shortcomings, in practice, of matching prompt tones at the character layer alone and uses the prompt tone library effectively. It both takes full advantage of the natural, fluent quality of human recordings and reduces sound library redundancy, thereby improving the quality of the voice service.

The present invention is achieved by the following technical solutions:

An intelligent method for improving prompt tone matching in a speech synthesis system comprises producing the prompt tone resources required by the synthesis system from prerecorded speech data. Producing the prompt tone resources includes building a prompt tone index file, which contains, for each prompt tone, its name, speaker, character content, and the storage location of its speech data. The user then supplies the text to be synthesized to the synthesis system, which performs character-layer matching: synthesized speech data can be output only when the character content of the text to be synthesized is fully consistent with the character content of a prompt tone. In the present invention, while the synthesis system synthesizes text, the text is additionally processed by pinyin-layer matching and by the intelligent analysis of automatic prompt tone construction. The index file further contains pinyin information and the storage path of the speech data in the automatically built prompt tone sound library.

In the intelligent method for improving prompt tone matching in a speech synthesis system, while the synthesis system synthesizes text, character-layer matching is performed first. If character-layer matching does not succeed, pinyin-layer matching is performed. If pinyin-layer matching does not succeed either, the intelligent analysis of automatic construction is performed.
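The fallback order described above can be sketched as a simple cascade. This is a minimal illustration, not the patented implementation: the function `match_prompt_tone`, the `Matcher` type, and the three stand-in matchers are hypothetical names introduced here, and the stand-ins return fixed bytes rather than real audio.

```python
from typing import Callable, Optional, Sequence, Tuple

# A matcher takes the text to be synthesized and returns audio bytes on
# success or None on failure. (Hypothetical interface, for illustration.)
Matcher = Callable[[str], Optional[bytes]]


def match_prompt_tone(
    text: str, layers: Sequence[Tuple[str, Matcher]]
) -> Tuple[str, Optional[bytes]]:
    """Try each layer in order and stop at the first one that succeeds."""
    for name, matcher in layers:
        audio = matcher(text)
        if audio is not None:
            return name, audio
    return "none", None


# Stand-in matchers; real ones would consult the prompt tone index file.
def character_layer(text: str) -> Optional[bytes]:
    return b"char-audio" if text == "hello" else None


def pinyin_layer(text: str) -> Optional[bytes]:
    return b"pinyin-audio" if text.lower() == "hello" else None


def automatic_construction(text: str) -> Optional[bytes]:
    return b"built-audio" if text else None


LAYERS = [
    ("character", character_layer),
    ("pinyin", pinyin_layer),
    ("automatic construction", automatic_construction),
]

print(match_prompt_tone("HELLO", LAYERS)[0])
```

Keeping the layers in an ordered list makes the priority explicit: a later layer is consulted only when every earlier layer has returned no result.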

In the intelligent method for improving prompt tone matching in a speech synthesis system, during character-layer matching the text to be synthesized and a prompt tone are also treated as successfully matched when any one, two, or all three of the following cases occur:

(1) the character content differs in full-width versus half-width form;

(2) symbols differ in the middle of the text, including single quotation marks, double quotation marks, dashes, slashes, and book title marks;

(3) symbols differ at the end of the text, including the period, exclamation mark, question mark, semicolon, and comma.

In the intelligent method for improving prompt tone matching in a speech synthesis system, pinyin-layer matching means searching the prompt tone index file to determine whether the pinyin information of the text to be synthesized is identical to the pinyin information of some prompt tone. The match is considered successful only when the two are identical. If pinyin-layer matching succeeds, the speech data is extracted from the storage location given in the prompt tone index file and played.

In the intelligent method for improving prompt tone matching in a speech synthesis system, the automatically built prompt tone sound library is produced by using a tool to extract the speech parameter information of the prompt tone resources and storing it as binary files.

In the intelligent method for improving prompt tone matching in a speech synthesis system, if neither character-layer nor pinyin-layer matching succeeds, the synthesis system, based on the information of the text to be synthesized and following a large-corpus synthesis algorithm, selects basic speech units from the automatically built prompt tone sound library, concatenates their waveforms, and finally outputs the speech data for playback.

In the intelligent method for improving prompt tone matching in a speech synthesis system, the basic speech unit is the word.

The beneficial effects of the present invention are as follows:

First, character-layer matching is extended from requiring all characters to be identical to ignoring full-width versus half-width differences, sentence-final punctuation, and other symbols within the sentence. Pinyin-layer matching makes it possible to match texts that sound the same but are written with different characters. Automatic construction, in particular, achieves matching and construction of prompt tones at the word level. These improvements effectively raise the success rate of prompt tone matching, reduce the cost to the user of re-recording prompt tones after slight changes to the text, and greatly reduce the workload of producing the prompt tone library.

Second, when managing prompt speech, the speech synthesis system uses scientific algorithms to ensure natural transitions between prompt speech and synthesized speech, converts speech formats automatically inside the synthesis system, and provides visualization tools that help the user meet special requirements in practice, such as adjusting the relative energy of prompt speech and synthesized speech. These mechanisms better satisfy the flexibility and personalization requirements of practical applications.

In addition, the system provides customized, industry-oriented prompt tone libraries to meet the prompt tone needs of different industries.

Description of drawings

The accompanying drawing is a block diagram of the intelligent prompt tone matching workflow.

Embodiment

Referring to accompanying drawing.

First, the prompt tone resources are produced and placed among the synthesis system's resources. The user then enables the prompt tone function in the speech synthesis system. After the text to be synthesized is input, the synthesis system intelligently matches it, according to the characteristics of the text, against the prompt tone resources in the prompt tone library. The matching proceeds through three layers. The first layer is character-layer matching. The second layer, pinyin-layer matching, handles texts whose characters differ from a prompt tone but whose pinyin is identical. The third layer, automatic prompt tone construction, addresses how to make effective use of existing resources once a large number of prompt tones have been recorded. When any layer matches, the corresponding prompt tone speech data is retrieved and spliced, and the speech data is output.

The intelligent matching of prompt tones is implemented in four steps: producing the basic prompt tone resources, character-layer matching, pinyin-layer matching, and automatic construction.

Step 1: resource production.

Intelligent prompt tone matching involves three resources, specifically:

(1) The prompt tone index file. It records all recorded prompt tone entries, including each prompt tone's name, speaker, character content, pinyin information, and the storage path of its speech data (in the automatically built sound library or in a packed resource).

(2) The automatically built prompt tone sound library. Speech technology professionals use a resource production tool to extract the speech parameter information of the prompt tones and store it as binary files, forming the prompt tone sound library. Because this process is fully automated by machine, the library is called the automatically built prompt tone sound library.

(3) Packed prompt tone resources. Some prompt tone data has not been built into the automatically built sound library, for lack of time or other reasons, yet still needs to be used as a resource. In that case the user can use a tool supplied by the speech technology professionals to pack the prompt tone speech data (for example, wav files) into a packed prompt tone resource. Note that prompt tones in a packed resource can be used only for character-layer and pinyin-layer matching.
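One way to picture an index entry and the restriction on packed resources is the sketch below. The field names, the `ResourceKind` enum, the `usable_layers` method, and the sample values are assumptions made for illustration; the source lists which fields the index file holds but does not specify its on-disk format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ResourceKind(Enum):
    AUTO_BUILT_LIBRARY = "auto_built_library"  # binary speech-parameter library
    PACKED_RESOURCE = "packed_resource"        # packed recordings such as wav


@dataclass(frozen=True)
class PromptToneEntry:
    """One record of the prompt tone index file (hypothetical layout)."""
    name: str
    speaker: str
    text: str             # character content
    pinyin: str           # pinyin information
    resource_kind: ResourceKind
    storage_path: str     # where the speech data is stored

    def usable_layers(self) -> List[str]:
        # Packed resources serve only character-layer and pinyin-layer
        # matching; automatic construction needs the auto-built library.
        layers = ["character", "pinyin"]
        if self.resource_kind is ResourceKind.AUTO_BUILT_LIBRARY:
            layers.append("automatic construction")
        return layers


# Sample values are invented for the example.
packed = PromptToneEntry(
    name="greeting", speaker="speaker_a", text="welcome",
    pinyin="huan1 ying2", resource_kind=ResourceKind.PACKED_RESOURCE,
    storage_path="packed/greeting.wav",
)
print(packed.usable_layers())
```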

Step 2: character-layer matching.

After the user inputs the text to be synthesized, the synthesis system first performs text analysis, including sentence segmentation and character processing. Once text analysis is complete, the system can carry out the first layer of matching: character-layer matching.

The specific flow of character-layer matching is as follows:

The character content of the current text to be synthesized is compared, at the character level, with the character content of each prompt tone to determine whether they are the same. Here, "the same" covers the following cases:

(1) the character content is exactly the same;

(2) the character content differs in full-width versus half-width form;

(3) symbols differ in the middle of the text, including ' (single quotation mark), " " (double quotation marks), - (dash), / (slash), and ＜ (book title mark);

(4) symbols differ at the end of the text, including the period, exclamation mark, question mark, semicolon, and comma.

For example, suppose there is a prompt tone "This incentive scheme's right of final interpretation belongs to China Merchants Bank." If the user inputs any of the following altered texts, the prompt tone is still matched:

(1) This incentive scheme's/right of final interpretation/belongs to China Merchants Bank.

(2) This incentive scheme's right of final interpretation belongs to "China Merchants Bank".

(3) This incentive scheme's right of final interpretation belongs to ＜China Merchants Bank〉.

(4) This incentive scheme's---right of final interpretation belongs to China Merchants Bank?

If the character layer matches, the system obtains the audio data path from the prompt tone index file and, following that path, extracts the speech data from the automatically built prompt tone sound library or from the packed prompt tone resource and plays it.
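The tolerant comparison described for the character layer can be approximated by normalizing both texts to a common key before lookup. The sketch below is one possible reading of those rules, not the patent's code: the helper names, the exact symbol sets, and the sample path `prompts/0001.wav` are assumptions.

```python
from typing import Dict, Optional


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    space (U+3000) to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


# Symbols ignored anywhere in the text: quotes, dashes, slash, title marks.
IGNORED_SYMBOLS = set("'\"-/<>") | set("‘’“”—《》〈〉")
# Symbols ignored at the end of the text (after half-width conversion).
TRAILING_SYMBOLS = ".!?;,。 "


def character_key(text: str) -> str:
    text = to_half_width(text)
    text = "".join(ch for ch in text if ch not in IGNORED_SYMBOLS)
    return text.rstrip(TRAILING_SYMBOLS)


def build_character_index(prompts: Dict[str, str]) -> Dict[str, str]:
    """Map normalized prompt text to the storage path of its audio."""
    return {character_key(text): path for text, path in prompts.items()}


def match_character_layer(text: str, index: Dict[str, str]) -> Optional[str]:
    return index.get(character_key(text))


index = build_character_index(
    {"Final interpretation belongs to China Merchants Bank.": "prompts/0001.wav"}
)
print(match_character_layer(
    'Final interpretation belongs to "China Merchants Bank".', index))
```

Normalizing once when the index is built keeps each lookup to a single dictionary access, instead of comparing the input against every prompt tone text in turn.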

Step 3: pinyin-layer matching.

If the first layer fails to match the text to be synthesized, the system performs prosodic analysis on the text to determine its pinyin information (Hanyu Pinyin for Chinese, phonetic symbols for English). With the pinyin information available, the second layer of matching, pinyin-layer matching, can be carried out.

Pinyin-layer matching condition: the match is considered successful only when the pinyin information of the text to be synthesized is fully consistent with the pinyin information of some prompt tone. For example:

(1) for the prompt tone "press the number sign key to end input", with the key name written out in characters, the user can input "press the # key to end input" and still match;

(2) for the prompt tone "new password length is less than six", the user can input "new password length is less than 6" and still match;

(5) for the prompt tone "please enter the sixteen-digit card number", the user can input "please enter the 16-digit card number" and still match;

(6) for the prompt tone "one-third storehouse trust", with the fraction written out in characters, the user can input "1/3 storehouse trust" and still match;

(7) for the prompt tone "Zhongkeda Xunfei", the user can input the same name with the character for "xun" (meaning "news") replaced by a homophonous character (meaning "fast") and still match.

If pinyin-layer matching succeeds, the synthesis system obtains the audio data path from the prompt tone index file and, following that path, extracts the speech data from the automatically built prompt tone sound library or from the packed prompt tone resource and plays it.
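The pinyin-layer condition amounts to comparing pronunciation sequences instead of characters. In the sketch below, the tiny `TOY_READINGS` table stands in for the prosodic analysis that a real synthesizer would perform; the table, the function names, the sample characters, and the sample paths are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Toy character-to-pinyin table (tone numbers appended). A real system
# would derive readings from prosodic analysis, including context-dependent
# readings of digits and symbols; this table covers only the examples below.
TOY_READINGS = {
    "讯": "xun4", "迅": "xun4", "飞": "fei1",
    "六": "liu4", "6": "liu4", "位": "wei4",
}


def pinyin_key(text: str) -> Tuple[str, ...]:
    """Return the reading of each character; unknown characters are kept
    as-is so that they can only match themselves."""
    return tuple(TOY_READINGS.get(ch, ch) for ch in text)


def build_pinyin_index(prompts: Dict[str, str]) -> Dict[Tuple[str, ...], str]:
    return {pinyin_key(text): path for text, path in prompts.items()}


def match_pinyin_layer(
    text: str, index: Dict[Tuple[str, ...], str]
) -> Optional[str]:
    # Success requires the whole pinyin sequence to be identical.
    return index.get(pinyin_key(text))


pinyin_index = build_pinyin_index(
    {"讯飞": "prompts/0001.wav", "六位": "prompts/0002.wav"}
)
print(match_pinyin_layer("迅飞", pinyin_index))  # homophone of the first prompt
print(match_pinyin_layer("6位", pinyin_index))   # digit instead of a character
```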

Step 4: automatic construction.

If both character-layer matching and pinyin-layer matching fail, the third layer of intelligent matching is needed: automatic construction matching. Automatic construction matching splices speech from the basic speech units in the prompt tone sound library and ensures that the spliced result is better than synthesized speech and close to natural speech.

Automatic prompt tone construction is implemented as follows:

Based on the information of the text to be synthesized and following a large-corpus synthesis algorithm, the synthesis system selects basic speech units from the automatically built prompt tone sound library, concatenates their waveforms, and finally outputs the speech data. Note that, to ensure splicing quality, the basic speech unit is required to be the word rather than the syllable.

Automatic construction matching differs from the first two layers in its unit of matching. The first two layers match whole sentences: when a match succeeds, the audio data of the corresponding prompt tone sentence is retrieved and played directly. Automatic construction matches words: after the words are matched, they must be spliced into a sentence before playback. The user therefore does not need to record additional prompt tones, which lowers the cost of recording speech and makes effective use of existing prompt tone resources.
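The word-level splicing idea can be illustrated with a deliberately simplified sketch. It assumes the text has already been segmented into words and that each word has exactly one recorded waveform, represented here as a list of integer samples. Candidate selection and boundary smoothing, which a large-corpus synthesis algorithm would involve, are omitted, and the names `build_from_words` and `word_inventory` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def build_from_words(
    words: Sequence[str], inventory: Dict[str, List[int]]
) -> Optional[List[int]]:
    """Concatenate the recorded waveform of each word, in order.

    Returns None if any word has no recording, in which case automatic
    construction cannot cover the sentence.
    """
    samples: List[int] = []
    for word in words:
        unit = inventory.get(word)
        if unit is None:
            return None
        samples.extend(unit)
    return samples


# Word-level units extracted from existing prompt tones (toy sample data).
word_inventory = {
    "balance": [1, 2],
    "inquiry": [3],
    "password": [4, 5],
}

print(build_from_words(["password", "inquiry"], word_inventory))
```

The example shows why word units reduce recording effort: "password inquiry" was never recorded as a sentence, yet it can be assembled from words that appear in other prompt tones.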

Claims

1. An intelligent method for improving prompt tone matching in a speech synthesis system, comprising producing the prompt tone resources required by the synthesis system from prerecorded speech data, wherein producing the prompt tone resources comprises building a prompt tone index file containing, for each prompt tone, its name, speaker, character content, and the storage location of its speech data; the user then supplies the text to be synthesized to the synthesis system, and the synthesis system performs character-layer matching, in which synthesized speech data can be output only when the character content of the text to be synthesized is fully consistent with the character content of a prompt tone; characterized in that, while the synthesis system synthesizes the text, the text is further processed by pinyin-layer matching and by the intelligent analysis of automatic prompt tone construction; and the index file further contains pinyin information and the storage path of the speech data in the automatically built prompt tone sound library.

2. The intelligent method for improving prompt tone matching in a speech synthesis system according to claim 1, characterized in that, while the synthesis system synthesizes the text, character-layer matching is performed first; if character-layer matching does not succeed, pinyin-layer matching is performed; and if pinyin-layer matching does not succeed either, the intelligent analysis of automatic construction is performed.

3. The intelligent method for improving prompt tone matching in a speech synthesis system according to claim 1, characterized in that, in the character-layer matching, the text to be synthesized and a prompt tone are also treated as successfully matched when any one, two, or all three of the following three cases occur:

(1) the character content differs in full-width versus half-width form;

4. The intelligent method for improving prompt tone matching in a speech synthesis system according to claim 1, characterized in that the pinyin-layer matching means searching the prompt tone index file to determine whether the pinyin information of the text to be synthesized is identical to the pinyin information of some prompt tone; the match is considered successful only when the two are identical; and if pinyin-layer matching succeeds, the speech data is extracted from the storage location given in the prompt tone index file and played.

5. The intelligent method for improving prompt tone matching in a speech synthesis system according to claim 1, characterized in that the automatically built prompt tone sound library is produced by using a tool to extract the speech parameter information of the prompt tone resources and storing it as binary files.

6. The intelligent method for improving prompt tone matching in a speech synthesis system according to claim 1 or 5, characterized in that, if neither character-layer nor pinyin-layer matching succeeds, the synthesis system, based on the information of the text to be synthesized and following a large-corpus synthesis algorithm, selects basic speech units from the automatically built prompt tone sound library, concatenates their waveforms, and finally outputs the speech data for playback.

7. The intelligent method for improving prompt tone matching in a speech synthesis system according to claim 6, characterized in that the basic speech unit is the word.