CN1118493A

CN1118493A - Language and speech converting system with synchronous fundamental tone waves

Info

Publication number: CN1118493A
Application number: CN 94107920
Authority: CN
Inventors: 吕士楠; 初敏; 关定华
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 1994-08-01
Filing date: 1994-08-01
Publication date: 1996-03-13

Abstract

The Chinese text-to-speech conversion system is composed of a speech library formed by the waveform sample values of all basic syllables and their pitch-synchronous marks, a prosodic rule library containing tone patterns, word stress patterns and sentence intonation patterns, and a conversion unit that takes syllable waveforms from the speech library, adjusts their duration, pitch and loudness, and concatenates them into clear and natural speech. It features simple apparatus, little computation, real-time processing, and clear, natural speech.

Description

Language and speech converting system with synchronous fundamental tone waves

The present invention relates to speech synthesis technology.

At present, speech synthesis techniques at home and abroad fall mainly into two kinds: parametric synthesis and waveform synthesis. Parametric synthesis is the most sound in theory, but it depends too heavily on progress in linguistics and phonetics; because models of speech production are still imperfect, the quality of the synthesized speech remains unsatisfactory. Waveform concatenation can synthesize clear, natural speech in limited-vocabulary applications such as talking clocks or bus stop announcers. But when simple waveform concatenation is used in an unlimited-vocabulary Chinese text-to-speech system, the acoustic parameters of the originally sampled waveforms cannot be changed to suit different contexts, so the naturalness of the synthesized continuous speech is very poor. Even if each isolated Chinese syllable is pronounced very clearly, text read out one character at a time is always hard for listeners to accept. To improve naturalness, the Chinese synthesis system developed by the Information Research Institute of Northern Jiaotong University stores in its speech library, as concatenation units, not only the waveform samples of about 1,300 basic Chinese syllables but also those of nearly 10,000 common words. The speech synthesized by this system is more natural than speech concatenated from isolated syllables, but it is still unsatisfactory, because even within natural speech the pronunciation of a word, and in particular its prosodic features such as pitch, duration and loudness, is strongly influenced by context. Mechanically using the same pronunciation every time sounds unnatural in most contexts. Moreover, the number of Chinese words is very large: building a common-word speech library with even one pronunciation per word is not easy, and developing a large-vocabulary speech library with multiple pronunciations is almost impossible at present. The Chinese synthesis system developed by the Department of Computer Science of Tsinghua University uses the single syllable as its concatenation unit, stores two pronunciations of each syllable, "stressed" and "unstressed", and also stores neutral-tone syllables. Isolated words synthesized with this system sound fairly natural, but the naturalness of its synthesized continuous speech is not high, because it still cannot exercise global control over prosodic features such as pitch, duration and loudness. Simply dividing each syllable into stressed and unstressed readings cannot produce natural intonation, so the naturalness of the synthesized continuous speech still falls short of practical requirements.

In short, among existing Chinese synthesis techniques, parametric synthesis can change prosodic features flexibly, but its parameters are difficult to set and the clarity of the synthesized speech is unsatisfactory. Existing simple waveform concatenation produces syllables or words that sound very good, but because the prosodic features of the synthesized speech cannot be controlled freely, its naturalness is not high and it is difficult for the public to accept.

The object of the present invention is to provide a Chinese text-to-speech system with high clarity and high naturalness. The system is based on the pitch-synchronous waveform overlap-add technique and uses the Chinese single syllable as its synthesis unit. It has a prosodic rule library, containing tone patterns, word stress patterns and sentence intonation patterns, that processes the input text at the word level and the sentence level and thereby plans the prosodic features (pitch, duration and loudness) of every syllable in a sentence as a whole. The system converts Chinese text carrying a small number of prosodic symbols into fluent spoken Chinese in a news-broadcast style, providing a new man-machine interface for computers. It can also be used in automatic public address systems at airports, harbors and stations, in various automatic information inquiry systems, in speaking and reading aids for disabled people, and in office automation systems.

The technical scheme of the present invention is implemented by the following steps.

1. Taking advantage of the strong independence and limited number of Chinese syllables, the syllable is used as the synthesis unit, and a speech library is built from the sampled values of all basic Chinese syllables and their pitch-synchronous marks.

2. A prosodic rule library is built from tone patterns, word stress patterns and sentence intonation patterns. It can adjust prosodic features such as the pitch, duration and loudness of each syllable in a sentence at two levels, the word and the sentence.

3. Using the pitch-synchronous waveform overlap-add technique, the duration, pitch and loudness of the syllable waveforms taken from the speech library are adjusted according to the prosodic rules.

4. The adjusted syllable waveforms are concatenated into sentences, so that the system outputs clear, fluent and natural continuous speech.

In conjunction with the accompanying drawings the present invention is described in detail as follows:

Fig. 1 is a flow chart of the language and speech converting system with synchronous fundamental tone waves. When the system operates, after start switch 1 is turned on, the speaking rate and base pitch are set in initialization module 2; new values need to be set only when the desired speaking rate and base pitch differ from the defaults, which represent normal pronunciation. Synthesis is carried out sentence by sentence, with syllables concatenated into a sentence in sentence buffer A (12) or sentence buffer B (13). The two sentence buffers are used alternately so that speech is output without interruption. When synthesis of each sentence begins, decision device 3 first checks the state of sentence buffer A and sentence buffer B. If either buffer is idle, text input module 4 reads a sentence from Chinese text 5; otherwise the system waits in a loop.

The input Chinese text 5 is a disk file or Chinese text entered directly from the keyboard. To avoid the problem of polyphonic characters, the input text consists of Chinese characters represented in pinyin, together with prosodic symbols. The prosodic symbols include word boundaries, breath-group boundaries, clause boundaries, stress levels and duration adjustment symbols. Of these, word boundaries and clause boundaries are given by writing the pinyin in the standard format; for the other three the system provides default values, so that only a small number of prosodic symbols need to be added to the text where necessary. Text input module 4 first scans the sentence that was read in, separates the Chinese characters from the prosodic symbols, and at the same time determines the sentence type (declarative, interrogative or exclamatory) from the punctuation mark at the end of the sentence. For each Chinese character, in addition to its syllable name and tone mark, the scan also assigns some feature information related to its position in the sentence.

Then, following the order of the syllables in the sentence, the corresponding waveform segment is fetched from speech library 7 into syllable buffer 6 according to the syllable name, and at the same time the basic prosodic parameters are obtained from prosodic rule library 8 according to the feature information of each syllable. In pitch regulator 9, duration regulator 10 and loudness regulator 11, the system plans the pitch, duration and loudness of the syllable at the word level and the sentence level, and uses the pitch-synchronous overlap-add algorithm to adjust the pitch, duration and loudness of the syllable waveform fetched from the speech library to the planned target values. Finally, the adjusted syllable waveforms are written in order into sentence buffer A (12) or sentence buffer B (13), and suitable pauses are added between linguistic units such as words, breath groups and clauses. Decision device 15 checks whether the sentence is finished. If not, control returns to syllable buffer 6 to process the next syllable; otherwise control passes to decision device 16, which checks whether the sentence in the other buffer has finished playing. If it has, control passes to playback module 17, which plays the concatenated data in the sentence buffer through loudspeaker 14 at the set sampling rate; otherwise the system waits in a loop. As soon as a sentence buffer finishes playing it is cleared, so that synthesis of a new sentence can begin.

Playback first starts once the first sentence has been processed. Thereafter, under interrupt control, the data processing of the synthesis process is carried out in the time intervals between consecutive output samples, so that playback and data processing proceed simultaneously. Switches S1 and S2 make the two sentence buffers take turns, which achieves real-time output. In the switch state shown in Fig. 1, sentence buffer A (12) is receiving data and sentence buffer B (13) is playing back; in the opposite state, sentence buffer B (13) receives data and sentence buffer A (12) plays back. Final decision device 18 checks whether the text is finished. If not, control returns to the exit of initialization module 2 and the next sentence is synthesized; otherwise the whole speech synthesis process ends. In this way the single syllables in the speech library are joined into continuous sentences whose prosodic features can be controlled, and the sentences are joined into longer speech, completing the Chinese text-to-speech conversion. Because every syllable is taken from samples of natural speech, the high clarity of the synthesized speech is guaranteed; and because every syllable undergoes prosodic adjustment before concatenation, the synthesized speech as a whole is close to natural speech, which guarantees high naturalness.
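The alternating use of the two sentence buffers can be illustrated with a small timing sketch. This is not the patent's interrupt-driven implementation: the function name `schedule` and the per-sentence synthesis and playback times are hypothetical inputs chosen for illustration. It only models the two waiting conditions described above, namely that synthesis waits for an idle buffer and that playback of a sentence waits until the other buffer has finished playing.

```python
def schedule(synth_times, play_times):
    """Return (buffer, synth_start, play_start, play_end) for each sentence.

    synth_times[i] and play_times[i] are hypothetical durations (in seconds)
    for synthesizing and playing sentence i.
    """
    free_at = {"A": 0.0, "B": 0.0}  # time at which each buffer is cleared
    synth_free = 0.0                # time at which the synthesizer is idle
    last_play_end = 0.0             # time at which the previous sentence stops playing
    plan = []
    for i, (ts, tp) in enumerate(zip(synth_times, play_times)):
        buf = "AB"[i % 2]                            # the buffers take turns
        synth_start = max(synth_free, free_at[buf])  # wait for an idle buffer
        synth_end = synth_start + ts
        play_start = max(synth_end, last_play_end)   # wait for the other buffer
        play_end = play_start + tp
        free_at[buf] = play_end                      # cleared right after playback
        synth_free = synth_end
        last_play_end = play_end
        plan.append((buf, synth_start, play_start, play_end))
    return plan


for row in schedule([1.0, 1.0, 1.0], [3.0, 3.0, 3.0]):
    print(row)
```

As long as synthesizing a sentence takes less time than playing the previous one, each playback starts exactly when the preceding one ends, which is the uninterrupted output the description aims for.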

The methods for building the speech library and the prosodic rule library, and the implementation steps of the pitch-synchronous overlap-add technique, are described in detail below:

1. Method for building the syllable speech library

The speech library in Fig. 1 consists of the sampled values of all basic Chinese syllables, commonly used neutral-tone syllables and er-suffixed syllables, together with the pitch-synchronous marks of all these syllables. The pitch-synchronous marks are provided so that the prosodic features of a syllable can be adjusted with the pitch-synchronous overlap-add technique, and they are a key feature of the speech library in the present invention. The speech library is built as follows:

(1) Compile a recording syllable list that includes all basic Chinese syllables, commonly used neutral-tone syllables and er-suffixed syllables. The syllables in the list are arranged in random order, to avoid adjacent syllables influencing one another when read in sequence.

(2) Engage a speaker with clear articulation, pleasant voice quality, some training and accurate pronunciation. The speaker is asked to read the listed syllables aloud one by one at normal speed and loudness while a recording is made. Recording takes place in a dedicated recording studio; a room reverberation time of about 0.5 second is suitable, and the signal-to-noise ratio must be higher than 30 dB. A high-fidelity microphone and amplifier are used, with a flat response from 20 to 20000 Hz required. The recording is made with a high-quality professional analog tape recorder or a digital audio tape recorder.

(3) Digitize the recorded signal at a sampling rate higher than 8000 Hz with quantization of 8 bits or more to produce the Chinese syllable waveform sample files. Various compression techniques may also be used to reduce the amount of data. During sampling, the intensity of all syllables should be adjusted to lie within a certain range, so that the volumes of the syllables match one another at synthesis time.

(4) Add pitch-synchronous marks to the waveform sample files. The starting point of each pitch period in the voiced segment of a syllable is located automatically by computer or manually, and its position is recorded in the corresponding label file.

2. Method for building the prosodic rule library

The prosodic rules in the present invention are modeled on news broadcasting. The prosodic rule library is built as follows. Several news articles are selected, and a speaker (an announcer) is asked to read them aloud in broadcast style while a recording is made. The variations in pitch, duration and loudness of each syllable in the text are analyzed with instruments (such as the Computerized Speech Lab Model 4300) and, combined with general knowledge of Chinese phonetics, used to formulate prosodic rules for Chinese synthesis. The main rules include:

(1) Tone pattern: under normal stress, the combined pitch-contour patterns of two-character, three-character, four-character and multi-character words for different tone combinations, and the duration ratios of their syllables.

(2) Word stress pattern: the changes in the overall pitch range and loudness of a word under different stress levels.

(3) Sentence intonation pattern: the rules by which the baseline and the pitch range vary within a sentence, including the variation of the baseline within breath groups and clauses and the variation of the pitch range at the end of breath groups and clauses. Different sentence types (declarative, interrogative and exclamatory) are reflected by different sentence-final changes in the baseline and the pitch range. Inserting pauses of different durations between different linguistic units is also part of the sentence intonation pattern.

After revision and verification, the prosodic rules are organized into the library.

3. Implementation steps of the pitch-synchronous overlap-add technique

(1) The original speech waveform is multiplied by a series of pitch-synchronous Hanning windows to obtain a series of overlapping, pitch-synchronous short-time analysis signals. The length of each Hanning window is twice the pitch period.

(2) These short-time signals are modified as necessary to form a series of short-time synthesis signals. In this system, the number of pitch periods that the synthesized waveform should have, and the length of each period, are first determined from the target pitch contour and the target duration. Then, according to the ratio of the number of pitch periods in the original waveform to the number in the synthesized waveform, it is determined which short-time signals in the analysis series should be repeated or deleted, which yields the series of short-time synthesis signals.

(3) The short-time synthesis signals are arranged synchronously with the target pitch periods and overlap-added to obtain the synthesized waveform. The synthesized speech waveform then has the desired pitch contour and duration.
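The three steps above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are not in the source: the target pitch period is constant rather than following a pitch contour, the whole signal is treated as voiced, and the names `hann`, `psola` and the `gain` parameter (a plain multiplier standing in for loudness adjustment) are hypothetical.

```python
import math


def hann(n):
    """Periodic Hanning window of length n; two copies shifted by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def psola(signal, marks, target_period, n_target, gain=1.0):
    """Pitch-synchronous overlap-add with a constant target period (in samples).

    signal: list of samples; marks: sorted indices where pitch periods start;
    n_target: number of pitch periods wanted in the output.
    """
    # Step 1: cut windowed analysis segments, each two pitch periods long.
    segments = []
    for k in range(1, len(marks) - 1):
        period = marks[k + 1] - marks[k]
        start = marks[k] - period
        window = hann(2 * period)
        seg = [
            (signal[start + i] if 0 <= start + i < len(signal) else 0.0) * window[i]
            for i in range(2 * period)
        ]
        segments.append((period, seg))

    # Steps 2 and 3: choose a source segment for every target period
    # (repeating or skipping segments as the ratio requires), then overlap-add
    # the segments at intervals of the target period.
    pad = max(period for period, _ in segments)
    out = [0.0] * (n_target * target_period + 2 * pad)
    for m in range(n_target):
        src = min(len(segments) - 1, m * len(segments) // n_target)
        period, seg = segments[src]
        centre = pad + m * target_period
        for i, value in enumerate(seg):
            out[centre - period + i] += gain * value
    return out[pad:pad + n_target * target_period]


# A 100-sample-period sine, resynthesized with an 80-sample period (higher pitch)
# while keeping roughly the same duration.
sig = [math.sin(2 * math.pi * i / 100) for i in range(1000)]
marks = list(range(0, 1000, 100))
higher = psola(sig, marks, target_period=80, n_target=10)
print(len(higher))
```

When the target period equals the original period and no segment is repeated or dropped, the overlapping windows sum to one and the interior of the original waveform is reproduced, which is a convenient sanity check for a sketch of this kind.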

Embodiment:

The technical scheme of the present invention can be implemented on a PC of 286 class or above or on various types of workstation; a stand-alone system can also be built from dedicated chips. A specific implementation of the technical scheme on a 486 IBM PC compatible equipped with a Sound Blaster card is described here.

As soon as the system starts, it initializes and sets the standard values of the duration of the voiced segment of a syllable, the lower limit of the pitch range and the width of the pitch range to 250 ms, 140 Hz and 80 Hz respectively. The duration and the pitch-range width can be set within 80% to 150% of their default values. The lower limit of the pitch range can be set within 20 Hz above or below its default value.
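The default values and permitted setting ranges above can be written down as a small configuration helper. This is only a sketch: the names `DEFAULTS`, `clamp` and `make_settings` are hypothetical, and clamping out-of-range requests to the nearest limit is an assumption, since the source does not say how such requests are handled.

```python
DEFAULTS = {"duration_ms": 250.0, "range_low_hz": 140.0, "range_width_hz": 80.0}


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def make_settings(duration_ms=None, range_low_hz=None, range_width_hz=None):
    """Return the initial settings, using the defaults for anything not given."""
    d = DEFAULTS
    dur = d["duration_ms"] if duration_ms is None else duration_ms
    low = d["range_low_hz"] if range_low_hz is None else range_low_hz
    width = d["range_width_hz"] if range_width_hz is None else range_width_hz
    return {
        # Duration and pitch-range width: 80% to 150% of the default.
        "duration_ms": clamp(dur, 0.8 * d["duration_ms"], 1.5 * d["duration_ms"]),
        "range_width_hz": clamp(width, 0.8 * d["range_width_hz"], 1.5 * d["range_width_hz"]),
        # Lower limit of the pitch range: within 20 Hz of the default.
        "range_low_hz": clamp(low, d["range_low_hz"] - 20.0, d["range_low_hz"] + 20.0),
    }


print(make_settings(duration_ms=500, range_low_hz=100))
```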

All syllables in the speech library were taken from the pronunciation of a female university student majoring in broadcasting. Recording was carried out in a dedicated laboratory of the Institute of Acoustics; the reverberation time of the recording studio is 0.5 second and the background noise is below 35 dB(A). The speaker was asked to read the randomly ordered single syllables aloud in sequence at normal speaking rate and loudness. The speech signal was fed through a high-fidelity microphone and amplifier to a professional digital audio tape recorder. Data acquisition used 14-bit quantization at a 12 kHz sampling rate.

Composition of the speech library:

All Chinese tonal syllables: 1278

Commonly used neutral-tone syllables: 350

Commonly used er-suffixed syllables: 40

Only the commonly used neutral-tone and er-suffixed syllables are included. The speech library is built to be open, so that required syllables can be added at any time.

Prosodic features are first adjusted word by word; idioms are treated in the same way as words. Based on the length of the word (the number of syllables it contains) and its tone combination, the basic form of the word's pitch contour under normal stress and the duration ratios of its syllables are obtained from the prosodic rule library. If a duration adjustment symbol is present, the duration of each syllable in the word is lengthened or shortened in the specified ratio. The stress-level symbols control the ratio by which the overall pitch range and loudness of the word are compressed or expanded, to meet different degrees of emphasis. Next comes sentence-level prosodic adjustment. The sentence intonation pattern of Chinese is represented by a two-line model consisting of a topline, which joins the upper limit frequencies of the pitch ranges of the words in the sentence, and a baseline, which joins their lower limit frequencies. The topline reflects stress and the baseline reflects rhythm. Within each breath group the baseline declines, that is, it starts high and ends low. Between breath groups, the starting point of each breath group is slightly lower than that of the preceding one, so the baseline also declines within the clause. Clauses behave like breath groups: the baseline starting point of each clause is slightly lower than that of the preceding clause. The pitch range of the last phonetic word in each breath group and each clause should also be narrowed. The baseline trend and the pitch range at the end of the sentence reflect the sentence type. In a declarative sentence, the baseline of the last phonetic word falls markedly and the pitch range also narrows noticeably. In an interrogative sentence, the sentence-final baseline rises and the pitch range narrows. In an exclamatory sentence, the sentence-final baseline falls as in a declarative sentence, but the pitch range expands noticeably. The pause pattern is also an important component of the sentence intonation pattern. Pauses of different durations should be added after words, breath groups, clauses and sentences, so that the rise and fall of the synthesized speech is fully expressed and its naturalness is improved.

Two-character and three-character words occur most frequently in Chinese, so the pitch patterns of all their tone combinations are set individually: 24 patterns for two-character words and 120 patterns for three-character words. Multi-character words have many possible tone combinations but occur infrequently, so the system provides only four patterns for the first character (level tone, rising tone, half third tone and falling tone), seven for middle characters (level tone, rising tone, half third tone, falling tone and three neutral tones) and eight for the last character (level tone, rising tone, half third tone, full third tone, falling tone and three neutral tones). The tone curves in all these pitch patterns are normalized to the pitch range corresponding to this speaker's normal stress, 140 Hz to 260 Hz, and are stored in the prosodic rule library as data tables.

Based on the analysis results and the perceived quality of the synthesized speech, the position duration coefficients, which depend on the position of each syllable in Chinese two-character, three-character, four-character and multi-character words, are set as shown in the following table.

Character position      1       2       3       4
Two-character word      0.95    1.00
Three-character word    0.93    0.84    0.97
Four-character word     0.90    0.77    0.86    0.93

As the table above shows, syllable durations generally shorten as the number of syllables in a word increases. Taking four-character words as the base, the system shortens each syllable's duration by 3% for every additional character; that is, the duration reduction coefficient for multi-character words is:

K = 1 - 0.03*(N - 4)

where N is the word length in characters.

To preserve the basic rhythm of Chinese pronunciation, in which shorter and longer syllables alternate, the position duration coefficient of the first character of a multi-character word is set to 0.90*K, that of the last character to 0.93*K, and those of the middle characters alternate between 0.77*K and 0.86*K.
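The table and the reduction coefficient K can be combined into one lookup, sketched here in Python. The function name `duration_coefficients` is hypothetical, and starting the alternation of middle characters at 0.77*K is an assumption (it matches the order in the four-character row, but the source does not state which value comes first). Single-character words are not covered by the table, so the sketch rejects them rather than guess a value.

```python
# Position duration coefficients for two-, three- and four-character words.
TABLE = {
    2: [0.95, 1.00],
    3: [0.93, 0.84, 0.97],
    4: [0.90, 0.77, 0.86, 0.93],
}


def duration_coefficients(n_chars):
    """Return one duration coefficient per character of an n_chars-long word."""
    if n_chars in TABLE:
        return list(TABLE[n_chars])
    if n_chars < 2:
        raise ValueError("the table covers words of two or more characters")
    k = 1 - 0.03 * (n_chars - 4)  # K = 1 - 0.03*(N - 4)
    # Middle characters alternate between 0.77*K and 0.86*K
    # (starting with 0.77*K is an assumption).
    middle = [(0.77 if i % 2 == 0 else 0.86) * k for i in range(n_chars - 2)]
    return [0.90 * k] + middle + [0.93 * k]


# Five-character word: K = 0.97.
print([round(c, 4) for c in duration_coefficients(5)])
```

Multiplying each coefficient by the standard syllable duration then gives the planned duration of each syllable before any duration adjustment symbol is applied.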

The system provides five levels of word stress: extra-heavy stress, heavy stress, normal, light stress and extra-light stress. Changing the pitch range is the most common way of expressing stress in Chinese: expanding the pitch range by 100% and by 50% represents extra-heavy stress and heavy stress respectively; conversely, compressing the pitch range to 50% and to 25% represents light stress and extra-light stress respectively. Chinese sometimes also uses changes in duration to express stress, and the effects of pitch-range changes and duration changes on stress are complementary. The system therefore provides duration adjustment symbols, so that the syllable durations of particular words in the synthesized speech can be controlled individually.
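The five stress levels amount to a scale factor applied to the pitch range of a word. The sketch below shows one way to apply it to a pitch contour; the level names, the function `apply_stress`, and the choice to keep the lower limit of the range fixed while scaling upward are assumptions, since the source does not say about which point the range is expanded or compressed.

```python
# Pitch-range scale factor for each stress level.
STRESS_SCALE = {
    "extra_heavy": 2.0,   # range expanded by 100%
    "heavy": 1.5,         # range expanded by 50%
    "normal": 1.0,
    "light": 0.5,         # range compressed to 50%
    "extra_light": 0.25,  # range compressed to 25%
}


def apply_stress(contour_hz, low_hz, level):
    """Scale a pitch contour about the lower limit of the pitch range.

    Keeping the lower limit fixed is an assumption made for this sketch.
    """
    scale = STRESS_SCALE[level]
    return [low_hz + (f - low_hz) * scale for f in contour_hz]


print(apply_stress([140.0, 180.0, 220.0], 140.0, "heavy"))
print(apply_stress([140.0, 180.0, 220.0], 140.0, "light"))
```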

The system adds pauses of different durations between different linguistic units to improve the rhythm of the speech. It inserts a pause of 10 ms between words within the same breath group, 100 ms between breath groups within the same clause, 200 ms between clauses, 500 ms after a declarative sentence, and 700 ms after an interrogative or exclamatory sentence.
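The pause rules translate directly into a lookup table of silence lengths. In the sketch below the boundary names and the functions `pause_samples` and `join_with_pauses` are hypothetical; the 12 kHz default is the sampling rate given for this embodiment, and representing a pause as zero-valued samples is an assumption.

```python
# Pause duration in milliseconds after each kind of boundary.
PAUSE_MS = {
    "word": 10,
    "breath_group": 100,
    "clause": 200,
    "declarative_end": 500,
    "interrogative_end": 700,
    "exclamatory_end": 700,
}


def pause_samples(boundary, sample_rate_hz=12000):
    """Number of silent samples to insert after the given boundary."""
    return PAUSE_MS[boundary] * sample_rate_hz // 1000


def join_with_pauses(units, boundaries, sample_rate_hz=12000):
    """Concatenate waveform units, inserting silence after each one.

    units: list of sample lists; boundaries[i] names the boundary after units[i].
    """
    out = []
    for unit, boundary in zip(units, boundaries):
        out.extend(unit)
        out.extend([0] * pause_samples(boundary, sample_rate_hz))
    return out


print(pause_samples("word"), pause_samples("interrogative_end"))
```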

Using the pitch-synchronous waveform overlap-add algorithm, the system adjusts the pitch, duration and loudness of the syllable waveforms taken from the speech library to the target values planned according to the prosodic rule library.

Finally, the synthesized speech is output through a loudspeaker connected to the Sound Blaster card.

Compared with the prior art, the present invention has the following advantages:

1. Syllables are pronounced clearly and naturally. This is because the speech library is taken from natural pronunciation, which overcomes the poor syllable clarity caused by inaccurate parameter settings in parametric synthesis, in particular the tendency for initial consonants to be confused.

2. The naturalness of the synthesized continuous speech is good. The tone patterns, word stress patterns and sentence intonation patterns provided by the prosodic rule library, together with the ability of the pitch-synchronous overlap-add technique to adjust the prosodic features of waveforms, ensure the high naturalness of the synthesized speech.

3. In the Chinese text-to-speech system implemented on a 486 microcomputer according to the technical scheme of the invention, speech clarity reaches 94% in a combined test of syllables, words and sentences, and naturalness scores 7.8 on a ten-point scale, very close to natural speech.

Claims

1. A language and speech converting system with synchronous fundamental tone waves, built on waveform synthesis technology, characterized in that: the system consists of a speech library made up of the sampled values of all basic Chinese syllables and their pitch-synchronous marks, and a prosodic rule library, made up of tone patterns, word stress patterns and sentence intonation patterns, that can adjust prosodic features such as the pitch, duration and loudness of each syllable in a sentence at two levels, the word and the sentence; the pitch-synchronous waveform overlap-add technique is used to adjust the duration, pitch and loudness of the syllable waveforms taken from the speech library according to the prosodic rules; and the adjusted syllable waveforms are concatenated into sentences, so that the system outputs clear, fluent and natural continuous speech.

2. The language and speech converting system with synchronous fundamental tone waves according to claim 1, characterized in that: when the system operates, after the start switch is turned on, the speaking rate and base pitch are set in the initialization module, new values needing to be set only when the desired speaking rate and base pitch differ from the defaults, which represent normal pronunciation; synthesis is carried out sentence by sentence, syllables being concatenated into a sentence in sentence buffer A or sentence buffer B, and the two sentence buffers are used alternately so that speech is output without interruption; when synthesis of each sentence begins, a decision device first checks the state of sentence buffer A and sentence buffer B, and if sentence buffer A or sentence buffer B is idle, the text input module reads a sentence from the Chinese text, otherwise the system waits in a loop; the input Chinese text is a disk file or Chinese text entered directly from the keyboard; to avoid the problem of polyphonic characters, the input text consists of Chinese characters represented in pinyin, together with prosodic symbols; the prosodic symbols include word boundaries, breath-group boundaries, clause boundaries, stress levels and duration adjustment symbols, of which word boundaries and clause boundaries are given by writing the pinyin in the standard format, while for the other three the system provides default values, so that only a small number of prosodic symbols need to be added to the text where necessary; the text input module first scans the sentence that was read in, separates the Chinese characters from the prosodic symbols, and at the same time determines the sentence type (declarative, interrogative or exclamatory) from the punctuation mark at the end of the sentence; for each Chinese character, in addition to its syllable name and tone mark, the scan also assigns some feature information related to its position in the sentence; then, following the order of the syllables in the sentence, the corresponding waveform segment is fetched from the speech library into the syllable buffer according to the syllable name, and at the same time the basic prosodic parameters are obtained from the prosodic rule library according to the feature information of each syllable; in the pitch regulator, the duration regulator and the loudness regulator, the system plans the pitch, duration and loudness of the syllable at the word level and the sentence level, and uses the pitch-synchronous overlap-add algorithm to adjust the pitch, duration and loudness of the syllable waveform fetched from the speech library to the planned target values; finally, the adjusted syllable waveforms are written in order into sentence buffer A or sentence buffer B, and suitable pauses are added between linguistic units such as words, breath groups and clauses; a decision device checks whether the sentence is finished, and if not, control returns to the syllable buffer to process the next syllable, otherwise control passes to a decision device that checks whether the sentence in the other buffer has finished playing; if it has, control passes to the playback module, which plays the concatenated data in the sentence buffer through the loudspeaker at the set sampling rate, otherwise the system waits in a loop; as soon as a sentence buffer finishes playing it is cleared, so that synthesis of a new sentence can begin; playback first starts once the first sentence has been processed, and thereafter, under interrupt control, the data processing of the synthesis process is carried out in the time intervals between consecutive output samples, so that playback and data processing proceed simultaneously; switches S1 and S2 make the two sentence buffers take turns, which achieves real-time output; in the state of switches S1 and S2 shown in the figure, sentence buffer A is receiving data and sentence buffer B is playing back, and in the opposite state sentence buffer B receives data and sentence buffer A plays back; the final decision device checks whether the text is finished, and if not, control returns to the exit of the initialization module and the next sentence is synthesized, otherwise the whole speech synthesis process ends; in this way the single syllables in the speech library are joined into continuous sentences whose prosodic features can be controlled, and the sentences are joined into longer speech, completing the Chinese text-to-speech conversion; because every syllable is taken from samples of natural speech, the high clarity of the synthesized speech is guaranteed, and because every syllable undergoes prosodic adjustment before concatenation, the synthesized speech as a whole is close to natural speech, which guarantees high naturalness.