CN1196531A - Articulation compounding method for computer phonetic signal - Google Patents
- Publication number
- CN1196531A, CN97110082A, CN 97110082
- Authority
- CN
- China
- Prior art keywords
- diphones
- word
- pronunciation
- syllable
- synthetic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
A composing method for computer speech signals in which the transition between two adjacent syllables of an English word, from the middle of the former syllable to the middle of the latter, is used as the diphone from which the pronunciation is composed. Because the transition information between adjacent syllables of an English word is retained to the greatest extent, the synthesized computer speech closely resembles human speech.
Description
Because traditional computers were limited by the speed of their central processing units and the capacity of their storage devices (such as hard disks), the algorithms and the basic synthesis units used for computer speech synthesis were comparatively simple, so the synthesized text-to-speech output differed greatly from the original voice. Although some vendors have designed many new algorithms in order to obtain speech that approaches the effect of the original sound, so far the problem has not been solved thoroughly, and no significant improvement in sound quality has been achieved.
Computer technology, with the rapid progress of related hardware, now provides designers with faster processors and larger storage space. For speech synthesis, designers can therefore adopt more complex synthesis and compression algorithms, and the units used for synthesizing speech can also be larger, so that each unit contains more speech information; computer technology has thus created an excellent design environment. Even so, current speech synthesis still suffers from distortion, a problem caused by the synthesis and compression algorithms used in the speech synthesis technique.
Take the English word "HELLO" as an example. After finding its phonetic transcription <halo>, traditional speech synthesis first cuts out the component phonemes <h>, <a>, <l> and <o> according to the traditional segmentation method, finds their boundaries, and extracts the corresponding recordings from a pronunciation database according to these phonemes. In reality, however, when these phonemes are joined there is no clear boundary between them; because of the mutual influence between adjacent phonemes there is a transition region, and cutting the phonemes at fixed sample points inevitably leaves each phoneme impure. When impure phonemes are concatenated, the result naturally has low clarity, loud noise, a coarse sound and an obviously machine-like quality.
The object of the present invention is therefore to provide a pronunciation synthesis method for computer speech signals that effectively improves the accuracy of word synthesis, makes the synthesized pronunciation closer to that of a real speaker, and effectively increases the speed of synthesizing pronunciation, thereby overcoming the shortcomings of the traditional method described above when performing computer speech synthesis of English words.
The pronunciation synthesis method for computer speech signals of the present invention comprises:
First, the correct pronunciation of a word by a real speaker is fed into a sound receiver, and the speech signal of the word is sampled by an A/D converter to produce the digital speech data of the word;
Next, according to the position of each vowel or consonant and its mutual influence with the preceding and following vowels or consonants, a sound-editing device cuts this data into one or more diphones, each being the transition from the middle of one syllable to the middle of the next syllable;
For each diphone so obtained, an acoustic correction device suitably adjusts the speech signals of the same diphone taken from different words, and the diphone signals are recorded into a pronunciation database, so that the diphones collected in the database are better suited as basic units for synthesizing the speech of various words;
When synthesizing the pronunciation of a word from diphones, the computer first reads in the word, obtains its phonetic transcription by analyzing the word, decomposes the transcription into diphones and converts them into diphone sequence numbers; it then extracts the corresponding digital speech signals from the recorded pronunciation database according to these sequence numbers, decompresses them with a decompression program to obtain the speech signal of each diphone, merges the obtained signals, and applies smoothing to synthesize the correct pronunciation of the word.
Brief description of the drawings:
Fig. 1 is a flow chart of collecting diphone units in the present invention;
Fig. 2 is a schematic diagram illustrating the analysis and composition of diphone units in the present invention;
Fig. 3 is a flow chart of synthesizing word pronunciation from diphone units in the present invention;
Figs. 4 and 5 are the waveform and corresponding energy spectrum of the vowel "O";
Figs. 6 and 7 are the waveform and corresponding energy spectrum of the vowel "O" after pitch-lowering processing.
A preferred embodiment of the present invention will now be described in detail with reference to the accompanying drawings.
The present invention mainly uses diphones as the basic units for synthesizing the pronunciation of English words. A diphone is the transition between two adjacent syllables of an English word, that is, the portion from the middle of the former syllable to the middle of the latter syllable. Taking the word "HELLO" as an example, its phonetic transcription is <halo>, and the transitions between adjacent syllables are as follows, where the symbol * denotes empty sound or silence. Expressed phonetically, the word "HELLO" is composed of the diphones <*h>, <ha>, <al>, <lo> and <o*>.
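The decomposition above can be sketched as follows. This is a minimal illustration only; `split_into_diphones` is a hypothetical helper name, not part of the patent:

```python
def split_into_diphones(phonemes):
    """Split a phoneme sequence into diphones, padding with '*'
    (empty sound / silence) at the word boundaries."""
    padded = ["*"] + list(phonemes) + ["*"]
    # each diphone spans one phoneme boundary
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]

print(split_into_diphones("halo"))  # ['*h', 'ha', 'al', 'lo', 'o*']
```

For "HELLO" with transcription <halo>, this yields exactly the five diphones listed above.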
It can thus be seen that the pronunciation of an English word is composed of diphone units. The method of collecting diphones, shown in Fig. 1, is mainly as follows: a real speaker first reads the word with correct pronunciation into a sound receiver, and the speech signal of the word is sampled by an A/D converter to produce the digital speech data of the word; a sound-editing device then cuts this data according to the method of the invention into the diphones that make up the pronunciation signal of the word. Because the same diphone may still differ slightly in pronunciation between different words, an acoustic correction device suitably adjusts the speech signals of the same diphone taken from different words, so that the resulting diphones are better suited as basic units for synthesizing the speech of various words. Finally, each collected diphone is recorded into a pronunciation database using recording and compression techniques, and the diphones in this database can then be used to synthesize the correct pronunciation of words.
According to the diphone principle described above, the present invention can derive about 1,600 diphones from 80,000 English words and use these diphones to synthesize word pronunciations. Whether the synthesized computer speech approaches the effect of a real voice therefore depends entirely on how these diphones are acquired. How to obtain the required diphones is thus the key that determines the sound quality of the diphone synthesis method of the present invention, so when recording the diphone pronunciation database with speech synthesis and recording techniques, the speed (duration of pronunciation) and the volume of the diphones must be suitably controlled.
The diphone units of the present invention are mainly composed of the most basic vowels and consonants of the English phonetic alphabet, in combinations such as vowel-vowel, vowel-consonant and consonant-vowel. In general, vowels and consonants each have their own pronunciation characteristics: a vowel has larger amplitude, a more regular waveform and a more obvious period, while a consonant has smaller amplitude, an irregular waveform and a less regular period.
Nevertheless, whether consonant or vowel, the amplitude roughly follows a low-high-low variation. Therefore, to ensure that the sampled diphones have sufficient amplitude variation and correlation, the selection of the voice segments used for cutting diphones should proceed according to the following steps (see Fig. 2):
1) First prepare a large-capacity sound bank and derive its corresponding parameter information: phoneme number (PhonemeLabel), pitch level (PitchLevel) and energy level (PowerLevel).
2) Perform a 16th-order LPC spectrum analysis on the sound bank.
3) For the voice segments with the same phoneme number, compute the average spectral characteristic; the resulting mean value AverageK is a weighted sum of the spectrum parameters.
4) Take the voice segment whose spectral characteristic is closest to AverageK as the synthesis-unit data.
5) After the voice segments have been selected, begin cutting the diphones.
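Steps 3 and 4 above can be sketched as follows. This is an assumption-laden illustration: the patent describes AverageK as a weighted sum of the spectrum parameters, whereas this sketch uses a plain mean and a Euclidean distance, and `select_segment` takes precomputed LPC parameter vectors rather than performing the LPC analysis itself:

```python
import numpy as np

def select_segment(spectra):
    """Given LPC spectral parameter vectors, one per candidate voice
    segment of the same phoneme number, return the index of the
    segment whose spectrum is closest to the mean (AverageK)."""
    spectra = np.asarray(spectra, dtype=float)
    average_k = spectra.mean(axis=0)                # mean spectral vector
    dists = np.linalg.norm(spectra - average_k, axis=1)
    return int(np.argmin(dists))                    # closest segment
```

For example, among three candidate segments the one whose spectrum already sits at the average would be selected as the synthesis-unit data.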
When cutting diphones, the following rules must be observed:
1) Cut from the crest of the former syllable to the crest of the latter syllable.
2) Because an English word is spliced from several diphones, the amplitude and length of each diphone must be well matched.
3) To keep the period of each diphone complete, both ends of the cut must fall on a waveform period starting point; that is, both ends of the single phonemes composing the diphone must be period starting points, and the phases of the waveforms at the joint must match. Otherwise, if the preceding phoneme ends rising with a positive rate of change while the following phoneme begins with a negative rate of change, noise will appear.
4) The same syllable in different diphones should have roughly the same period, so that the intonation is unified when these diphones are spliced.
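Rule 3 requires the cut points to fall on waveform period starting points with matching phase. One assumed way to locate such a point is to search for the positive-going zero crossing nearest a desired cut position; the patent does not specify this procedure, so the sketch below is illustrative only:

```python
def nearest_period_start(signal, target):
    """Find the positive-going zero crossing nearest to index `target`,
    usable as a cut point so that a diphone begins and ends at a
    waveform period starting point with consistent phase."""
    best, best_dist = None, None
    for i in range(1, len(signal)):
        if signal[i - 1] < 0 <= signal[i]:   # positive-going zero crossing
            d = abs(i - target)
            if best is None or d < best_dist:
                best, best_dist = i, d
    return best
```

Cutting both diphone ends at points found this way keeps the joint phases aligned, which is what rule 3 demands.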
Compared with the half-phones and single phonemes used traditionally, the diphones of the present invention are cut from the steady portions of the syllables of an English word, so the transition information between the syllables of the word is retained to the greatest extent; the invention can therefore synthesize computer speech for English words that comes closer to a real speaker's pronunciation.
Taking the English word "HELLO" as an example, the diphone cutting of the present invention proceeds as follows:
1) First, find the correct phonetic transcription <halo> of the English word "HELLO";
2) Then, according to the position of each vowel or consonant in <halo> and its mutual influence with the preceding and following vowels or consonants, cut the transcription according to the rules of pronunciation into the segments <*h>, <ha>, <al>, <lo> and <o*>, where the symbol * denotes empty sound or silence; these segments are the diphones referred to in the present invention.
It should be noted in particular that the cut point of each segment is the midpoint of the steady portion of a pure phoneme; therefore, when these segments are spliced for synthesis, the connection is smoother because the joint lies within the same phoneme.
The processing steps of the present invention when synthesizing word pronunciation from diphones are shown in Fig. 3. First, the computer reads in a word, obtains its phonetic transcription by analyzing the word, decomposes the transcription into diphones and converts them into diphone sequence numbers; the computer then retrieves the corresponding encoded speech signals from the recorded pronunciation database according to these sequence numbers. If found, the digital signals are extracted and decompressed with a decompression program to obtain the speech data of the diphones; the obtained speech data are then merged and smoothed to synthesize the correct pronunciation of the word.
For example, let the merged speech signal be S(i). A mean-value smoothing filter is applied to S(i) over three adjacent frames (a frame being one sampling period): the speech signal of the current frame becomes S'(i) = A1·S(p) + A2·S(i) + A3·S(s), where A1, A2 and A3 are weighting coefficients, S(p) is the speech data of the previous frame, and S(s) is the speech data of the next frame.
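The three-frame smoothing above can be sketched as follows. The weighting coefficients A1, A2 and A3 are not given in the patent, so the values 0.25, 0.5 and 0.25 below are illustrative assumptions:

```python
def smooth_frames(frames, a1=0.25, a2=0.5, a3=0.25):
    """Mean-value smoothing over three adjacent frames:
    S'(i) = A1*S(i-1) + A2*S(i) + A3*S(i+1).
    The first and last frames are left unchanged."""
    out = list(frames)
    for i in range(1, len(frames) - 1):
        out[i] = a1 * frames[i - 1] + a2 * frames[i] + a3 * frames[i + 1]
    return out
```

Applied per sample (or per frame value) across the splice, this suppresses discontinuities where diphone signals meet.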
Because the speech signal uses pitch-synchronized differential coding (PSDC, Pitch Synchronized Differential Coding) based on pulse code modulation (PCM), pitch control can easily be realized during synthesis. When the period length of the speech signal is adjusted from Torg to the target length Ttar, a Hamming window W(i) of length T = 2·Torg is used, and the converted signal is S'(i) = W(i)·S(i) + W(T/2 - i)·S(i + a), where a = Ttar - Torg. To avoid degrading the quality of the synthesized speech, the restriction Torg/2 < Ttar < 2·Torg is imposed.
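The period-length conversion formula can be sketched literally as follows, under assumed conventions that the patent does not fix: a standard Hamming window taken as zero outside its support, and out-of-range samples treated as zero:

```python
import math

def hamming(i, T):
    """Hamming window of length T evaluated at sample i (0 outside [0, T))."""
    if 0 <= i < T:
        return 0.54 - 0.46 * math.cos(2 * math.pi * i / (T - 1))
    return 0.0

def retime_period(s, t_org, t_tar):
    """Convert one pitch period of length t_org to length t_tar using
    S'(i) = W(i)*S(i) + W(T/2 - i)*S(i + a), with T = 2*t_org and
    a = t_tar - t_org. The patent restricts t_org/2 < t_tar < 2*t_org."""
    assert t_org / 2 < t_tar < 2 * t_org
    T = 2 * t_org
    a = t_tar - t_org
    out = []
    for i in range(t_tar):
        x1 = s[i] if 0 <= i < len(s) else 0.0
        x2 = s[i + a] if 0 <= i + a < len(s) else 0.0
        out.append(hamming(i, T) * x1 + hamming(T // 2 - i, T) * x2)
    return out
```

The crossfade between the two windowed copies is what preserves the spectral character of the original period while changing its length.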
Figs. 4 and 5 are the waveform and corresponding energy spectrum of the vowel "O".
Figs. 6 and 7 are the waveform and corresponding energy spectrum of the vowel "O" after pitch-lowering processing; comparison with Figs. 4 and 5 shows that the converted signal keeps the speech characteristics of all frequency bands of the original signal, with very little distortion.
Again taking the word "HELLO" with phonetic transcription <halo> as an example, the present invention synthesizes the word pronunciation from diphones as follows:
1) First cut the transcription <halo> into the diphones <*h>, <ha>, <al>, <lo> and <o*>;
2) Then map each diphone to its sequence number in the pronunciation database, for example 12, 19, 23, 33 and 78, and extract the digital speech signals of these diphones from the database;
3) Finally, decompress the extracted digital speech signals with the decompression program to obtain the speech signals of the diphones, merge the obtained signals, and apply smoothing to synthesize the correct pronunciation of the word.
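The retrieve, decompress and concatenate steps can be sketched as follows. The database contents, the zlib codec and the names are placeholders: the patent mentions only a decompression program without naming the compression scheme, and the sequence numbers 12, 19, 23, 33 and 78 follow the "HELLO" example above:

```python
import zlib

# Hypothetical diphone database: sequence number -> compressed PCM bytes.
db = {n: zlib.compress(bytes([n] * 4)) for n in (12, 19, 23, 33, 78)}

def synthesize(seq_numbers):
    """Retrieve, decompress and concatenate the diphone signals;
    smoothing of the joints would follow this step."""
    signal = b""
    for n in seq_numbers:
        signal += zlib.decompress(db[n])   # extract and decompress
    return signal

pcm = synthesize([12, 19, 23, 33, 78])
```

In a real system the concatenated signal would then pass through the three-frame smoothing described earlier before playback.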
The above is only a preferred embodiment of the present invention, and the scope of the claims of the present invention is not limited thereto; modifications and equivalent changes made by those skilled in the art according to the technical content disclosed herein shall not depart from the protection scope of the present invention.
Claims (5)
1. A pronunciation synthesis method for computer speech signals, comprising:
feeding the correct pronunciation of a word by a real speaker into a sound receiver, and sampling the speech signal of the word with an A/D converter to produce the digital speech data of the word;
cutting this data, via a sound-editing device and according to the position of each vowel or consonant and its mutual influence with the preceding and following vowels or consonants, into one or more diphones, each being the transition from the middle of one syllable to the middle of the next syllable;
for each diphone so obtained, suitably adjusting the speech signals of the same diphone taken from different words with an acoustic correction device, and recording the diphone signals into a pronunciation database, so that the diphones collected in the database are better suited as basic units for synthesizing the speech of various words;
when synthesizing the pronunciation of a word from diphones, first reading in the word with a computer, obtaining its phonetic transcription by analyzing the word, decomposing the transcription into diphones and converting them into diphone sequence numbers; then extracting the corresponding digital speech signals from the recorded pronunciation database according to these sequence numbers, decompressing them with a decompression program to obtain the speech signal of each diphone, merging the obtained signals, and applying smoothing to synthesize the correct pronunciation of the word.
2. The pronunciation synthesis method for computer speech signals of claim 1, wherein the diphones are cut from the crest of the former syllable to the crest of the latter syllable.
3. The pronunciation synthesis method for computer speech signals of claim 1, wherein the amplitude and length of said diphones must be well matched.
4. The pronunciation synthesis method for computer speech signals of claim 1, wherein both ends of the single phonemes composing said diphones are waveform period starting points, and the phases of the waveforms at the joint must match.
5. The pronunciation synthesis method for computer speech signals of claim 1, wherein the same syllable in different diphones should have roughly the same period.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 97110082 CN1111811C (en) | 1997-04-14 | 1997-04-14 | Articulation compounding method for computer phonetic signal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1196531A true CN1196531A (en) | 1998-10-21 |
CN1111811C CN1111811C (en) | 2003-06-18 |
Family
ID=5171305
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 97110082 Expired - Fee Related CN1111811C (en) | 1997-04-14 | 1997-04-14 | Articulation compounding method for computer phonetic signal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1111811C (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100353358C (en) * | 1999-03-22 | 2007-12-05 | Lg电子株式会社 | Image apparatus with education function and its controlling method |
CN1667699B (en) * | 2004-03-10 | 2010-06-23 | 微软公司 | Generating large units of graphonemes with mutual information criterion for letter to sound conversion |
CN109389968A (en) * | 2018-09-30 | 2019-02-26 | 平安科技(深圳)有限公司 | Based on double-tone section mashed up waveform concatenation method, apparatus, equipment and storage medium |
CN109389968B (en) * | 2018-09-30 | 2023-08-18 | 平安科技(深圳)有限公司 | Waveform splicing method, device, equipment and storage medium based on double syllable mixing and lapping |
CN112071299A (en) * | 2020-09-09 | 2020-12-11 | 腾讯音乐娱乐科技(深圳)有限公司 | Neural network model training method, audio generation method and device and electronic equipment |
CN112530404A (en) * | 2020-11-30 | 2021-03-19 | 深圳市优必选科技股份有限公司 | Voice synthesis method, voice synthesis device and intelligent equipment |
Also Published As
Publication number | Publication date |
---|---|
CN1111811C (en) | 2003-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1704558B1 (en) | Corpus-based speech synthesis based on segment recombination | |
CN1190236A (en) | Speech synthesizing system and redundancy-reduced waveform database therefor | |
US7930172B2 (en) | Global boundary-centric feature extraction and associated discontinuity metrics | |
US8706488B2 (en) | Methods and apparatus for formant-based voice synthesis | |
US20080091428A1 (en) | Methods and apparatus related to pruning for concatenative text-to-speech synthesis | |
JP3557662B2 (en) | Speech encoding method and speech decoding method, and speech encoding device and speech decoding device | |
US20150262587A1 (en) | Pitch Synchronous Speech Coding Based on Timbre Vectors | |
WO1993018505A1 (en) | Voice transformation system | |
CN101930747A (en) | Method and device for converting voice into mouth shape image | |
EP0970466A2 (en) | Voice conversion system and methodology | |
JP3680374B2 (en) | Speech synthesis method | |
CN1111811C (en) | Articulation compounding method for computer phonetic signal | |
US5715363A (en) | Method and apparatus for processing speech | |
US5452398A (en) | Speech analysis method and device for suppyling data to synthesize speech with diminished spectral distortion at the time of pitch change | |
JPS5827200A (en) | Voice recognition unit | |
CN1089045A (en) | The computer speech of Chinese-character text is monitored and critique system | |
JP3058640B2 (en) | Encoding method | |
D'haes et al. | Discrete cepstrum coefficients as perceptual features | |
JPH03233500A (en) | Voice synthesis system and device used for same | |
TW318238B (en) | Pronunciation synthesization method of computer voice signal | |
JPH07104793A (en) | Encoding device and decoding device for voice | |
CN117316162A (en) | Automatic far and near adjusting technology for sound field in voice | |
Modegi | MIDI encoding method based on variable frame-length analysis and its evaluation of coding precision | |
JPH0756590A (en) | Device and method for voice synthesis and recording medium | |
JPH05127697A (en) | Speech synthesis method by division of linear transfer section of formant |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20030618 Termination date: 20110414 |