CN1196531A - Articulation compounding method for computer phonetic signal - Google Patents

Publication number
CN1196531A
Authority
CN
China
Prior art keywords
diphones
word
pronunciation
syllable
synthetic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 97110082
Other languages
Chinese (zh)
Other versions
CN1111811C (en)
Inventor
张景嵩
曹洪
张金玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inventec Corp
Original Assignee
Inventec Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Corp
Priority to CN 97110082
Publication of CN1196531A
Application granted
Publication of CN1111811C
Anticipated expiration
Expired - Fee Related

Abstract

A method for synthesizing computer speech signals in which the transition between two adjacent syllables of an English word, running from the middle of the former syllable to the middle of the latter, is used as the diphone from which the word's pronunciation is composed. Because the transition information between adjacent syllables of the word is retained as fully as possible, the synthesized speech closely resembles human speech.

Description

Pronunciation synthesis method for computer speech signals
Traditional computers were limited by the speed of the central processing unit and the capacity of storage devices (such as hard disks), so the algorithms and basic synthesis units used for computer speech synthesis were relatively simple. As a result, synthesized text-to-speech output sounded far from the original voice. Although some vendors have devised many new algorithms in an attempt to produce speech closer to the original sound, these have so far neither solved the problem thoroughly nor brought any marked improvement in sound quality.
With the rapid progress of computer hardware, designers now have faster processors and larger storage at their disposal. For speech synthesis this means that designers can adopt more complex synthesis and compression algorithms, and that the units used to synthesize speech can be larger and so carry more speech information. Present-day computer technology therefore provides an excellent design environment. Even so, current speech synthesis techniques still produce distorted speech, and this distortion is caused by the synthesis algorithms and compression algorithms they use.
Take the English word "HELLO" as an example. After finding the phonetic transcription <halo> of the word, a traditional speech synthesis technique first splits it, according to the traditional segmentation method, into the component phonemes <h>, <a>, <l> and <o>, finds the boundaries between them, and extracts the corresponding pronunciations from a pronunciation database. In practice, however, when these phonemes are joined, adjacent phonemes influence one another: there is no sharp boundary between them, only a segment of mutual influence. Cutting phonemes apart at a sampling point therefore inevitably yields impure phonemes, and when impure phonemes are concatenated the result has low naturalness and clarity, high noise, a coarse sound and an obviously mechanical quality.
The object of the present invention is therefore to provide a pronunciation synthesis method for computer speech signals that effectively improves the accuracy of synthesized words, makes their pronunciation closer to that of a human speaker and increases the speed at which pronunciations are synthesized, thereby overcoming the shortcomings of the traditional methods described above when they are applied to the computer speech synthesis of English words.
The pronunciation synthesis method for computer speech signals of the present invention comprises:
First, a human speaker's correct pronunciation of a word is input to a sound receiver, and the speech signal of the word is sampled by an A/D converter to produce the digital speech data of the word;
Using a sound editor, and according to the position of each vowel or consonant and its interaction with the preceding and following vowels or consonants, these data are cut into one or more diphones, each being the transition from the middle of the former syllable to the middle of the latter syllable of two adjacent syllables;
For each diphone so cut, the speech signals of the same diphone in different words are suitably adjusted by a sound correction device, and the speech signal of the diphone is recorded into a pronunciation database, so that the diphones collected in the database are better suited to serve as basic units for synthesizing the speech of various words;
When diphones are used to synthesize the pronunciation of a word, the computer first reads in the word and analyzes it to obtain its phonetic transcription, decomposes the transcription into diphones and converts them to diphone sequence numbers. Using these sequence numbers, the computer extracts the corresponding digital speech signals from the recorded pronunciation database and decompresses them with a decompression program to obtain the speech signal of each diphone. The speech signals obtained are then merged and smoothed to synthesize the correct pronunciation of the word.
Brief description of the drawings:
Fig. 1 is a flowchart of collecting diphone units according to the present invention;
Fig. 2 is a diagram illustrating the analysis and composition of the diphone units of the present invention;
Fig. 3 is a flowchart of synthesizing the pronunciation of a word from diphone units according to the present invention;
Figs. 4 and 5 are the waveform and the corresponding energy spectrum of the vowel "O";
Figs. 6 and 7 are the waveform and the corresponding energy spectrum of the vowel "O" after pitch lowering.
A preferred embodiment of the present invention is described in detail below with reference to the drawings.
The present invention uses the diphone as the basic unit for synthesizing the pronunciation of English words. A diphone here means the transition between two adjacent syllables of an English word, that is, the portion running from the middle of the former syllable to the middle of the latter. Taking the word "HELLO" as an example, its phonetic transcription is <halo>, and the transitions between adjacent syllables of the word are expressed as follows:
Figure A9711008200041
where the symbol * denotes an empty sound or silence. Expressed in phonetic symbols, the word "HELLO" is thus composed of the diphones <*h>, <ha>, <al>, <lo> and <o*>.
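The decomposition just described can be illustrated with a minimal Python sketch. It assumes one character per phonetic symbol, as in the transcription <halo> used in this description; the function name to_diphones is hypothetical and not part of the patent.

```python
SILENCE = "*"  # the description uses * for an empty sound or silence


def to_diphones(phonemes):
    """Pair each phonetic symbol with its successor, padding both ends with silence."""
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]


print(to_diphones("halo"))  # ['*h', 'ha', 'al', 'lo', 'o*']
```

A transcription of n symbols always yields n + 1 diphones, because the leading and trailing silences each contribute one transition.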
It follows that the pronunciation of an English word is composed of diphone units. The method of collecting diphones is shown in Fig. 1. A human speaker first inputs the word, correctly pronounced, to a sound receiver; the speech signal of the word is sampled by an A/D converter to produce the digital speech data of the word; and these data are then cut by a sound editor, according to the method of the invention, into the diphones that make up the word's pronunciation signal. Because the same diphone may still be pronounced somewhat differently in different words, the speech signals of the same diphone in different words are suitably adjusted by a sound correction device, so that the resulting diphones are better suited to serve as basic units for synthesizing the speech of various words. Finally, each collected diphone is recorded into a pronunciation database using recording and compression techniques, and during synthesis the diphones in this database are used to synthesize the correct pronunciation of a word.
According to the diphone principle described above, the present invention can derive about 1600 diphones from 80,000 English words and use them to synthesize the pronunciation of words. Whether the synthesized English speech approaches a human voice therefore depends entirely on how these diphones are collected. How the required diphones are obtained is the key factor determining the synthesis quality of the diphone method of the invention, so when the diphone pronunciation database is recorded using speech synthesis and recording techniques, the speaking rate (duration of pronunciation) and the volume of the diphones must be suitably controlled.
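The description does not say how duration and volume are controlled. As an illustration only, the following Python sketch assumes that volume is measured as the RMS level of the samples and that duration is checked against a target with a tolerance; the function names, the RMS criterion and the tolerance check are all assumptions, not the patented procedure.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def normalize_volume(samples, target_rms):
    """Scale the samples so that their RMS level equals target_rms."""
    level = rms(samples)
    if level == 0:
        return list(samples)  # pure silence cannot be scaled to a level
    gain = target_rms / level
    return [x * gain for x in samples]


def duration_ok(samples, sample_rate, target_s, tolerance_s):
    """Check that the recorded duration is within tolerance of the target."""
    return abs(len(samples) / sample_rate - target_s) <= tolerance_s
```

For example, a recording of 800 samples at 8000 samples per second lasts 0.1 s and would pass a 0.1 s target with a 0.01 s tolerance.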
The diphone units of the present invention are composed of the most basic vowels and consonants of the English phonetic alphabet, in combinations such as consonant-vowel and vowel-consonant. In general, vowels and consonants each have their own pronunciation characteristics: vowels have larger amplitude, more regular waveforms and more evident periodicity, whereas consonants have smaller amplitude, irregular waveforms and less regular periodicity.
For both consonants and vowels, however, the amplitude still broadly follows a course from low to high and then from high to low. To ensure that the sampled diphones have sufficient amplitude variation and correlation, the speech segments from which diphones are cut should be selected according to the following steps (see Fig. 2):
1) First prepare a large-capacity speech database and derive the corresponding parameter information: phoneme label (PhonemeLabel), pitch level (PitchLevel) and energy level (PowerLevel).
2) Perform LPC (16th-order) spectrum analysis on the speech database.
3) For the speech segments with the same phoneme label, calculate the average spectral characteristic; the resulting mean value AverageK is the weighted sum of the individual spectral parameters.
4) Take the speech segment whose spectral characteristic is closest to AverageK as the synthesis unit data.
5) Once the speech segments have been selected, begin cutting the diphones.
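Steps 3) and 4) can be sketched in Python under one possible reading of the text: each segment receives a scalar score equal to the weighted sum of its spectral parameters, AverageK is the mean of those scores over all segments with the same phoneme label, and the segment whose score is nearest to AverageK is kept. The weights, the data layout and the function names are assumptions; the spectral parameters are taken as already computed by the LPC analysis of step 2).

```python
def spectral_score(params, weights):
    """Weighted sum of one segment's spectral parameters."""
    return sum(w * p for w, p in zip(weights, params))


def select_units(segments, weights):
    """Pick, per phoneme label, the segment whose score is closest to AverageK.

    segments: list of (segment_id, phoneme_label, spectral_params) tuples.
    Returns a dict mapping phoneme_label to the chosen segment_id.
    """
    by_label = {}
    for seg_id, label, params in segments:
        by_label.setdefault(label, []).append((seg_id, spectral_score(params, weights)))
    chosen = {}
    for label, scored in by_label.items():
        average_k = sum(score for _, score in scored) / len(scored)
        chosen[label] = min(scored, key=lambda item: abs(item[1] - average_k))[0]
    return chosen


segments = [
    ("s1", "a", [1.0, 1.0]),
    ("s2", "a", [2.0, 2.0]),
    ("s3", "a", [6.0, 6.0]),
    ("s4", "o", [5.0, 1.0]),
]
print(select_units(segments, [0.5, 0.5]))  # {'a': 's2', 'o': 's4'}
```

In the example the scores for label "a" are 1, 2 and 6, so AverageK is 3 and segment s2 is the closest.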
When cutting diphones, the following rules must be observed:
1) Cut from the peak of the former syllable to the peak of the latter syllable.
2) Because an English word is spliced together from several diphones, the amplitudes and lengths of the diphones must be comparable.
3) To keep the periods of a diphone complete, the start and end of the cut must be period starting points of the waveform; that is, both ends of the single phonemes that make up the diphone are period starting points, and where waveforms are joined during splicing their phases must be identical. Otherwise, if the first phoneme ends with a positive rate of change and the second phoneme continues immediately with a negative rate of change, noise will appear.
4) The same syllable in different diphones should have approximately the same period, so that the intonation is uniform when these diphones are spliced.
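Rule 3) can be illustrated with a short Python sketch that moves a tentative cut point to the nearest period starting point. The sketch assumes that a period starting point can be approximated by a rising zero crossing of the waveform; the patent does not define the term this precisely, and the function names are hypothetical.

```python
def rising_zero_crossings(samples):
    """Indices where the waveform passes from negative to non-negative."""
    return [i for i in range(1, len(samples)) if samples[i - 1] < 0 <= samples[i]]


def snap_to_period_start(samples, index):
    """Move a tentative cut point to the nearest rising zero crossing."""
    starts = rising_zero_crossings(samples)
    if not starts:
        return index  # nothing to snap to; keep the tentative point
    return min(starts, key=lambda s: abs(s - index))


def cut_diphone(samples, start, end):
    """Cut between two tentative points after snapping both to period starts."""
    return samples[snap_to_period_start(samples, start):snap_to_period_start(samples, end)]


wave = [0, 1, 0, -1] * 3
print(cut_diphone(wave, 3, 9))  # [0, 1, 0, -1]
```

Because both ends of every cut segment then begin a period with the same slope direction, two segments joined end to end do not produce the abrupt change of sign in the rate of change that rule 3) warns against.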
Compared with the half-phonemes and single phonemes used traditionally, the diphones of the present invention are cut at the steady-state section of each syllable of an English word and therefore retain the transition information between the syllables of the word to the greatest extent. The present invention can thus synthesize English speech that is closer to human pronunciation.
Taking the English word "HELLO" as an example, diphone cutting according to the present invention proceeds as follows:
1) First, find the correct phonetic transcription <halo> of the English word "HELLO";
2) Then, according to the position of each vowel or consonant in the transcription <halo> and its interaction with the preceding and following vowels or consonants, cut it according to the rules of pronunciation into the segments <*h>, <ha>, <al>, <lo> and <o*>, where the symbol * denotes an empty sound or silence. These segments are the diphones referred to in the present invention.
It should be noted in particular that the cut point of each segment lies at the midpoint of the steady-state section of a pure phoneme. When the segments are spliced during synthesis, each join therefore connects two halves of the same phoneme, and the connection is smoother.
When the present invention synthesizes the pronunciation of a word from diphones, the processing steps are as shown in Fig. 3. First, the computer reads in the word and analyzes it to obtain its phonetic transcription, decomposes the transcription into diphones and converts them to diphone sequence numbers. The computer then uses the sequence numbers to retrieve the corresponding coded speech signals from the pronunciation database recorded according to the invention. If they are found, the digital signals are extracted and decompressed by a decompression program to obtain the speech data of the diphones. The speech data obtained are then merged and smoothed to synthesize the correct pronunciation of the word.
For example, let the speech signal obtained by merging these data be S(i), and apply a mean-value smoothing filter to S(i). Three adjacent frames of the signal are used in the calculation (a frame here means one sampling period), and the speech signal of the current frame becomes:
S(i) = A1·S(p) + A2·S(i) + A3·S(s)
where A1, A2 and A3 are weighting coefficients,
S(p) is the speech data of the previous frame, and
S(s) is the speech data of the next frame.
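A minimal Python sketch of this three-frame smoothing follows. The description gives no values for A1, A2 and A3, so the weights 0.25, 0.5 and 0.25 in the example are assumptions, as is the choice to compute every output value from the unsmoothed input and to leave the two end samples unchanged.

```python
def smooth(signal, a1, a2, a3):
    """Three-point weighted smoothing; the first and last samples are kept as-is."""
    out = list(signal)
    for i in range(1, len(signal) - 1):
        # previous, current and next frame of the original (unsmoothed) signal
        out[i] = a1 * signal[i - 1] + a2 * signal[i] + a3 * signal[i + 1]
    return out


print(smooth([0, 4, 0, 4, 0], 0.25, 0.5, 0.25))  # [0, 2.0, 2.0, 2.0, 0]
```

Weights that sum to 1 keep the overall level of the signal unchanged while flattening sample-to-sample jumps such as those at a splice point.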
Because the speech signal is coded with pitch-synchronized differential coding (PSDC), which is based on pulse code modulation (PCM), pitch can be controlled easily during synthesis. When the period length of the speech signal is changed from Torg to the target period length Ttar, a Hamming window W(i) of length T = 2·Torg is used, and the converted signal is S(i) = W(i)·S(i) + W(T/2 - i)·S(i + a), where a = Ttar - Torg. To avoid degrading the quality of the synthesized speech, the restriction Torg/2 < Ttar < 2·Torg is imposed.
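The conversion formula can be transcribed literally into Python as a sketch. Several details are not stated in the description and are assumptions here: the window is taken as the periodic Hamming form 0.54 - 0.46·cos(2πn/T), the formula is applied for 0 ≤ i < T/2 starting at a period starting point, and period lengths are whole numbers of samples. The function names are hypothetical.

```python
import math


def hamming(n, length):
    """Periodic Hamming window value at position n for a window of the given length."""
    return 0.54 - 0.46 * math.cos(2.0 * math.pi * n / length)


def convert_period(signal, start, t_org, t_tar):
    """Apply S'(i) = W(i)*S(i) + W(T/2 - i)*S(i + a) over one original period.

    start is the index in signal of a period starting point; t_org and t_tar
    are the original and target period lengths in samples.
    """
    if not (t_org / 2 < t_tar < 2 * t_org):
        raise ValueError("target period must satisfy Torg/2 < Ttar < 2*Torg")
    window_len = 2 * t_org   # T in the description
    half = window_len // 2   # T/2, equal to t_org
    a = t_tar - t_org
    lowest = start + min(0, a)
    highest = start + half - 1 + max(0, a)
    if lowest < 0 or highest >= len(signal):
        raise IndexError("signal too short for the requested conversion")
    return [
        hamming(i, window_len) * signal[start + i]
        + hamming(half - i, window_len) * signal[start + i + a]
        for i in range(half)
    ]
```

With this window form the two weights W(i) and W(T/2 - i) add up to the constant 1.08 for every i, so the expression behaves as a cross-fade between the signal and a copy of itself shifted by a samples.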
Figs. 4 and 5 show the waveform and the corresponding energy spectrum of the vowel "O".
Figs. 6 and 7 show the waveform and the corresponding energy spectrum of the vowel "O" after pitch lowering. Comparison with Figs. 4 and 5 shows that the converted signal retains the speech characteristics of the original signal in all frequency bands, with very little distortion.
Again taking the word "HELLO" as an example, with the phonetic transcription <halo>, the present invention synthesizes the pronunciation of the word from diphones in the following steps:
1) First split the transcription <halo> into the diphones <*h>, <ha>, <al>, <lo> and <o*>;
2) Then, according to the diphone sequence numbers in the pronunciation database that correspond to these diphones, such as 12, 19, 23, 33 and 78, extract the digital speech signals of these diphones from the database;
3) Then decompress the extracted digital speech signals with the decompression program to obtain the speech signals of the diphones, merge the speech signals obtained, and smooth the result to synthesize the correct pronunciation of the word.
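The three steps above can be sketched in Python as follows. The description does not name the compression scheme, so zlib over 16-bit PCM samples is used purely as a stand-in; the one-to-one pairing of the sequence numbers 12, 19, 23, 33 and 78 with the five diphones in order is inferred from the example; and the database contents are dummy placeholder samples. Smoothing of the merged signal is omitted here.

```python
import struct
import zlib

# Assumed pairing of the example's sequence numbers with its diphones, in order.
SEQUENCE_NUMBERS = {"*h": 12, "ha": 19, "al": 23, "lo": 33, "o*": 78}


def compress(samples):
    """Pack 16-bit little-endian PCM samples and compress them (zlib as a stand-in)."""
    return zlib.compress(struct.pack("<%dh" % len(samples), *samples))


def decompress(blob):
    """Inverse of compress: return the list of 16-bit samples."""
    raw = zlib.decompress(blob)
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))


def to_diphones(phonemes):
    """Split a transcription into diphones, with * marking silence at both ends."""
    padded = ["*"] + list(phonemes) + ["*"]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]


def synthesize(phonemes, database):
    """Look up each diphone by sequence number, decompress it and merge the results."""
    merged = []
    for name in to_diphones(phonemes):
        number = SEQUENCE_NUMBERS[name]  # raises KeyError for an unknown diphone
        merged.extend(decompress(database[number]))
    return merged


# Placeholder database: each diphone holds two dummy samples equal to its number.
database = {n: compress([n, n]) for n in SEQUENCE_NUMBERS.values()}
print(synthesize("halo", database))  # [12, 12, 19, 19, 23, 23, 33, 33, 78, 78]
```

Keying the stored audio by sequence number rather than by diphone name mirrors the description, in which the transcription is first converted to sequence numbers and the database is then queried with those numbers.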
The above is only a preferred embodiment of the present invention, and the scope of the claims of the invention is not limited to it. Modifications and equivalent changes made by those skilled in the art on the basis of the technical content disclosed herein all fall within the scope of protection of the present invention.

Claims (5)

1. A pronunciation synthesis method for computer speech signals, comprising:
First, a human speaker's correct pronunciation of a word is input to a sound receiver, and the speech signal of the word is sampled by an A/D converter to produce the digital speech data of the word;
Using a sound editor, and according to the position of each vowel or consonant and its interaction with the preceding and following vowels or consonants, these data are cut into one or more diphones, each being the transition from the middle of the former syllable to the middle of the latter syllable of two adjacent syllables;
For each diphone so cut, the speech signals of the same diphone in different words are suitably adjusted by a sound correction device, and the speech signal of the diphone is recorded into a pronunciation database, so that the diphones collected in the database are better suited to serve as basic units for synthesizing the speech of various words;
When diphones are used to synthesize the pronunciation of a word, the computer first reads in the word and analyzes it to obtain its phonetic transcription, decomposes the transcription into diphones and converts them to diphone sequence numbers; using these sequence numbers, the computer extracts the corresponding digital speech signals from the recorded pronunciation database and decompresses them with a decompression program to obtain the speech signal of each diphone; the speech signals obtained are then merged and smoothed to synthesize the correct pronunciation of the word.
2. The pronunciation synthesis method for computer speech signals as claimed in claim 1, characterized in that the diphones are cut from the peak of the former syllable to the peak of the latter syllable.
3. The pronunciation synthesis method for computer speech signals as claimed in claim 1, characterized in that the amplitudes and lengths of the diphones must be comparable.
4. The pronunciation synthesis method for computer speech signals as claimed in claim 1, characterized in that both ends of the single phonemes that make up the diphone are period starting points of the waveform, and the phases of the waveforms must be identical where they join.
5. The pronunciation synthesis method for computer speech signals as claimed in claim 1, characterized in that the same syllable in different diphones should have approximately the same period.
CN 97110082 1997-04-14 1997-04-14 Articulation compounding method for computer phonetic signal Expired - Fee Related CN1111811C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 97110082 CN1111811C (en) 1997-04-14 1997-04-14 Articulation compounding method for computer phonetic signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 97110082 CN1111811C (en) 1997-04-14 1997-04-14 Articulation compounding method for computer phonetic signal

Publications (2)

Publication Number Publication Date
CN1196531A true CN1196531A (en) 1998-10-21
CN1111811C CN1111811C (en) 2003-06-18

Family

ID=5171305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 97110082 Expired - Fee Related CN1111811C (en) 1997-04-14 1997-04-14 Articulation compounding method for computer phonetic signal

Country Status (1)

Country Link
CN (1) CN1111811C (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100353358C (en) * 1999-03-22 2007-12-05 Lg电子株式会社 Image apparatus with education function and its controlling method
CN1667699B (en) * 2004-03-10 2010-06-23 微软公司 Generating large units of graphonemes with mutual information criterion for letter to sound conversion
CN109389968A (en) * 2018-09-30 2019-02-26 平安科技(深圳)有限公司 Based on double-tone section mashed up waveform concatenation method, apparatus, equipment and storage medium
CN109389968B (en) * 2018-09-30 2023-08-18 平安科技(深圳)有限公司 Waveform splicing method, device, equipment and storage medium based on double syllable mixing and lapping
CN112071299A (en) * 2020-09-09 2020-12-11 腾讯音乐娱乐科技(深圳)有限公司 Neural network model training method, audio generation method and device and electronic equipment
CN112530404A (en) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 Voice synthesis method, voice synthesis device and intelligent equipment

Also Published As

Publication number Publication date
CN1111811C (en) 2003-06-18

Similar Documents

Publication Publication Date Title
EP1704558B1 (en) Corpus-based speech synthesis based on segment recombination
CN1190236A (en) Speech synthesizing system and redundancy-reduced waveform database therefor
US7930172B2 (en) Global boundary-centric feature extraction and associated discontinuity metrics
US8706488B2 (en) Methods and apparatus for formant-based voice synthesis
US20080091428A1 (en) Methods and apparatus related to pruning for concatenative text-to-speech synthesis
JP3557662B2 (en) Speech encoding method and speech decoding method, and speech encoding device and speech decoding device
US20150262587A1 (en) Pitch Synchronous Speech Coding Based on Timbre Vectors
WO1993018505A1 (en) Voice transformation system
CN101930747A (en) Method and device for converting voice into mouth shape image
EP0970466A2 (en) Voice conversion system and methodology
JP3680374B2 (en) Speech synthesis method
CN1111811C (en) Articulation compounding method for computer phonetic signal
US5715363A (en) Method and apparatus for processing speech
US5452398A (en) Speech analysis method and device for suppyling data to synthesize speech with diminished spectral distortion at the time of pitch change
JPS5827200A (en) Voice recognition unit
CN1089045A (en) The computer speech of Chinese-character text is monitored and critique system
JP3058640B2 (en) Encoding method
D'haes et al. Discrete cepstrum coefficients as perceptual features
JPH03233500A (en) Voice synthesis system and device used for same
TW318238B (en) Pronunciation synthesization method of computer voice signal
JPH07104793A (en) Encoding device and decoding device for voice
CN117316162A (en) Automatic far and near adjusting technology for sound field in voice
Modegi MIDI encoding method based on variable frame-length analysis and its evaluation of coding precision
JPH0756590A (en) Device and method for voice synthesis and recording medium
JPH05127697A (en) Speech synthesis method by division of linear transfer section of formant

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20030618

Termination date: 20110414