CN1122936A - Chinese spoken language distinguishing and synthesis type vocoder - Google Patents

Publication number
CN1122936A
CN1122936A CN94118778A CN94118778A CN1122936A CN 1122936 A CN1122936 A CN 1122936A CN 94118778 A CN94118778 A CN 94118778A CN 94118778 A CN94118778 A CN 94118778A CN 1122936 A CN1122936 A CN 1122936A
Authority
CN
China
Prior art keywords
syllable
parameter
speech
voice
chinese
Legal status
Granted
Application number
CN94118778A
Other languages
Chinese (zh)
Other versions
CN1085367C (en
Inventor
易克初
程俊
Current Assignee
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN94118778A
Publication of CN1122936A
Application granted
Publication of CN1085367C
Legal status: Expired - Fee Related

Abstract

The aim is to build telecommunication equipment that operates at a bit rate below 250 bits/s while outputting high-quality speech. It uses speech recognition and synthesis techniques to encode speech with the syllable as the coding unit. The syllable inventory covers the syllables common in spoken Chinese, and prosodic features such as syllable liaison, pitch contour and intensity contour are used to keep the output speech intelligible. The input method accepts both continuous and intermittent (isolated) speech, and in both cases fluent sentences can be output.

Description

Chinese spoken language distinguishing and synthesis type vocoder
The invention belongs to the field of telecommunication technology and relates in particular to very-low-bit-rate vocoders.
In recent years, research on very-low-bit-rate speech coding at bit rates below 1000 bps (bits per second) has attracted wide attention, because many applications, such as voice communication over short-wave channels and voice mail services in electronic mailboxes, urgently need this technology. If the bit rate of speech can be compressed below 200 bps, it becomes comparable to that of telegraphy, and many novel speech-processing applications become possible.
However, the large body of literature published over the past ten years shows that when speech data is compressed below 400 bps, the output speech quality obtainable with the various coding algorithms based on speech analysis and synthesis is very poor and hardly reaches a level the public can accept. The reason is that the coding unit of such an analysis-synthesis vocoder is one frame or a few frames of the speech signal. A frame is usually a 10 ms to 30 ms segment of signal whose characteristics vary without limit, so encoding it with a finite symbol set means that the recovered speech signal inevitably suffers intolerable distortion.
By contrast, a recognition-synthesis vocoder uses a phonetic unit (also called a speech primitive, such as a phoneme, syllable or word) as the coding unit. The phonemes or syllables of any language form a finite set. In the transmitting part, such a vocoder uses speech recognition to identify and encode the speech primitives; the receiving part resynthesizes speech from the received primitive code string and some additional prosodic information. Very few parameters need to be sent over the channel, and the receiving end synthesizes speech by rule, so speech parameters can be transmitted or stored at an extremely low bit rate while high-quality speech is still recovered.
The Chinese invention patent published on August 6, 1986, "Very-low-bit-rate Chinese recognition vocoder" (patent No. CN85.1.00576A), proposed on this principle the basic idea of building a very-low-bit-rate vocoder that recognizes, encodes and synthesizes continuous Chinese speech on the basis of syllables, initials and finals. That patent, however, has several defects that make it hard to implement, and even a successful implementation could not guarantee highly intelligible output speech. The main defects are:
1. The syllable table covers only the 1300 regular toned syllables of Mandarin. It ignores the special syllables that occur frequently when ordinary users speak Mandarin, such as retroflex (er-suffixed) syllables and neutral-tone syllables, as well as syllables outside the regular set that arise from a user's pronunciation habits or from the language environment. This defect inevitably lowers the naturalness and intelligibility of the output speech, and substituting a wrong syllable for a missing one can even cause errors in the meaning conveyed.
2. It does not consider the important role that prosodic features play in the intelligibility of synthesized sentences, and it omits or ignores some critical prosodic parameters, so high intelligibility of the output speech cannot be guaranteed. For example, no parameter reflecting word boundaries within a sentence is transmitted over the channel, so the receiving end cannot divide the sentence into words. The synthesis unit therefore cannot distinguish two adjacent syllables within the same word from two adjacent syllables belonging to different words, although the coarticulation effects in the two cases differ markedly, and it has to treat every syllable in the sentence alike during synthesis. This seriously damages the clarity and naturalness of the polysyllabic words in the synthesized sentence, and with them the naturalness and intelligibility of the whole sentence. Moreover, a sentence without word division often leaves semantic ambiguity unresolved, whether it is heard by a person or processed by an automatic speech understanding system. Measurements show that the intelligibility of such synthesized speech is below 70%.
3. It considers only continuous speech input. It does not consider that, with continuous input, the recognition rate of a full Chinese syllable recognizer cannot be guaranteed, so that users may have to input speech intermittently, syllable by syllable or word by word. That invention proposes no measure for keeping the output speech fluent and natural in this case.
The first object of the present invention is to provide a new structure for a Chinese spoken-language recognition-synthesis vocoder that overcomes the above defects; the second object is to design a prosodic information processing method for such a vocoder.
The Chinese spoken-language recognition-synthesis vocoder of the present invention consists of a transmitting part, a receiving part and a Chinese syllable table. The transmitting part comprises three units: speech recognition, prosodic analysis and parameter coding. The receiving part comprises three units: parameter decoding, prosodic parameter conversion and speech synthesis, as shown in Fig. 1.
In the transmitting part at the sending end, the input speech is recognized automatically syllable by syllable and encoded as syllable codes; at the same time the prosodic feature parameters of the speech are extracted and compression-coded. After parameter decoding at the receiving end, the syllable code string and the prosodic parameters are used to synthesize the output sentence.
The syllable table is shared by the transmitting and receiving parts. It contains 1866 syllables commonly used in ordinary users' spoken Mandarin: the 1300 regular toned syllables of Mandarin, 332 additional toned syllables arising from speakers' dialect habits or the phonetic environment, 94 retroflex (er-suffixed) syllables and 140 neutral-tone syllables. From the number of a syllable, the table gives its initial, final and tone and whether it is retroflex; conversely, the initial, final, tone and retroflex flag determine the syllable number.
The prosodic information processing unit comprises the prosodic analysis unit of the transmitting part and the prosodic parameter conversion unit of the receiving part. Together with the speech recognition, speech synthesis and parameter coding units, they implement the following prosodic information processing method:
1. The sending end automatically detects a liaison parameter indicating whether the current syllable and the next syllable belong to the same polysyllabic (including disyllabic) word. The receiving end uses this liaison parameter to divide the sentence into words, synthesizes high-quality words using the rules for coarticulation between syllables within a word, and then assembles the sentence.
2. The sending end performs pitch detection frame by frame and compression-codes the pitch contour of each syllable (the curve of pitch period against time). After the receiving end restores the pitch contour, it controls the pitch contour of the synthesized syllable so that it resembles that of the original syllable.
3. The sending end detects the short-time signal intensity (short-time energy or amplitude) frame by frame and compression-codes the amplitude contour of each syllable (the curve of signal intensity against time). After the receiving end restores the amplitude contour, it controls the amplitude contour of each synthesized syllable so that it resembles that of the original input syllable.
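Frame-by-frame pitch detection, as in step 2 above, can be sketched as follows. The description later names center-clipping autocorrelation as the detection algorithm; everything else here is an assumption made for illustration: the function names, the 8 kHz sampling rate, the 30% clipping ratio, the 60 to 400 Hz search range and the voicing threshold are not taken from the patent.

```python
import math

def center_clip(frame, ratio=0.3):
    """Zero samples below ratio * peak and shift the rest toward zero."""
    peak = max(abs(x) for x in frame)
    threshold = ratio * peak
    clipped = []
    for x in frame:
        if x > threshold:
            clipped.append(x - threshold)
        elif x < -threshold:
            clipped.append(x + threshold)
        else:
            clipped.append(0.0)
    return clipped

def pitch_period(frame, sample_rate=8000, f_min=60.0, f_max=400.0, voicing=0.3):
    """Return the pitch period in samples, or 0 if the frame looks unvoiced."""
    if not frame:
        return 0
    clipped = center_clip(frame)
    energy = sum(x * x for x in clipped)
    if energy == 0.0:
        return 0
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(clipped) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation of the clipped frame at this lag.
        value = sum(clipped[n] * clipped[n + lag] for n in range(len(clipped) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # Treat the frame as voiced only if the peak is strong relative to the energy.
    return best_lag if best_value / energy >= voicing else 0

# Hypothetical test frame: 30 ms of a 200 Hz sine sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(240)]
period = pitch_period(frame)
```

Running this over successive frames of a syllable gives a sequence of pitch periods, which is the raw material for the pitch contour that the sending end then compresses.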
The endpoint detection and speech recognition methods in the recognition unit are compatible with two input modes, continuous speech and isolated words (including monosyllables), and the liaison parameter is used to keep the synthesized sentences fluent and natural.
The beneficial effects of the present invention are as follows:
1. The syllable table includes the syllables that occur frequently in ordinary users' spoken Mandarin, which avoids recognition errors caused by an incomplete syllable set; the use of retroflex and neutral-tone syllables markedly improves the naturalness and intelligibility of the synthesized sentences.
2. Both the transmitting and the receiving part of the invention contain a prosodic information processing unit and apply several prosodic processing methods. The transmitting part at the sending end extracts the prosodic feature parameters and sends them in compressed form to the receiving end, where the receiving part uses them to adjust the prosody of the synthesized speech. This plays a critical role in improving the output speech quality of the vocoder. Specifically:
a) Using the liaison parameter raises the intelligibility of synthesized sentences by 21.4 percentage points, from 68.5% to 89.9%. The latter figure is the result of this group's participation in the national evaluation, published in "Intelligent Machine Research Trends", No. 6, 1994. The former figure was measured by this group with the same test method. The listening panel reported that without the liaison parameter the synthesized speech seems to jump out one character at a time, whereas with the liaison parameter sentence naturalness improves markedly.
b) Using the pitch contour information of each syllable markedly improves the naturalness and intonational expressiveness of the output speech and at the same time raises sentence intelligibility to 99.6%; this is a measured result.
c) Using the signal intensity contour of each syllable further improves the naturalness and tonal expressiveness of the output speech.
3. The speech input mode is compatible with isolated-word input, which markedly strengthens the practicality and feasibility of this vocoder. It frees the implementation from the difficulty of full-syllable recognition in continuous speech, lets it rest on currently mature technology, and allows it to accommodate users whose Mandarin is not standard.
The structure of each unit and its embodiment are described further below with reference to the drawings:
Fig. 1 is the overall block diagram of a simplex voice communication system formed by the transmitting part of one recognition-synthesis vocoder and the receiving part of another vocoder of the same kind.
Fig. 2 is the structure diagram of the syllable table.
Fig. 3 is the block diagram of the speech recognition unit.
Fig. 4 is the block diagram of the prosodic analysis unit.
Fig. 5 is the block diagram of the synthesis unit.
1. Syllable table:
The structure of the syllable table in Fig. 1 is shown in Fig. 2. It is a 24 × 35 × 6 three-dimensional array whose three indices are the numbers of the initial, the final and the tone. The initials include a zero initial. Among the finals, the apical vowel of zi, ci, si, zhi, chi, shi and ri is listed as a separate final rather than being counted under the final i. There are originally only 4 tones, but the neutral tone and retroflexion are also taken into account, so the tone axis has 6 elements. This scheme defines 1866 syllables, each with a number; the numbers are stored in the three-dimensional array according to initial, final, tone and retroflexion, and array elements with no corresponding syllable are set to 0. Given the number of a syllable, the table yields its initial, final, tone and retroflex flag directly, and vice versa. This syllable table is indispensable for syllable coding and decoding, and it also serves as the index into the working databases during syllable recognition and speech synthesis.
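The two-way lookup described above can be sketched as a nested list plus a reverse dictionary. This is only an illustration under stated assumptions: the function name `build_table`, the way index triples are supplied, and the three sample entries are hypothetical; the patent gives the array dimensions and the convention that 0 marks an undefined syllable, but not the actual assignment of numbers.

```python
N_INITIALS, N_FINALS, N_TONES = 24, 35, 6  # array dimensions given in the description

def build_table(entries):
    """Build the forward array and the reverse map from (initial, final, tone) triples.

    table[i][f][t] holds the syllable number, with 0 meaning "no such syllable";
    reverse[number] gives back the (i, f, t) triple.
    """
    table = [[[0] * N_TONES for _ in range(N_FINALS)] for _ in range(N_INITIALS)]
    reverse = {}
    for number, (i, f, t) in enumerate(entries, start=1):
        if table[i][f][t] != 0:
            raise ValueError(f"duplicate entry {(i, f, t)}")
        table[i][f][t] = number
        reverse[number] = (i, f, t)
    return table, reverse

# Hypothetical entries; a real table would list all 1866 defined syllables.
table, reverse = build_table([(0, 0, 0), (1, 2, 3), (23, 34, 5)])
number = table[1][2][3]      # encode: indices -> syllable number
indices = reverse[number]    # decode: syllable number -> indices
```

The array has 24 × 35 × 6 = 5040 cells, so most cells stay 0 when only 1866 syllables are defined.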
2. Speech recognition unit:
The structure of the speech recognition unit in Fig. 1 is shown in Fig. 3.
The purpose of the speech recognition unit is to recognize each syllable of the input speech automatically so that it can be encoded. For vocoder use, the vocabulary should be unrestricted and the delay as small as possible, so the recognition unit must recognize all Chinese syllables in near real time.
Driven by the development of Chinese dictation machines, full Chinese syllable recognition technology has partly matured and is improving rapidly, so many proven techniques are available for implementing this scheme. The emphasis of this scheme is on designing the structure and implementation approach for the specific requirements of a Chinese spoken-language recognition-synthesis vocoder, among which one prominent issue is the speech input mode.
In vocoder use, the speech recognition unit is usually assumed to receive continuous natural speech, so the full-syllable recognizer in this scheme is also designed for recognizing all Chinese syllables in continuous speech. However, recognizing syllables in continuous speech is much harder than recognizing isolated syllables. To ensure that all kinds of users can reach a sufficiently high syllable recognition accuracy, this scheme also provides another input mode, isolated-syllable or isolated-word input: the user may speak one character at a time or one word at a time, and if an occasional error still occurs, it can be corrected on the spot with the simplest keyboard operation. The simplest keyboard operation means pressing a digit key to pick the correct syllable from the four candidate syllables in the recognition result and substitute it for the first-choice syllable. Thus, if the first-choice accuracy reaches 90% and the accuracy including the four candidates reaches 99%, a digit key has to be pressed for only 10% of the syllables, and the result reaches 99% accuracy. This is attainable with present technology. When recognition is allowed to take the polysyllabic word as its unit, a word-level language model can be used and the syllable recognition rate is higher, but the delay then grows to the duration of one word, which is unsuitable for full-duplex voice communication. Nevertheless, the distinctive feature of a recognition-synthesis vocoder is its exceptionally low bit rate together with sound quality that can be improved freely, so it can be widely used in half-duplex or simplex mode.
The structure in Fig. 3 is an embodiment of the speech recognition unit compatible with the above two speech input modes. Its hardware consists of a subsystem of one or two 32-bit high-speed digital signal processors working in parallel, together with peripherals such as a preamplifier, an anti-aliasing filter, D/A and A/D converters, a display and a simple keyboard. Alternatively, everything except the display and keyboard can be built as a plug-in card for a personal computer and used within the personal computer's operating environment.
According to Fig. 3, the input speech signal is preprocessed by preamplification, anti-aliasing filtering, analog-to-digital conversion and pre-emphasis. Once the real-time endpoint detection method detects speech, feature extraction begins: the extracted pitch parameters are used for tone recognition, and the extracted sequence of acoustic feature vectors is used for similarity calculation to decide which toneless syllable the current syllable belongs to.
Similarity can be calculated with a composite hidden Markov model (HMM) method, a weighted dynamic time warping (DTW) method or a neural network (NN) method. The most mature of these at present is the HMM method, but neural networks still have great potential and work very well in particular when combined with HMMs. This scheme composes each full-syllable model from a front half-syllable model and a rear half-syllable model. The main advantage over independent full-syllable models is that the parameter storage, the similarity computation, the training cost and the number of training samples required can all be reduced by more than an order of magnitude without lowering recognition accuracy. This is the great advantage brought by the regular structure of Chinese syllables.
Tone recognition uses the HMM method; experiments have shown a correct recognition rate above 97%.
The choice of speech input mode mainly determines the endpoint detection method. The three input modes are described in detail below:
1. Continuous speech input mode: implemented with endpoint detection method 1. It uses a multi-threshold zero-crossing-rate method to detect the start of a sentence or phrase, then switches to the syllable segmentation algorithm while acoustic and prosodic features are extracted in real time. Using short-time energy and the voiced/unvoiced decision, the syllable segmentation algorithm continually judges whether the previous syllable has ended and the next has begun. As soon as the end point of the current syllable is found, it notifies the other high-speed signal processing subsystem, which works in parallel, to carry out the similarity calculation. During automatic syllable segmentation it also checks whether the end of the sentence or phrase has been reached; once it has, the whole process starts again.
2. Isolated-syllable or isolated-word input mode: implemented with endpoint detection method 2. It likewise uses the multi-threshold zero-crossing-rate method to detect the start of each isolated syllable or word, then switches to end-point detection while acoustic and prosodic feature extraction begins. For end-point detection, short-time energy alone is enough to judge the end of a syllable well. Once an end point is found, detection continues to see whether speech starts again; if no speech appears before the pause exceeds a threshold ΔT0 (we take ΔT0 = 0.2 s), the syllable or word is judged to have ended and the similarity calculation unit is notified to compute and recognize immediately. To improve real-time performance, the similarity calculation can of course also proceed during feature extraction. The algorithm automatically determines whether the current input is a monosyllabic word or a polysyllabic word, and so decides which recognition method to use.
3. Semi-continuous polysyllabic-word input mode: also implemented with endpoint detection method 2. Polysyllabic-word recognition has the advantage that a word-level statistical language model raises the recognition rate, but the disadvantage of a longer delay. In practice, therefore, the polysyllabic-word recognizer may be left unused to shorten the delay and only the isolated-syllable recognizer used, while speech is still input as single syllables or polysyllabic words. The speaker only needs to pause slightly between adjacent syllables of a polysyllabic word, for no longer than the threshold ΔT0 (here we take ΔT0 = 0.25 s), and to keep the pause between two different words always longer than ΔT0. Syllables within a polysyllabic word can then be treated essentially like isolated syllables, and the liaison parameter can be detected automatically.
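The pause rule used by endpoint detection method 2 can be sketched on a sequence of per-frame short-time energies. This is a simplification under stated assumptions: a single energy threshold stands in for both the multi-threshold zero-crossing start detector and the energy-based end detector, and the function name, the 10 ms frame length and the threshold value are hypothetical. Only the 0.2 s pause limit comes from the text.

```python
def segment_units(energies, frame_s=0.01, energy_threshold=1.0, pause_s=0.2):
    """Split a frame-energy sequence into spoken units separated by long pauses.

    Returns (start_frame, end_frame) pairs with end_frame exclusive. A unit is
    closed only after silence has lasted longer than pause_s, so short gaps
    between the syllables of one word stay inside the same unit.
    """
    max_gap = round(pause_s / frame_s)  # silent frames tolerated inside a unit
    units = []
    start = None         # first speech frame of the unit being built
    last_speech = None   # most recent frame whose energy reached the threshold
    for i, energy in enumerate(energies):
        if energy >= energy_threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > max_gap:
            units.append((start, last_speech + 1))
            start = None
    if start is not None:
        units.append((start, last_speech + 1))
    return units

# Hypothetical energies: two syllables 0.1 s apart, then a 0.3 s pause, then one syllable.
energies = [0] * 5 + [5] * 10 + [0] * 10 + [5] * 10 + [0] * 30 + [5] * 10 + [0] * 5
units = segment_units(energies)
```

With these values the first two bursts are merged into one unit because their gap is shorter than the pause limit, while the third burst forms a unit of its own.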
Note that the half-syllable models used in the three cases differ. The continuous speech input mode needs more half-syllable models with a finer division; the second input mode needs fewer; the third input mode is essentially the same as ordinary isolated full-syllable recognition, for which about 100 front and about 100 rear half-syllable models suffice. Experiments show that half-syllable models work better than initial and final models: because initials, finals and tones influence one another, each initial and each final is divided into several subclasses according to acoustic similarity, each subclass yields one half-syllable model, and the statistical properties are more stable.
3. Prosodic information processing unit:
The prosodic information processing unit covers the prosodic analysis, parameter coding, parameter decoding and prosodic parameter conversion units of Fig. 1 and, in cooperation with the speech recognition unit and the speech synthesis unit, carries out the extraction, compression coding and use of the prosodic feature parameters.
Fig. 4 is the block diagram of the prosodic analysis unit. It extracts four main prosodic feature parameters: the duration parameter, the pitch parameter, the intensity parameter and the liaison parameter. The role, extraction method, compression coding, transmission and use of each are described below:
1. Duration parameter: code length 6 bits. It expresses the duration of the current syllable in frames and is obtained from the endpoint detection or syllable segmentation method. The durations of the voiced and unvoiced segments of the current syllable can also be obtained from the voiced/unvoiced decision made during pitch detection. The synthesis unit mainly keeps the voiced-segment duration consistent with the input syllable, while the unvoiced-segment duration is adjusted by word-formation rules.
2. Pitch parameter: the curve over time of the pitch period within the voiced segment of a syllable (the pitch contour for short); it is decisive for tone. The pitch contour of a toned syllable is not fixed, however: it is affected by the phonetic environment and by intonation, and in polysyllabic words especially, coarticulation with adjacent syllables can change the contour markedly, even into that of another tone. The pitch contour of each syllable is therefore an important prosodic parameter that is decisive for speech quality. This scheme detects the pitch period with one of the most common algorithms, center-clipping autocorrelation, and compression-codes the pitch contour with adaptive delta modulation or vector quantization; alternatively, only the minimum and maximum of the pitch period are sent, to control the pitch range. The pitch parameter is used to make the pitch contour of the synthesized speech resemble that of the input syllable.
3. Intensity parameter: two parameters can reflect intensity, short-time energy and short-time signal amplitude; this scheme chooses one according to the type of synthesizer. One intensity value is computed per frame, and the values over the whole syllable form a smooth curve of intensity over time, called the intensity contour for short. The intensity contour of a syllable has some effect on the perceived tone and some influence on sentence fluency. This scheme provides for transmitting the intensity contour with a compression coding method, but to reduce the number of coded bits in applications where the speech quality requirement is not very high, only one typical intensity value is sent. In the synthesis unit this parameter controls the intensity of the synthesized syllable.
4. Liaison parameter: code length 1 bit. It indicates whether the current syllable and the next syllable are joined into one word. This scheme obtains this special prosodic parameter by automatic detection: using the endpoint detection and syllable segmentation methods, it computes the duration of the pause between the current syllable and the next. When this pause is shorter than a threshold ΔT0 (this scheme chooses ΔT0 = 0.2 to 0.25 s), the next syllable is considered to belong to the same word as the current one and the liaison parameter is set to 1; otherwise they are considered to belong to different words and it is set to 0. The liaison parameter has two functions in the synthesis unit: a) selecting the rules for coarticulation between syllables; b) controlling whether there is a pause between syllables and how long it lasts.
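The rule for the liaison (sound-connection) parameter in item 4 reduces to a threshold test on the pause between consecutive syllables. The sketch below assumes the syllable boundaries are already available as start and end times in seconds; the function name and the choice to give the final syllable a 0 bit are assumptions, while the 0.2 s threshold is one of the values quoted in the text.

```python
def liaison_bits(syllables, pause_threshold_s=0.2):
    """Return one liaison bit per syllable from ordered (start_s, end_s) intervals.

    A bit is 1 when the pause before the next syllable is shorter than the
    threshold (same word) and 0 otherwise (word boundary).
    """
    bits = []
    for current, following in zip(syllables, syllables[1:]):
        pause = following[0] - current[1]
        bits.append(1 if pause < pause_threshold_s else 0)
    if syllables:
        bits.append(0)  # assumption: the last syllable has no successor to join
    return bits

# Hypothetical timings: a two-syllable word followed by a one-syllable word.
bits = liaison_bits([(0.0, 0.25), (0.30, 0.55), (1.0, 1.2)])
```

On the receiving side, a run of syllables ending in a 0 bit can be grouped into one word before the coarticulation rules are applied.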
4. Speech synthesis unit:
The structure of the speech synthesis unit in Fig. 1 is shown in Fig. 5.
The speech synthesis unit is the main body of the receiving part of this vocoder. It must use the parameters sent by the transmitting end to synthesize unrestricted-vocabulary Chinese speech in near real time.
In essence, the speech synthesis unit is an unlimited-vocabulary Chinese speech synthesizer whose synthesis unit is the syllable or half-syllable. A speech library of common phrases may of course also be provided so that phrases are synthesized directly, further improving the naturalness of common expressions. As noted above, the parameters sent by the transmitting end include not only the code of each syllable in the sentence but also prosodic parameters that are decisive for the intelligibility and naturalness of the synthesized sentence. As long as the synthesis method makes full use of this information, the output speech quality can be improved freely, without being limited by the channel.
The minimum requirement for the speech synthesis unit is that it must synthesize all the Chinese syllables common in spoken Chinese described above, comprising the 1632 toned syllables and the commonly used retroflex and neutral-tone syllables.
To guarantee high intelligibility (>90%) of the output speech, the synthesis unit needs basic prosody adjustment capabilities: 1. the intensity and duration of any synthesized syllable can be changed freely while the syllable keeps its high clarity and good naturalness; 2. polysyllabic words (including disyllabic words) are synthesized with some account of coarticulation between syllables, so that the clarity of commonly used synthesized polysyllabic words exceeds 90% and their naturalness exceeds 8.0 points.
To obtain output speech with still higher intelligibility and better naturalness, the synthesizer must, besides meeting all the above requirements, have a higher level of prosody adjustment capability: 1. the pitch contour of a synthesized syllable can be changed freely while the syllable keeps its high clarity and naturalness; 2. the formant trajectories of the final of a synthesized syllable can be adjusted, or a sufficient set of final-to-initial transition segments is available to choose from, to handle coarticulation between syllables. Both capabilities are crucial for synthesizing high-quality polysyllabic words, improving sentence naturalness and expressing intonation. The synthesis method may be pitch-synchronous overlap-add, formant synthesis or linear prediction synthesis.
Fig. 5 shows a block diagram of an unlimited-vocabulary Chinese synthesizer that satisfies the above requirements.
In this synthesizer scheme, the full-syllable parameter database can be stored in two ways. In the first, each syllable is stored as a whole. In the second, syllables are decomposed into a number of shared front half-syllables and back half-syllables, which are stored separately and combined into full syllables by rule at synthesis time. The former helps guarantee syllable synthesis quality; the latter effectively reduces the storage requirement. In addition, to synthesize polysyllabic words better, synthesis parameters for a number of coarticulation transition segments are also stored. These are extracted from disyllabic words in which the tail of the first syllable is influenced by the following syllable so that its formant characteristics vary noticeably. The prosodic parameter conversion unit in the figure converts the prosodic parameters produced by the decoding unit into various control signals according to prosody adjustment rules; these signals act during the synthesis of words and sentences.
5. Encoding and decoding unit:
1. Syllable coding: if the coding unit encodes the speech recognition result in terms of initial, final, and tone, each syllable needs 14 bits. If it instead encodes by whole syllable using the syllable table described above, each syllable needs only 11 bits, saving 3 bits.
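The two bit counts follow from the size of each symbol inventory. The following is a minimal Python sketch of that arithmetic. The inventory sizes (22 initials including the zero initial, 38 finals, 5 tones including the neutral tone) and the table capacity of 2048 are illustrative assumptions; the text itself gives only the totals of 14 and 11 bits.

```python
def bits_needed(n_symbols: int) -> int:
    """Smallest number of bits that can index n_symbols distinct codes."""
    if n_symbols < 1:
        raise ValueError("n_symbols must be positive")
    return max(1, (n_symbols - 1).bit_length())


# Assumed inventory sizes (not stated in the text).
N_INITIALS = 22   # 21 initials plus the zero initial
N_FINALS = 38
N_TONES = 5       # four tones plus the neutral tone

# Assumed table capacity: any table of 1025 to 2048 entries needs 11 bits.
SYLLABLE_TABLE_SIZE = 2048

# Coding initial, final, and tone as three separate fields.
component_bits = (
    bits_needed(N_INITIALS) + bits_needed(N_FINALS) + bits_needed(N_TONES)
)

# Coding the whole syllable as one index into the syllable table.
table_bits = bits_needed(SYLLABLE_TABLE_SIZE)

print(component_bits, table_bits, component_bits - table_bits)
```

Under these assumptions the script prints 14 11 3, matching the 3-bit saving described above. The saving arises because many initial, final, and tone combinations do not occur as real syllables, so a single table index wastes fewer codes.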
2. Coding of prosodic parameters:
The coding of prosodic parameters is highly flexible and directly determines the performance that the recognition-synthesis vocoder achieves. Two typical embodiments illustrate the effect of the coding:
A) A lowest-bit-rate coding embodiment:
Here, lowest bit rate means the lowest bit rate at which this vocoder implementation can still guarantee an output sentence intelligibility above 90%. To compress the bit rate as far as possible, the two most critical of the four prosodic parameters above are selected. One is the syllable liaison parameter, with a code length of 1 bit. The other is the signal intensity: the maximum signal amplitude of the syllable is converted by A-law PCM companding and quantized to 5 bits. This gives only 6 bits of prosodic parameters in total; adding the 11-bit syllable code gives 17 bits per syllable. Even at the fastest speaking rate of 5 syllables per second, the bit rate is only 85 bit/s. The measured sentence intelligibility of a vocoder implemented in this way is above 90%, which is comparable to the U.S. LPC-10 vocoder at 2.4 kbit/s. The synthesis method used here is pitch-synchronous overlap-add, together with some rules for coarticulation between syllables within a word. The decisive factor here is the use of the liaison information.
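The 17-bit syllable frame of this embodiment can be sketched as follows. This is only an illustration under stated assumptions: the A-law parameter A = 87.6, the mapping of the compressed amplitude onto 32 uniform levels, and the field order inside the frame are not given in the text, and the function names are hypothetical.

```python
import math

A = 87.6  # common A-law parameter; assumed, the text does not state it


def alaw_compress(x: float) -> float:
    """Map a normalized amplitude in [0, 1] onto [0, 1] with the A-law curve."""
    x = min(max(x, 0.0), 1.0)
    if x < 1.0 / A:
        return A * x / (1.0 + math.log(A))
    return (1.0 + math.log(A * x)) / (1.0 + math.log(A))


def quantize_intensity(peak: float) -> int:
    """Quantize the compressed peak amplitude of a syllable to 5 bits (0..31)."""
    return min(31, int(alaw_compress(peak) * 32))


def pack_frame(syllable_code: int, intensity: int, liaison: int) -> int:
    """Pack one syllable as [11-bit code | 5-bit intensity | 1-bit liaison]."""
    if not 0 <= syllable_code < 2 ** 11:
        raise ValueError("syllable_code must fit in 11 bits")
    if not 0 <= intensity < 2 ** 5:
        raise ValueError("intensity must fit in 5 bits")
    if liaison not in (0, 1):
        raise ValueError("liaison must be 0 or 1")
    return (syllable_code << 6) | (intensity << 1) | liaison


def unpack_frame(frame: int) -> tuple:
    """Inverse of pack_frame: return (syllable_code, intensity, liaison)."""
    return (frame >> 6) & 0x7FF, (frame >> 1) & 0x1F, frame & 1


frame = pack_frame(1234, quantize_intensity(0.5), 1)
bits_per_syllable = 11 + 5 + 1
bit_rate = bits_per_syllable * 5  # 5 syllables per second, as in the text
print(unpack_frame(frame), bits_per_syllable, bit_rate)
```

The logarithmic A-law curve spends more of the 32 levels on quiet syllables than a linear quantizer would, which is presumably why a companded value is quantized rather than the raw amplitude.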
B) A very-low-bit-rate coding embodiment with high-quality speech output:
To achieve better output speech quality, the following four prosodic parameters are used: voiced-segment duration, 6 bits; intensity parameter (as above), 5 bits; compressed pitch contour code, 25 bits; syllable liaison parameter, 1 bit. Adding the 11-bit syllable code gives 48 bits per syllable in total. The implementation uses pitch-synchronous overlap-add as the synthesis method, and the measured output speech intelligibility of the recognition-synthesis vocoder built in this way is 99.6%. The key factor here is the use of the pitch contour parameter, which makes the intonation lifelike and the tone transitions between syllables smooth.
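The bit budgets of the two embodiments can be checked with a short Python sketch. The field widths come from the text; the dictionary names and the use of the 5 syllables per second rate for embodiment B are assumptions made for illustration.

```python
SYLLABLES_PER_SECOND = 5  # fastest speaking rate quoted for embodiment A

# Field widths in bits, taken from the two embodiments described in the text.
LOWEST_RATE = {"syllable_code": 11, "intensity": 5, "liaison": 1}
HIGH_QUALITY = {
    "syllable_code": 11,
    "voiced_duration": 6,
    "intensity": 5,
    "pitch_contour": 25,
    "liaison": 1,
}


def bits_per_syllable(layout: dict) -> int:
    """Total number of bits spent on one syllable."""
    return sum(layout.values())


def bit_rate(layout: dict, rate: int = SYLLABLES_PER_SECOND) -> int:
    """Bit rate in bit/s at the given speaking rate in syllables per second."""
    return bits_per_syllable(layout) * rate


for name, layout in (("A", LOWEST_RATE), ("B", HIGH_QUALITY)):
    print(f"{name}: {bits_per_syllable(layout)} bits/syllable, "
          f"{bit_rate(layout)} bit/s")
```

At 5 syllables per second this gives 85 bit/s for embodiment A and 240 bit/s for embodiment B, the latter consistent with the target of less than 250 bit/s stated in the abstract.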
The above merely illustrates how prosodic parameters may be selected; the selection can be adjusted flexibly to obtain the desired performance. The decoding unit simply performs the inverse of the encoding process and is not described further here.
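As an illustration of the decoding side, the sketch below unpacks a stream of 17-bit frames and uses the liaison ("sound connection") bit to group syllables into words, since that parameter indicates whether the current syllable and the next belong to the same polysyllabic word. The frame layout, the polarity of the bit (1 meaning that the next syllable continues the word), and all function names are assumptions; the text does not specify them.

```python
from typing import Iterable, List, Tuple


def pack_frame(syllable_code: int, intensity: int, liaison: int) -> int:
    """Assumed layout: [11-bit code | 5-bit intensity | 1-bit liaison]."""
    return (syllable_code << 6) | (intensity << 1) | liaison


def unpack_frame(frame: int) -> Tuple[int, int, int]:
    """Split a 17-bit frame into (syllable_code, intensity, liaison)."""
    return (frame >> 6) & 0x7FF, (frame >> 1) & 0x1F, frame & 1


def group_into_words(frames: Iterable[int]) -> List[List[int]]:
    """Group syllable codes into words using the liaison bit.

    Assumes liaison == 1 means the next syllable belongs to the same word.
    """
    words: List[List[int]] = []
    current: List[int] = []
    for frame in frames:
        code, _intensity, liaison = unpack_frame(frame)
        current.append(code)
        if liaison == 0:      # word ends with this syllable
            words.append(current)
            current = []
    if current:               # stream ended in the middle of a word
        words.append(current)
    return words


frames = [pack_frame(10, 20, 1), pack_frame(11, 20, 0), pack_frame(12, 18, 0)]
print(group_into_words(frames))  # [[10, 11], [12]]
```

With word boundaries recovered in this way, a synthesizer could apply within-word coarticulation rules to each group before joining the words into a sentence.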

Claims (3)

1. A Chinese spoken-language recognition-synthesis vocoder, comprising a transmitting part, a receiving part, and a Chinese syllable table shared by both parts, wherein during communication the transmitting part at the transmitting end uses speech analysis and speech recognition techniques to encode the input speech in units of syllables, the receiving part at the receiving end resynthesizes speech from the received syllable code string, and the syllable table contains the 1300 conventional toned syllables of Mandarin, characterized in that:
1. both the transmitting part and the receiving part are provided with a prosodic information processing unit, which transmits and uses prosodic information to guarantee the output speech quality;
2. the syllable table additionally contains several hundred other toned syllables that ordinary users may produce in spoken Mandarin because of dialect habits or the influence of the language environment, as well as the er-suffixed (erhua) syllables and neutral-tone syllables commonly used in spoken Mandarin.
2. A prosodic information processing method in a Chinese spoken-language recognition-synthesis vocoder, wherein the prosodic information processing unit automatically detects and encodes, at the transmitting end, the duration and amplitude of each syllable in the input speech, and uses these parameters at the receiving end to control syllable synthesis, characterized in that:
1. the syllable liaison parameter is automatically detected and encoded at the transmitting end to indicate whether the current syllable and the next syllable belong to the same polysyllabic word, and the receiving end uses this parameter to segment words for the synthesis of words and sentences;
2. pitch detection and compressed encoding are performed at the transmitting end, and the received pitch parameters are converted at the receiving end so as to control the pitch contour of the synthesized syllable to resemble that of the original input syllable;
3. automatic detection and compressed encoding of the syllable signal intensity are performed at the transmitting end, and the received intensity parameters are converted at the receiving end so as to control the intensity contour of the synthesized syllable to resemble that of the original input syllable.
3. The method according to claim 2, characterized in that the speech input mode is compatible with both continuous speech input and intermittent input based on isolated syllables or isolated words, and in either case the receiving end can use the liaison parameter to synthesize fluent sentences.
CN94118778A 1994-12-06 1994-12-06 Chinese spoken language distinguishing and synthesis type vocoder Expired - Fee Related CN1085367C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN94118778A CN1085367C (en) 1994-12-06 1994-12-06 Chinese spoken language distinguishing and synthesis type vocoder

Publications (2)

Publication Number Publication Date
CN1122936A true CN1122936A (en) 1996-05-22
CN1085367C CN1085367C (en) 2002-05-22

Family

ID=5039006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN94118778A Expired - Fee Related CN1085367C (en) 1994-12-06 1994-12-06 Chinese spoken language distinguishing and synthesis type vocoder

Country Status (1)

Country Link
CN (1) CN1085367C (en)

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee