CN1924994B

CN1924994B - Embedded language synthetic method and system

Info

Publication number: CN1924994B
Application number: CN2005100863115A
Authority: CN
Inventors: 陶建华; 张皖志
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Zhongke Extreme Element Hangzhou Intelligent Technology Co Ltd
Priority date: 2005-08-31
Filing date: 2005-08-31
Publication date: 2010-11-03
Anticipated expiration: 2025-08-31
Also published as: CN1924994A

Abstract

The invention discloses an embedded speech synthesis method and system for the operating system of a handheld digital mobile device, which convert any text string received or input by the system into speech output. The initials and finals of Chinese syllables serve as the basic units of the synthesis system and the speech library. An original speech library based on initials and finals is created first; it is then quantized and compressed according to the contextual attributes and acoustic features of the initial and final samples; finally, the quantized and compressed corpus is encoded with a speech compression algorithm to obtain the final compressed speech library. The method improves the compression ratio of the synthesis system, thereby reducing the system resources it occupies on an embedded platform, while the synthesized speech retains good naturalness and intelligibility.

Description

An embedded speech synthesis method and system

Technical field

The present invention relates generally to a speech synthesis method and a speech synthesis system, and in particular to a speech synthesis method and system for portable handheld digital mobile devices, including mobile phones and palmtop computers.

Background technology

A speech synthesis system, also called a text-to-speech (TTS) system, converts any text string received by a computer into speech output. Its functional modules are generally divided into three main components: a text analysis module, a prosody generation module and an acoustic module. In recent years, synthesis based on large-scale corpora has become the mainstream technique in the field. Its basic idea is to select speech units from a large amount of natural speech according to specific rules, concatenate them, and adjust or modify their prosodic features to obtain satisfactory synthesized speech. To ensure that the synthesized result has rich prosodic expressiveness, the speech library often reaches several hundred megabytes. This is not a problem for the CPU, memory and other resources of a current desktop PC, but the storage space and computing power of current mainstream portable handheld digital mobile devices (including mobile phones and palmtop computers) fall far short of what such a method requires.

Existing embedded speech synthesis systems all use the Chinese syllable as the basic unit of synthesis. Starting from an existing large-scale speech library, all samples of each syllable are clustered according to the acoustic distance between samples; only the cluster centroid is kept as the representative of its class and the other samples in the class are discarded, which compresses the speech library. A speech library obtained in this way needs at least 1 MB of storage. If it is compressed further, the number of samples retained for each syllable drops sharply, and the naturalness and sound quality of the synthesized speech degrade markedly. A system of this size is still a considerable overhead for current mainstream handheld devices: the resources it requires are too costly relative to its importance on the device, and users find it hard to accept. An improved method is therefore needed to realize a speech synthesis system that occupies fewer resources on an embedded platform.

Summary of the invention

The technical problem to be solved by the present invention is to provide a Chinese speech synthesis method and system that can be used on portable handheld digital mobile devices, occupies few system resources, and still keeps the synthesized result natural and intelligible.

To achieve the above object, the invention provides an embedded speech synthesis method that converts any text string received or input by the operating system of a handheld digital mobile device into speech output, and that uses the initials and finals of Chinese syllables as the basic units of the synthesis system. The quantization and compression of the speech library is divided into the following three steps:

A. Create an original speech library based on initials and finals.

B. Quantize and compress the original speech library based on the contextual attributes and acoustic features of the initial and final samples.

C. Encode the quantized and compressed corpus with a speech compression algorithm to obtain the final compressed speech library.

In the above embedded speech synthesis method, the original speech library based on initial and final units is created as follows: each initial or final in the speech library is further classified according to the pronunciation characteristics of the final or initial adjacent to it within the same syllable.

In the above embedded speech synthesis method, the quantization and compression of the speech library that uses initials and finals as primitives is divided into the following six steps:

A. Create an empty speech library.

B. Read all original samples of one initial or final from the original speech library.

C. Coarse screening of the samples: remove the distorted samples of this initial or final that remain in the speech library because of human factors such as the speaker, the recording equipment and the labeling of the speech library.

D. Clustering of the samples: further cluster the coarsely screened samples according to segmental and suprasegmental features, keep the centroid of each resulting class as its representative, and discard the remaining samples.

E. Store all centroid samples in the newly created compressed speech library.

F. Check whether all initial and final units have been processed. If so, the offline procedure ends; if not, return to step B and repeat steps B, C, D and E until the whole original corpus has been processed.
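The six-step loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `compress_library`, `coarse_screen` and `cluster_centroids` are hypothetical names, and the two callables stand in for steps C and D, whose details are given separately.

```python
from typing import Callable, Dict, List, TypeVar

Sample = TypeVar("Sample")


def compress_library(
    raw_library: Dict[str, List[Sample]],
    coarse_screen: Callable[[List[Sample]], List[Sample]],
    cluster_centroids: Callable[[List[Sample]], List[Sample]],
) -> Dict[str, List[Sample]]:
    """Hypothetical driver for steps A-F; keys are initial/final unit symbols."""
    compressed: Dict[str, List[Sample]] = {}      # step A: empty library
    for unit, samples in raw_library.items():     # step B (the loop itself is step F)
        screened = coarse_screen(samples)         # step C: drop distorted samples
        centroids = cluster_centroids(screened)   # step D: keep one centroid per class
        compressed[unit] = centroids              # step E: store the centroids
    return compressed
```

Passing the screening and clustering procedures in as functions keeps the driver independent of how they are realized.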

In the above embedded speech synthesis method, the coarse screening of the initial and final samples comprises the following three steps:

A. Compute the average prosodic features of the original samples within a unit and remove samples that deviate too far from the average. The prosodic features considered are the fundamental frequency (F0) contour, the duration and the average energy of a sample.

B. Examine how strongly each sample is coarticulated with its neighboring units in the original speech stream and remove samples whose coarticulation is too strong.

C. Analyze the degree of voice-quality distortion of each sample and remove samples of poor sound quality.

In the above embedded speech synthesis method, the sample clustering comprises the following three steps:

A. Pre-classification of the initial and final units: pre-classify the samples according to their contextual attributes using the classification and regression tree (CART) method, generating one CART tree for each initial or final.

B. Clustering of finals: cluster the samples on each leaf node of a final's CART tree. The clustering feature is the F0 contour of the final; only the centroid of each class is kept and the remaining samples are discarded.

C. Clustering of initials: cluster the samples on each leaf node of an initial's CART tree. The clustering feature is the 12th-order Mel-frequency cepstral coefficients (MFCC) of the initial.

Using initials and finals as primitives markedly raises the compression ratio of the system and reduces the acoustic redundancy of the speech library as far as possible, so that efficient compression is achieved while the naturalness and intelligibility of the synthesized result are preserved. At an equal speech library size, the performance of this method is almost indistinguishable from that of syllable-based synthesis.

To better achieve the above object, the invention also provides an embedded speech synthesis system for the operating system of a handheld digital mobile device. It consists of an offline part of the speech synthesis system, a text input module, an online part of the speech synthesis system and a digital audio signal output module. The outputs of the offline part and of the text input module are electrically connected to the online part, and the output of the online part is electrically connected to the input of the digital audio signal output module.

In this embedded speech synthesis system, the offline part is used only while the system is in its offline working state, and only to generate the compressed speech library needed when the system works online. The offline part includes the original speech library, which contains recorded original speech that has been energy-normalized.

In this embedded speech synthesis system, the online part comprises the following modules:

A. a text analysis module, which analyzes the format and content of the input text and converts it into a string of initials and finals, attaching a series of related prosodic information to each initial or final;

B. a prosody prediction module, which receives the initial/final string with attached prosodic information, uses a statistical model to predict the corresponding target prosody values from that information, including the duration, F0 contour and average energy of each initial or final, and attaches these values to it;

C. a waveform concatenation module, which receives the initial/final string with attached target prosody values, selects from the compressed speech library, according to the prosodic information carried by the string, the sample numbers closest to the target prosody values, restores the speech signals corresponding to those sample numbers using the decompression algorithm that matches the encoding algorithm, concatenates them, and smooths the joins;

D. a speech decoding module; and

E. the compressed speech library.

The text input module is electrically connected in sequence to the text analysis module, the prosody prediction module and the waveform concatenation module. The offline part is electrically connected in sequence to the compressed speech library, the speech decoding module and the waveform concatenation module. The output of the waveform concatenation module is electrically connected to the digital audio signal output module, which plays the concatenated digital audio signal.

An embedded speech synthesis system built according to the above method can run entirely under the operating system of a handheld digital mobile device, and neither the resources it occupies nor the computation it requires exceeds the capabilities of the handheld device itself.

The invention is further described below with reference to the drawings and embodiments; the detailed description of each component of the system explains the steps of the invention and how they are carried out.

Description of drawings

Figure 1 is a structural diagram of the embedded speech synthesis system based on initials and finals;

Figure 2 is a schematic diagram of the quantization and compression of the initial/final speech library;

Figure 3 is a schematic diagram of the coarse screening of initial and final samples;

Figure 4 is a schematic diagram of the clustering of initial and final samples.

Embodiment

As shown in Figure 1, in a preferred embodiment of the invention the embedded speech synthesis system is installed in the operating system of a palmtop computer. The system comprises, connected in sequence: the offline part 1 of the speech synthesis system, the palmtop text input module 2, the online part 3 of the speech synthesis system and the digital audio signal output module 4.

The offline part 1 is used only while the speech synthesis system is in its offline working state, and only to generate the compressed speech library b needed when the system works online. The original speech library a contains recorded original speech that has been energy-normalized. Generating the compressed speech library b offline from the original speech library a comprises: step 70, creation of the initial/final speech library; step 80, quantization and compression of the initial/final speech library; and step 90, encoding and packaging of the speech library.

In step 70, the recorded original speech library is first segmented automatically with the HTK speech recognition toolkit to obtain the boundary positions of the initial and final segments in the original sentences, and a pitch detection tool is used to mark the peak positions of the speech waveform; the automatically obtained boundary and peak positions are then proofread by hand. Next, in the segmented and labeled speech library, each initial or final is further classified according to the pronunciation characteristics of the final or initial adjacent to it within the syllable. Initials are divided into four classes by the following final: followed by an open-mouth (kaikou) final; followed by an even-teeth (qichi) final, that is, a final that is i or begins with i; followed by a closed-mouth (hekou) final; and followed by a round-mouth (cuokou) final. Finals are divided into nine classes by the preceding initial: preceded by an unaspirated stop, an aspirated stop, an unaspirated affricate, an aspirated affricate, a voiceless fricative, a voiced fricative, a nasal, a lateral, or a zero initial. Chinese has 21 initials and 43 finals, which yields 471 context-dependent initial and final units in total; these classified units serve as the basic units of the speech library.

At the same time, syntactic analysis of the original sentence text is used to derive the high-level prosodic information of each sample, comprising: the type and ID of the final/initial in the same syllable as the current initial/final; the type and ID of the final of the preceding syllable; the type and ID of the initial of the following syllable; the tone of the syllable containing the unit, of the preceding syllable and of the following syllable; the relative position of each lower prosodic level within the next higher prosodic level (the prosodic levels are prosodic word, prosodic phrase and sentence, and the relative positions are head, middle and tail of the level); the lengths of the prosodic word and prosodic phrase containing the unit's syllable, in syllables; and the lengths of the silent segments before and after the unit's syllable. All of this information is saved in one file, which serves as the information file of the initial or final. The original waveform files and information files of all initials and finals together make up the initial/final speech library.
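The figure of 471 units can be checked with a short calculation. The sketch below assumes that every initial is split into all four classes and every final into all nine; the text does not say whether any combination is excluded, but the stated total agrees with the full product. The variable names are illustrative only.

```python
# Counts stated in the text.
N_INITIALS = 21        # initials in Chinese
N_FINALS = 43          # finals in Chinese
INITIAL_CLASSES = 4    # classes of an initial, by the type of the following final
FINAL_CLASSES = 9      # classes of a final, by the type of the preceding initial

# Assumption: every unit occurs in every one of its context classes.
context_initials = N_INITIALS * INITIAL_CLASSES
context_finals = N_FINALS * FINAL_CLASSES
total_units = context_initials + context_finals

print(context_initials, context_finals, total_units)  # 84 387 471
```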

In step 80, as shown in Figure 2, the quantization and compression of the speech library that uses initials and finals as primitives is divided into the following six steps:

Step 100, the program creates an empty compressed speech library.

Step 110, all original samples of one initial or final are read from the original speech library.

Step 120, coarse screening of the samples, as shown in Figure 3, removes the distorted samples of this initial or final that remain in the speech library. Because of human factors such as the speaker, the recording equipment and the labeling of the speech library, the library contains many samples whose acoustic features are abnormal. When the speech library is large, these samples are unlikely to be selected and have little effect on the synthesized result. When the library is small, however, the remaining distorted samples are easily selected for synthesis, which markedly reduces the stability of the result, and they also take up valuable storage space. This embodiment applies the following three screening criteria in turn to pre-screen the speech library automatically and remove such unstable samples. Step 200 reads all samples of one initial or final unit. Step 210 reads one sample of this unit. Step 220 judges whether the sample read in step 210 satisfies the prosodic abnormality criterion. The prosodic factors considered here are the duration, F0 contour and energy of the sample. The prosodic abnormality (Prosodic Salience) of the i-th sample is defined as:

$$PS(i) = \frac{\omega_1 D_d(i) + \omega_2 D_p(i) + \omega_3 D_e(i)}{\omega_1 + \omega_2 + \omega_3} \tag{1}$$

where the sub-abnormalities are:

$$D_d(i) = \left(\frac{d(i) - \bar{d}}{\bar{d}}\right)^2 \tag{2}$$

$$D_p(i) = \left(\frac{p(i) - \bar{p}}{\bar{p}}\right)^2 \tag{3}$$

$$D_e(i) = \left(\frac{e(i) - \bar{e}}{\bar{e}}\right)^2 \tag{4}$$

Here $d(i)$, $p(i)$ and $e(i)$ are the duration, mean fundamental frequency and average energy of the i-th sample, and $\bar{d}$, $\bar{p}$ and $\bar{e}$ are the means of the corresponding features over all samples of this primitive. The weights $\omega_1$, $\omega_2$ and $\omega_3$ of the sub-abnormalities are determined experimentally. For any sample i and any $x \in \{d, p, e\}$, if

$$D_x(i) > T_x \tag{5}$$

or

$$PS(i) > T \tag{6}$$

then the sample is deleted, where $T_x$ and $T$ are the thresholds of each sub-abnormality and of the total prosodic abnormality, respectively. This criterion rejects samples whose duration or peak points are mislabeled, as well as samples whose energy is too weak or too strong because of human factors during recording. Step 230 judges whether the sample read in step 210 satisfies the adhesion criterion. This criterion examines how strongly a sample is coarticulated with its neighboring units in the original speech stream. For a system based on a small speech library, the loss of sound quality caused by spectral discontinuity at the joins is particularly severe, so rejecting strongly adhering samples as far as possible while the library is built is a feasible remedy. The adhesion (Context Dependency) of the i-th sample is defined as:

$$CD(i) = \frac{\omega_l \bar{e}_l(i) + \omega_r \bar{e}_r(i)}{\omega_l + \omega_r} \tag{7}$$

where $\bar{e}_l(i)$ and $\bar{e}_r(i)$ are the average energies at the left and right boundaries of the sample, and the weights can be set according to the acoustic features of the unit. In this embodiment, $\omega_l$ is set to 0 for plosives and affricates. As before, if $CD(i)$ of sample i exceeds a threshold T, the sample is rejected. Step 240 judges whether the sample read in step 210 satisfies the voice-quality distortion criterion. During long recording sessions, fatigue or other psychological factors of the speaker may make the quality of some recorded samples abnormal, which shows up as breathy voice, whisper or noticeable emotional coloring. Such samples tend to occur at the ends of sentences, have weak energy, and show poor periodicity in their vowels. For sample i, the voice-quality distortion (Quality Distortion) is defined as:

$$QD(i) = \frac{n_{peak}(i)}{\bar{e}(i) \cdot dur(i)} \tag{8}$$

where $n_{peak}(i)$ is the number of peak points of the sample, $\bar{e}(i)$ is its average energy and $dur(i)$ is its duration. If $QD(i)$ of sample i exceeds a threshold T, the sample is rejected. In step 250, a sample that passes all three criteria is retained in the compressed speech library. Step 260 checks whether all samples of this unit have been processed; if not, the procedure returns to step 210 until all samples have been processed; if so, step 270 is performed. Step 270 checks whether all initial and final units have been processed; if not, the procedure returns to step 200; if so, the coarse screening step 120 ends.
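The three screening criteria of step 120 can be illustrated with a short sketch. It is not the implementation of the embodiment: the `Sample` fields, the function names, the equal weights and all threshold values are assumptions chosen for the example (the text only says that the weights are found experimentally), and a single sub-threshold `t_sub` is used here in place of the per-feature thresholds T_x.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence, Tuple


@dataclass
class Sample:
    duration: float      # d(i)
    f0_mean: float       # p(i)
    energy: float        # e(i), average energy of the sample
    left_energy: float   # average energy at the left boundary
    right_energy: float  # average energy at the right boundary
    n_peaks: int         # number of labeled peak points


def rel_sq_dev(value: float, avg: float) -> float:
    """Squared relative deviation from the unit mean, as in Eqs. (2)-(4)."""
    return ((value - avg) / avg) ** 2


def context_dependency(s: Sample, w_left: float = 1.0, w_right: float = 1.0) -> float:
    """Weighted boundary energy, as in Eq. (7)."""
    return (w_left * s.left_energy + w_right * s.right_energy) / (w_left + w_right)


def quality_distortion(s: Sample) -> float:
    """Peak count divided by energy times duration, as in Eq. (8)."""
    return s.n_peaks / (s.energy * s.duration)


def coarse_screen(
    samples: Sequence[Sample],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed weights
    t_sub: float = 1.0,    # assumed threshold for Eq. (5)
    t_ps: float = 1.0,     # assumed threshold for Eq. (6)
    t_cd: float = 0.5,     # assumed adhesion threshold
    t_qd: float = 1000.0,  # assumed distortion threshold
) -> List[Sample]:
    d_bar = mean(s.duration for s in samples)
    p_bar = mean(s.f0_mean for s in samples)
    e_bar = mean(s.energy for s in samples)
    kept: List[Sample] = []
    for s in samples:
        subs = (
            rel_sq_dev(s.duration, d_bar),
            rel_sq_dev(s.f0_mean, p_bar),
            rel_sq_dev(s.energy, e_bar),
        )
        ps = sum(w * d for w, d in zip(weights, subs)) / sum(weights)  # Eq. (1)
        if any(d > t_sub for d in subs) or ps > t_ps:   # Eqs. (5) and (6)
            continue
        if context_dependency(s) > t_cd:                # adhesion criterion
            continue
        if quality_distortion(s) > t_qd:                # distortion criterion
            continue
        kept.append(s)
    return kept
```

With these assumed thresholds, a sample whose duration is several times the unit mean fails Eq. (5) and is dropped, while samples close to the mean pass all three checks.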

Step 130, clustering of the samples, as shown in Figure 4, further clusters the samples that passed the coarse screening of step 120 according to segmental and suprasegmental features, keeps the centroid of each resulting class as its representative and discards the remaining samples. The units are first pre-classified, and each pre-classified group is then clustered further on its own acoustic features, which preserves the diversity of segmental and suprasegmental features in the compressed speech library. Step 300 reads all samples of one initial or final unit. Step 310 pre-classifies the samples by their phonological context attributes. This embodiment uses the CART method from the field of data mining as the classification tool, and the decision attributes are built on the contextual description, including:

The type and ID of the final/initial in the same syllable as the current initial/final.

The type and ID of the final of the preceding syllable.

The type and ID of the initial of the following syllable.

The tone of the syllable containing the unit, the tone of the preceding syllable and the tone of the following syllable (five kinds: high level, rising, falling-rising, falling and neutral).

The relative position of each lower prosodic level within the next higher prosodic level; the prosodic levels are prosodic word, prosodic phrase and sentence, and the relative positions are head, middle and tail of the level.

The lengths of the prosodic word and prosodic phrase containing the unit's syllable, in syllables.

The lengths of the silent segments before and after the unit's syllable.

The 12th-order Mel-frequency cepstral coefficients (MFCC) are used as the feature parameters of each unit, and the Mahalanobis distance is used to compute the distance between units. The distance between units M and N is defined in Eq. (9):

$$dis(M, N) = \sum_{i=1}^{|M|} \sum_{j=1}^{12} \left[ P_{ij}(M) - P_{\left(i \cdot \frac{|M|}{|N|}\right) j}(N) \right]^2 \tag{9}$$

where $P_{ij}(M)$ is the j-th MFCC parameter of the i-th frame and $|M|$ is the number of frames of M. In the actual computation, the MFCCs of the transition segment between the initial and the final within a syllable are also included in the unit's parameter vector, so that coarticulation between them is modeled better and the classification result is more sensitive to the adjacent initial or final. A CART training tool is used to generate one CART tree for each initial or final, with the number of samples on each leaf node kept between 50 and 100. Step 320 determines whether the current unit is an initial or a final. If it is an initial, step 330 clusters the samples on the leaf nodes of that initial's CART tree using the MFCC parameters of the initial samples; if it is a final, step 340 clusters the samples on the leaf nodes of that final's CART tree using the F0 contours of the final samples. Step 350 checks whether all initial and final units have been processed; if so, the sample clustering step 130 ends; if not, the procedure returns to step 300 and processes the remaining units.
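The frame-aligned distance of Eq. (9) can be sketched in a few lines. One point is an assumption of this sketch: the published index reads i·|M|/|N|, which can run past the last frame of N when M is longer, so the code maps frame i of M linearly onto frame floor(i·|N|/|M|) of N (0-based), which always stays in range. The names `Frames` and `unit_distance` are illustrative.

```python
from typing import Sequence

# One feature vector (e.g. 12 MFCC values) per frame.
Frames = Sequence[Sequence[float]]


def unit_distance(m: Frames, n: Frames) -> float:
    """Sum of squared MFCC differences between linearly aligned frames.

    Assumption: frame i of M is compared with frame floor(i * |N| / |M|) of N,
    a bounded reading of the frame index in Eq. (9).
    """
    total = 0.0
    for i, frame_m in enumerate(m):
        k = min(len(n) - 1, i * len(n) // len(m))   # aligned frame in N
        frame_n = n[k]
        total += sum((a - b) ** 2 for a, b in zip(frame_m, frame_n))
    return total
```

As written, the measure is a sum of squared differences over aligned frames; it is asymmetric when the two units have different frame counts, because the outer sum runs over the frames of M only.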

Step 140 keeps the centroid samples on the CART leaf nodes of all initial and final units in the final compressed speech library and discards all other samples.

Step 150 checks whether all initial and final units have been processed. If not, the procedure returns to step 110 and repeats steps 110, 120, 130, 140 and 150 until all units have been processed; if so, the quantization and compression step 80 ends.
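Step 140 keeps one centroid sample per class. The text does not say how that sample is picked; since the library stores real recordings, one plausible reading is the medoid, that is, the member with the smallest total distance to the other members of its class. The sketch below shows that reading only; `medoid` and `keep_centroids` are hypothetical names, and the distance function is supplied by the caller.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def medoid(cluster: Sequence[T], distance: Callable[[T, T], float]) -> T:
    """Return the member with the smallest summed distance to all members."""
    return min(cluster, key=lambda s: sum(distance(s, other) for other in cluster))


def keep_centroids(
    clusters: Sequence[Sequence[T]], distance: Callable[[T, T], float]
) -> List[T]:
    """Keep one representative per non-empty cluster and drop the rest."""
    return [medoid(c, distance) for c in clusters if c]
```

For example, with scalar features and absolute difference as the distance, the cluster `[1.0, 2.0, 10.0]` is represented by `2.0`.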

In step 90, the samples in the quantized and compressed speech library are compressed by a speech compression algorithm into speech segments that occupy less space, and the encoded waveform files and the information files generated in step 70 are organized into a single file. In this embodiment, the packaging method of step 90 combines the encoded speech code words into one file according to a fixed rule, and the index of the compressed speech library is built on the symbols that represent the different initials and finals. The speech compression algorithm may be any algorithm that meets the resource limits of the handheld device (storage space and computational complexity) and satisfies the listening requirement (user satisfaction), for example G.723.1 or another low-bit-rate speech compression algorithm widely used in communication systems, or any other speech codec with a high compression ratio and low distortion, provided its computational complexity and storage requirements allow it to run on the handheld device. Step 90 produces the compressed speech library b, at which point the offline part 1 has finished its work.
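The text does not specify the packaging rule, so the following is only a sketch of one possible single-file layout with an index keyed by unit symbol. The layout (a 4-byte index size, then index entries, then length-prefixed code words), the function names and the unit symbols in the usage example are all assumptions made for illustration; the code words are treated as opaque bytes produced by whatever codec is chosen.

```python
import struct
from typing import Dict, List


def pack_library(units: Dict[str, List[bytes]]) -> bytes:
    """Pack encoded samples into one blob: [index size][index][payload]."""
    index = bytearray()
    payload = bytearray()
    for symbol, samples in units.items():
        name = symbol.encode("utf-8")
        index += struct.pack("<H", len(name)) + name
        # Entry: payload offset of the unit's first sample, number of samples.
        index += struct.pack("<IH", len(payload), len(samples))
        for code_words in samples:
            payload += struct.pack("<I", len(code_words)) + code_words
    return struct.pack("<I", len(index)) + bytes(index) + bytes(payload)


def read_samples(blob: bytes, symbol: str) -> List[bytes]:
    """Look a unit up in the index and return its encoded samples."""
    (index_size,) = struct.unpack_from("<I", blob, 0)
    pos, index_end = 4, 4 + index_size
    while pos < index_end:
        (name_len,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        offset, count = struct.unpack_from("<IH", blob, pos)
        pos += 6
        if name == symbol:
            cursor = index_end + offset
            samples = []
            for _ in range(count):
                (size,) = struct.unpack_from("<I", blob, cursor)
                cursor += 4
                samples.append(blob[cursor:cursor + size])
                cursor += size
            return samples
    raise KeyError(symbol)
```

A usage example with made-up unit symbols: `read_samples(pack_library({"b+open": [b"\x01\x02"]}), "b+open")` returns `[b"\x01\x02"]`. Keeping the index at the front lets a device locate a unit without scanning the payload.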

As shown in Figure 1, the text input module 2 receives the input text. In this embodiment the system provides a handwriting input interface: the user can enter the text to be synthesized with the stylus supplied with the palmtop computer, can open a text file and synthesize the whole file, or can select several lines of the file with the stylus and synthesize them separately.

The online part 3 of the speech synthesis system comprises, connected in sequence, the text analysis module 20, the prosody prediction module 30, the waveform concatenation module 40, the speech decoding module 60 and the compressed speech library module b. The text analysis module 20 receives textual input and, by analyzing the format and content of the input text, converts the input Chinese characters into the corresponding string of initials and finals, attaching a series of related prosodic information to each initial or final. The prosody prediction module 30 receives the initial/final string with attached prosodic information, uses a statistical model to predict the corresponding target prosody values from that information, including the duration, F0 contour and average energy of each initial or final, and attaches these values to it. The waveform concatenation module 40 receives the initial/final string with attached target prosody values, selects from the compressed speech library, according to the prosodic information carried by the string, the sample numbers closest to the target prosody values, restores the speech signals corresponding to those sample numbers using the decompression algorithm that matches the encoding algorithm, concatenates them, and smooths the joins.
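The work of the waveform concatenation module can be illustrated with a small sketch. Both parts are assumptions rather than the method of the embodiment: the selection cost is a sum of squared relative deviations over duration, mean F0 and mean energy (the text only says "closest to the target prosody values" and refers to an F0 contour, which is reduced to a mean here), and the smoothing is a simple linear crossfade over a few samples. `select_sample` and `concatenate` are hypothetical names.

```python
from typing import List, Sequence, Tuple

# (sample number, duration, mean F0, mean energy)
Candidate = Tuple[int, float, float, float]


def select_sample(
    candidates: Sequence[Candidate], target: Tuple[float, float, float]
) -> int:
    """Return the number of the candidate closest to the target prosody.

    Assumed cost: sum of squared relative deviations; targets must be nonzero.
    """
    def cost(c: Candidate) -> float:
        return sum(((v - t) / t) ** 2 for v, t in zip(c[1:], target))

    return min(candidates, key=cost)[0]


def concatenate(segments: Sequence[Sequence[float]], overlap: int = 32) -> List[float]:
    """Join decoded waveforms, crossfading `overlap` samples at each join."""
    out: List[float] = list(segments[0])
    for seg in segments[1:]:
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for k in range(n):
            w = (k + 1) / (n + 1)   # fade-in weight of the incoming segment
            out[start + k] = (1.0 - w) * out[start + k] + w * seg[k]
        out.extend(seg[n:])
    return out
```

The crossfade shortens the output by `overlap` samples per join; a real system would decode each selected sample with the codec's decoder before this step.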

The digital audio signal output module 4 plays the concatenated digital audio signal.

The invention relates to a speech synthesis method and system. The method raises the compression ratio of the speech library of a synthesis system on an embedded platform, greatly reducing the system resources it occupies there, while keeping the synthesized result natural and intelligible.

The invention is used on a palmtop computer, and all speech functions can be enabled or disabled on the handheld device at any time. When the speech functions are not enabled, the original functions of the handheld device are unaffected.

The embodiment described above is a preferred embodiment of the invention. The invention is not limited to palmtop computers and can also be applied to a variety of handheld mobile devices. Following the main idea of the invention, those of ordinary skill in the art can produce many similar or equivalent applications; the scope of protection of the invention is therefore defined by the claims.

Claims

1. An embedded speech synthesis method for the operating system of a handheld digital mobile device, which converts any text string received or input by the system into speech output, characterized in that: the initials and finals of Chinese are used as the basic units of the synthesis system and the speech library; and the quantization and compression of the speech library is divided into the following three steps:

A. creating an original speech library based on initials and finals;

B. quantizing and compressing the original speech library based on the contextual attributes and acoustic features of the initial and final samples;

C. encoding and compressing the quantized and compressed speech library with a speech compression algorithm to obtain the final compressed speech library.

2. The embedded speech synthesis method according to claim 1, characterized in that: in step A, the original speech library with initials and finals as units is created as follows: each initial or final in the speech library is further classified according to the pronunciation characteristics of the final or initial adjacent to it within the same syllable.

3. The embedded speech synthesis method according to claim 1, characterized in that: in step B, the quantization and compression of the speech library with initials and finals as units is divided into the following six steps:

A. creating an empty speech library;

B. reading in, one unit at a time, all original samples of one initial or final from the original speech library;

C. a rough selection step for initial and final samples, for removing the distorted samples that remain in the speech library owing to the speaker, the recording equipment and the manual phoneme labelling of the library;

D. a clustering step for initial and final samples, for further clustering the rough-selected samples according to segmental and suprasegmental features, retaining the centroid of each class as its representative and discarding the remaining samples;

E. storing all centroid initial and final samples in the newly created compressed speech library;

F. determining whether all initial and final units have been processed; if so, the quantization and compression of the speech library ends; if not, returning to step B and repeating steps B, C, D and E until the entire original speech library has been processed.
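The loop formed by steps A to F can be sketched as follows. This is only an illustrative outline: `compress_library` is a hypothetical name, the original library is assumed to be a mapping from each initial or final to its list of samples, and the rough selection and clustering steps are passed in as callables because the claim does not fix their internals.

```python
from typing import Callable, Dict, List, TypeVar

S = TypeVar("S")  # type of one sample; left abstract in this sketch


def compress_library(
    original: Dict[str, List[S]],
    rough_select: Callable[[List[S]], List[S]],
    cluster_centroids: Callable[[List[S]], List[S]],
) -> Dict[str, List[S]]:
    """Quantize an original library unit by unit (steps A to F)."""
    compressed: Dict[str, List[S]] = {}       # step A: empty library
    for unit, samples in original.items():    # step B: one unit at a time
        kept = rough_select(samples)          # step C: drop distorted samples
        centroids = cluster_centroids(kept)   # step D: keep class centroids
        compressed[unit] = centroids          # step E: store the centroids
    return compressed                         # step F: all units processed
```

With placeholder callables, for example `rough_select=lambda xs: [x for x in xs if x < 50]` and `cluster_centroids=lambda xs: xs[:1]`, the function keeps one surviving sample per unit.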

4. The embedded speech synthesis method according to claim 3, characterized in that: in step C, the rough selection of initial and final samples comprises the following three steps:

A. statistically analyzing the average prosodic features of the original initial and final samples in the unit, and removing samples that deviate too far from the average; the prosodic features considered include the fundamental frequency contour, duration and average energy of the samples;

B. examining the degree of coarticulation between the initial and final samples in the speech library and their adjacent units in the original speech stream, and removing samples with excessively strong coarticulation;

C. analyzing the initial and final samples for abnormal sound quality, and removing samples with poor sound quality.
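Step A of the rough selection can be illustrated with a simple outlier filter. The sketch below is an assumption-laden example: the function name `remove_prosodic_outliers`, the reduction of each sample to a tuple (mean F0, duration, average energy), and the threshold of two standard deviations are all hypothetical, since the claim does not state how "too far from the average" is measured. Steps B and C are not sketched because no measure of coarticulation or sound quality is given.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# (mean F0 in Hz, duration in seconds, average energy); an assumed layout
Features = Tuple[float, float, float]


def remove_prosodic_outliers(samples: Sequence[Features],
                             max_z: float = 2.0) -> List[int]:
    """Return indices of samples within max_z standard deviations
    of the unit mean on every prosodic feature."""
    if not samples:
        return []
    dims = range(len(samples[0]))
    mu = [mean([s[d] for s in samples]) for d in dims]
    sigma = [pstdev([s[d] for s in samples]) for d in dims]
    kept = []
    for i, s in enumerate(samples):
        # A feature with zero spread cannot mark any sample as an outlier.
        within = all(sigma[d] == 0 or abs(s[d] - mu[d]) <= max_z * sigma[d]
                     for d in dims)
        if within:
            kept.append(i)
    return kept
```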

5. The embedded speech synthesis method according to claim 3, characterized in that: in step D, the sample clustering comprises the following three steps:

A. a pre-classification step for initial and final units, for pre-classifying the samples using their context attributes; the classification uses the Classification and Regression Tree (CART) method, and one CART tree is generated for each initial and final;

B. a final clustering step, for clustering the samples at each leaf node of the CART tree of a final; the clustering feature is the fundamental frequency contour of the final; only the centroid of each class is retained and all other samples are discarded;

C. an initial clustering step, for clustering the samples at each leaf node of the CART tree of an initial; the clustering feature is the 12th-order Mel-frequency cepstral coefficients (MFCC) of the initial.
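The clustering applied at one leaf node in steps B and C can be sketched as follows. The claim does not name a clustering algorithm, so plain k-means is swapped in here as an assumption, and "retaining the centroid" is interpreted as keeping the real sample nearest each cluster mean. The name `centroid_samples` is hypothetical; each vector stands for either an F0 contour resampled to a fixed length (finals) or a 12th-order MFCC vector (initials). The CART pre-classification of step A is not shown.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_samples(vectors: List[Vector], k: int,
                     iterations: int = 20) -> List[int]:
    """Cluster with k-means and return, per non-empty cluster,
    the index of the sample nearest the cluster mean."""
    if not vectors:
        return []
    k = max(1, min(k, len(vectors)))
    dim = len(vectors[0])
    # Deterministic start from evenly spaced samples (an assumption).
    centers = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    groups: List[List[int]] = []
    for _ in range(max(1, iterations)):
        groups = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            nearest = min(range(k), key=lambda c: sq_dist(v, centers[c]))
            groups[nearest].append(i)
        for c, members in enumerate(groups):
            if members:  # leave an empty cluster's center unchanged
                centers[c] = [sum(vectors[i][d] for i in members) / len(members)
                              for d in range(dim)]
    kept = [min(members, key=lambda i: sq_dist(vectors[i], centers[c]))
            for c, members in enumerate(groups) if members]
    return sorted(kept)
```

The returned indices identify the samples to store in the compressed speech library; all other samples at the leaf would be discarded.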

6. An embedded speech synthesis system which uses the embedded speech synthesis method according to claim 1 and can be applied under the operating system of a handheld digital mobile device, characterized in that:

it consists of an offline part of the speech synthesis system, a text input module, an online part of the speech synthesis system and a digital speech signal output module; wherein the outputs of the offline part and of the text input module are electrically connected to the online part, and the output of the online part is electrically connected to the input of the digital speech signal output module;

the offline part of the speech synthesis system is used only when the speech synthesis system works offline, and only to generate the compressed speech library that the synthesis system needs when working online; the offline part includes the original speech library, which contains the recorded original speech after energy normalization.

7. The embedded speech synthesis system according to claim 6, characterized in that: the online part of the speech synthesis system comprises the following modules:

A. a text analysis module, for analyzing the format and content of the input text and converting it into a sequence of initials and finals, while attaching a series of related prosodic information to each initial and final;

B. a prosody prediction module, for receiving the initial/final sequence with the attached prosodic information, using a statistical model to predict from that information the corresponding target prosodic values, including the duration, fundamental frequency contour and average energy of each initial and final, and attaching these values to the unit;

C. a waveform concatenation module, for receiving the initial/final sequence with the attached target prosodic values, selecting from the compressed speech library, according to the prosodic information carried by the sequence, the sample numbers closest to the target prosodic values, using the decompression algorithm corresponding to the encoding algorithm to restore the speech signals corresponding to those sample numbers, concatenating them, and smoothing the signal at the joins;

D. a speech decoding module; and

E. a compressed speech library;

wherein the text input module is electrically connected in sequence to the text analysis module, the prosody prediction module and the waveform concatenation module; the offline part of the speech synthesis system is electrically connected in sequence to the compressed speech library, the speech decoding module and the waveform concatenation module; and the output of the waveform concatenation module is electrically connected to the digital speech signal output module, which plays the concatenated digital speech signal.