Summary of the invention
The technical problem to be solved by this invention is to provide a Chinese speech synthesis method and system that can be used on portable hand-held digital mobile devices, that occupy few system resources, and that at the same time keep the synthesized result reasonably natural and intelligible.
To achieve the above object, the invention provides an embedded speech synthesis method for converting any text string, received as input by the operating system of a hand-held digital mobile device, into speech output. The method uses the initials and finals of Chinese as the basic units of the synthesis system. The quantization and compression of the speech database is divided into the following three steps:
A. Create an original speech database based on initials and finals.
B. Quantize and compress the original speech database based on the contextual attributes and acoustic features of the initial/final samples.
C. Encode and compress the quantized corpus with a speech compression algorithm to obtain the final compressed speech database.
The above embedded speech synthesis method is characterized in that the original speech database based on initial/final units is created as follows: each initial or final in the speech database is further classified according to the pronunciation characteristics of the final or initial adjacent to it within the same syllable.
The above embedded speech synthesis method is characterized in that the quantization and compression of the speech database, with initials and finals as primitives, is divided into the following six steps:
A. Create an empty speech database.
B. Read in all original samples of one initial/final unit from the original speech database at a time.
C. Coarse screening of initial/final samples: reject the distorted samples of this unit that were left in the speech database by human factors such as the speaker, the recording equipment and the labelling of the speech database.
D. Clustering of initial/final samples: further cluster the screened samples according to segmental and suprasegmental features, retain the centroid of each cluster as the representative of that cluster, and discard the remaining samples.
E. Store all centroid samples in the newly created compressed speech database.
F. Determine whether all initial/final units have been processed. If so, the off-line procedure ends; if not, return to step B and repeat steps B, C, D and E until the whole original corpus has been processed.
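The six steps above can be read as a single loop over units. The following is a minimal sketch only: the function name `compress_database`, the `Sample` type and the two callables standing in for steps C and D are hypothetical and do not come from the specification.

```python
from typing import Callable, Dict, List

Sample = dict  # hypothetical: one recorded sample together with its features


def compress_database(
    original: Dict[str, List[Sample]],
    coarse_screen: Callable[[List[Sample]], List[Sample]],
    cluster_centroids: Callable[[List[Sample]], List[Sample]],
) -> Dict[str, List[Sample]]:
    """Quantize an original database, one initial/final unit at a time."""
    compressed: Dict[str, List[Sample]] = {}      # step A: empty database
    for unit, samples in original.items():        # steps B and F: next unit
        screened = coarse_screen(samples)         # step C: drop distorted samples
        centroids = cluster_centroids(screened)   # step D: keep cluster centroids
        compressed[unit] = centroids              # step E: store the centroids
    return compressed
```

Passing the screening and clustering stages in as callables keeps the loop itself independent of how either stage is implemented.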
The above embedded speech synthesis method is characterized in that the coarse screening step for initial/final samples comprises the following three steps:
A. Statistically analyze the average prosodic features of the original samples within a unit and reject samples that deviate too far from the average. The prosodic features considered are the pitch (fundamental frequency) contour, the duration and the average energy of the sample.
B. Examine the degree of coarticulation between each sample and its adjacent units in the original speech stream of the database, and reject samples whose coarticulation is too strong.
C. Analyze the voice-quality distortion of each sample and reject samples of poor voice quality.
The above embedded speech synthesis method is characterized in that the sample clustering step comprises the following three steps:
A. Pre-classification of initial/final units: pre-classify the samples according to their contextual attributes, using the classification and regression tree (CART) method, so that one CART tree is generated for each initial or final.
B. Final clustering: cluster the samples on each leaf node of a final's CART tree. The clustering feature is the pitch contour of the final; only the centroid of each cluster is retained and the remaining samples are discarded.
C. Initial clustering: cluster the samples on each leaf node of an initial's CART tree. The clustering feature is the 12th-order Mel-frequency cepstral coefficients (MFCC) of the initial.
By using initials and finals as primitives, the above method markedly raises the compression ratio of the system and reduces the acoustic redundancy in the speech database as far as possible, thereby achieving efficient compression while preserving the naturalness and intelligibility of the synthesized result. Compared with a syllable-based synthesis method at the same database size, this method shows almost no difference in performance.
To better achieve the above object, the present invention also provides an embedded speech synthesis system applied to the operating system of a hand-held digital mobile device. The system consists of an off-line part of the speech synthesis system, a text input module, an on-line part of the speech synthesis system and a digital speech signal output module. The outputs of the off-line part and of the text input module are electrically connected to the on-line part, and the output of the on-line part is electrically connected to the input of the digital speech signal output module.
In the described embedded speech synthesis system, the off-line part is used only when the system works in the off-line state, and serves only to generate the compressed speech database needed when the system works on-line. The off-line part comprises an original speech database, which contains recorded original speech that has undergone energy normalization.
In the described embedded speech synthesis system, the on-line part comprises the following modules:
A. A text analysis module, which analyzes the format and content of the input text and converts it into a sequence of initials and finals, while attaching a series of related prosodic information to each initial and final;
B. A prosody prediction module, which receives the initial/final sequence carrying prosodic information, uses a statistical model to predict from that information the corresponding target prosody values, including the duration, pitch contour and average energy of each initial and final, and attaches them to the initial or final;
C. A waveform concatenation module, which receives the initial/final sequence carrying target prosody values, selects from the compressed speech database, according to the prosodic information carried by the sequence, the sample numbers closest to the target prosody values, restores the speech signals corresponding to those sample numbers using the decompression algorithm that matches the encoding algorithm, concatenates them, and smooths the joins;
D. A speech decoding module; and
E. The compressed speech database;
wherein the text input module is electrically connected in sequence to the text analysis module, the prosody prediction module and the waveform concatenation module; the off-line part is electrically connected in sequence to the compressed speech database, the speech decoding module and the waveform concatenation module; and the output of the waveform concatenation module is electrically connected to the digital speech signal output module, which plays the concatenated digital speech signal.
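The specification states that the joins are smoothed but does not name a smoothing technique. As an illustration only, the sketch below uses a linear cross-fade over a short overlap; the function `crossfade_join` and its `overlap` parameter are assumptions and are not taken from the specification.

```python
from typing import List


def crossfade_join(left: List[float], right: List[float], overlap: int) -> List[float]:
    """Join two waveforms, blending `overlap` samples linearly at the splice."""
    # Clamp the overlap so that it never exceeds either segment.
    overlap = max(0, min(overlap, len(left), len(right)))
    if overlap == 0:
        return left + right
    head = left[: len(left) - overlap]
    tail = right[overlap:]
    blended = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # weight of the right-hand segment
        blended.append((1.0 - w) * left[len(left) - overlap + k] + w * right[k])
    return head + blended + tail
```

The joined waveform is shorter than the plain concatenation by `overlap` samples, so a real system would account for this in the target durations.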
The embedded speech synthesis system built according to the above method can run entirely under the operating system of a hand-held digital mobile device, and neither the resources it occupies nor the computational complexity it requires exceeds the capability of the handheld device itself.
The present invention is further described below with reference to the drawings and embodiments; the detailed description of each component of the system, taken together with the drawings, better illustrates the steps of the invention and how they are carried out.
Embodiment
As shown in Figure 1, in a preferred embodiment of the invention the embedded speech synthesis system is installed in the operating system of a palmtop computer. The system comprises, connected in sequence: the off-line part 1 of the speech synthesis system, the palmtop text input module 2, the on-line part 3 of the speech synthesis system and the digital speech signal output module 4.
The off-line part 1 is used only when the system works in the off-line state, and serves only to generate the compressed speech database b needed when the system works on-line. The original speech database a contains recorded original speech that has undergone energy normalization. The off-line generation of the compressed speech database b from the original speech database a comprises: the initial/final speech database creation step 70, the initial/final speech database quantization and compression step 80, and the speech database encoding/packaging step 90.
In step 70, the HTK speech recognition toolkit is first used to segment the recorded original speech database automatically, so as to obtain the boundary positions of the initial/final speech segments within the original sentences. At the same time a pitch detection tool is used to mark the peak positions of the speech waveform, and the automatically obtained boundary and peak positions are proofread by hand. Then, in the segmented and labelled speech database, each initial or final is further classified according to the pronunciation characteristics of the final or initial adjacent to it within its syllable. Initials are divided into four classes: followed by an open-mouth final (kaikouhu), followed by an even-teeth final (qichihu, that is, a final that is i or begins with i), followed by a closed-mouth final (hekouhu), and followed by a round-mouth final (cuokouhu). Finals are divided into nine classes: preceded by an unaspirated stop, by an aspirated stop, by an unaspirated affricate, by an aspirated affricate, by a voiceless fricative, by a voiced fricative, by a nasal, by a lateral, and zero-initial finals. Chinese has 21 initials and 43 finals in total, which yields 471 context-dependent initial/final units altogether; these classified units serve as the basic units of the speech database. In addition, by syntactic analysis of the original sentence text, the high-level prosodic information of each sample is derived, comprising: the type and ID of the final/initial in the same syllable as the current initial/final; the type and ID of the final of the preceding syllable; the type and ID of the initial of the following syllable; the tone of the syllable containing the unit, of the preceding syllable and of the following syllable; the relative position of each lower prosodic level within the next higher prosodic level (the prosodic levels are prosodic word, prosodic phrase and sentence; the relative position is the head, middle or tail of the level); the lengths, in syllables, of the prosodic word and prosodic phrase containing the syllable; and the lengths of the silent segments before and after the syllable. All of this information is saved in one file, which serves as the information file of the initial or final. The original waveform files and information files of all initials and finals together make up the initial/final speech database.
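The figure of 471 units follows directly from the class counts given above: each of the 21 initials has four context classes and each of the 43 finals has nine. The short check below restates that arithmetic; the constant and function names are illustrative only.

```python
N_INITIALS = 21        # initials in Chinese, as stated in the text
N_FINALS = 43          # finals in Chinese, as stated in the text
INITIAL_CONTEXTS = 4   # classes of following final
FINAL_CONTEXTS = 9     # classes of preceding initial, including zero initial


def count_context_dependent_units(
    n_initials: int = N_INITIALS,
    n_finals: int = N_FINALS,
    initial_contexts: int = INITIAL_CONTEXTS,
    final_contexts: int = FINAL_CONTEXTS,
) -> int:
    """Number of context-dependent units: every unit crossed with its classes."""
    return n_initials * initial_contexts + n_finals * final_contexts


total_units = count_context_dependent_units()  # 21 * 4 + 43 * 9 = 84 + 387 = 471
```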
In step 80, as shown in Figure 2, the quantization and compression of the speech database with initials and finals as primitives is divided into the following six steps:
Step 100, the program creates an empty compressed speech database.
Step 110, all original samples of one initial/final unit are read in from the original speech database at a time.
Step 120, the coarse screening step for initial/final samples, as shown in Figure 3, rejects the distorted samples of the unit that remain in the speech database. Owing to human factors such as the speaker, the recording equipment and the labelling of the speech database, the database contains many samples that are acoustically abnormal. When the speech database is large, the probability that these samples are selected is small and their effect on the synthesized result is minor. When the speech database is small, however, the remaining distorted samples are easily selected for synthesis, which markedly reduces the stability of the synthesized result and also wastes valuable storage space. This embodiment applies the following three screening criteria in turn to pre-screen the speech database automatically and weed out the unstable samples.

Step 200 reads in all samples of one initial/final unit. Step 210 reads in one sample of this unit. Step 220 judges whether the sample read in step 210 satisfies the prosodic abnormality criterion. The prosodic factors considered here are the duration, pitch contour and energy of the sample. The prosodic abnormality (Prosodic Salience) $PS(i)$ of the i-th sample is defined as:

[Equation (1)]

where the sub-abnormalities are:

[Equations (2) to (4)]

Here $d(i)$, $p(i)$ and $e(i)$ are the duration, the mean fundamental frequency and the average energy of the i-th sample, and $\bar{d}$, $\bar{p}$ and $\bar{e}$ are the averages of the corresponding features over all samples of the primitive. The weights $\omega_1$, $\omega_2$ and $\omega_3$ of the sub-abnormalities are determined experimentally. For any sample i and any $x \in \{d, p, e\}$, if

$D_x(i) > T_x$    (5)

or

$PS(i) > T$    (6)

then the sample is deleted, where $T_x$ and $T$ are the thresholds of each prosodic sub-abnormality and of the total prosodic abnormality respectively. This criterion rejects samples whose duration or peak positions are labelled incorrectly, as well as samples whose energy is too weak or too strong because of human factors during recording.

Step 230 judges whether the sample read in step 210 satisfies the context dependency criterion. This criterion examines the degree of coarticulation between the sample and its adjacent units in the original speech stream of the database. For a system based on a small speech database, the loss of voice quality caused by spectral discontinuity at the joins is particularly severe, so rejecting strongly coarticulated samples as far as possible at the database-building stage is a feasible approach. The context dependency (Context Dependency) $CD(i)$ of the i-th sample is defined as:

[Equation (7)]

where the two energy terms are the average energies at the left and right boundaries of the sample, and their weights can be determined from the acoustic features of the unit. In this embodiment, $\omega_1$ is set to 0 for plosives and affricates. Similarly, if $CD(i)$ of sample i exceeds a certain threshold $T$, the sample is rejected.

Step 240 judges whether the sample read in step 210 satisfies the voice-quality distortion criterion. During long recording sessions, fatigue or other psychological factors of the speaker may cause the voice quality of some recorded samples to become abnormal, showing up as breathy voice, whisper or evident emotional colouring. Such samples often occur at the ends of sentences, have low energy, and show poor vowel periodicity. For sample i, the voice-quality distortion (Quality Distortion) is defined as:

[Equation (8)]

where $N_{peak}(i)$ is the number of peak points of the sample, the energy term is its average energy, and $dur(i)$ is its duration. If the voice-quality distortion of sample i exceeds a certain threshold $T$, the sample is rejected.

In step 250, if the sample satisfies all three criteria, it is retained in the compressed speech database. Step 260 determines whether all samples of this initial/final unit have been processed; if not, the procedure returns to step 210 until all samples have been processed; if so, step 270 is carried out. Step 270 determines whether all initial/final units have been processed; if not, the procedure returns to step 200; if so, the coarse screening step 120 ends.
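The prosodic abnormality test of step 220 can be sketched as follows. Only the two rejection conditions (5) and (6) are taken from the text. The forms of the sub-abnormality (relative deviation from the unit mean) and of the total abnormality (a weighted sum of the sub-abnormalities) are assumptions made for illustration, since Equations (1) to (4) are not reproduced here; the function and parameter names are likewise hypothetical.

```python
from statistics import mean
from typing import Dict, List

FEATURES = ("d", "p", "e")  # duration, mean fundamental frequency, mean energy


def prosodic_screen(
    samples: List[Dict[str, float]],
    weights: Dict[str, float],
    sub_thresholds: Dict[str, float],
    total_threshold: float,
) -> List[Dict[str, float]]:
    """Keep the samples of one unit that pass conditions (5) and (6).

    Feature values are assumed to be strictly positive.
    """
    means = {x: mean(s[x] for s in samples) for x in FEATURES}
    kept = []
    for s in samples:
        # ASSUMED form of D_x(i): relative deviation from the unit mean.
        sub = {x: abs(s[x] - means[x]) / means[x] for x in FEATURES}
        # ASSUMED form of PS(i): weighted sum of the sub-abnormalities.
        ps = sum(weights[x] * sub[x] for x in FEATURES)
        if any(sub[x] > sub_thresholds[x] for x in FEATURES):  # condition (5)
            continue
        if ps > total_threshold:                               # condition (6)
            continue
        kept.append(s)
    return kept
```

The context dependency and voice-quality criteria of steps 230 and 240 would follow the same pattern: compute a score per sample and reject the sample when the score exceeds its threshold.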
Step 130, the clustering step for initial/final samples, as shown in Figure 4, further clusters the samples that passed the coarse screening of step 120 according to segmental and suprasegmental features, retains the centroid of each cluster as its representative, and discards the remaining samples. The initials and finals are first pre-classified, and each pre-classified group is then further clustered and compressed on the basis of its own acoustic features, so that the diversity of segmental and suprasegmental features in the compressed speech database is preserved. Step 300 reads in all samples of one initial/final unit. Step 310 pre-classifies the samples on the basis of phonological context attributes. This embodiment uses the CART method from the field of data mining as the classification tool; the decision attributes are based on a description of the context and comprise:
The type and ID of the final/initial in the same syllable as the current initial/final.
The type and ID of the final of the preceding syllable.
The type and ID of the initial of the following syllable.
The tone of the syllable containing the unit, of the preceding syllable and of the following syllable (five types: high level, rising, falling-rising, falling and neutral).
The relative position of each lower prosodic level within the next higher prosodic level. The prosodic levels are prosodic word, prosodic phrase and sentence; the relative position is the head, middle or tail of the level.
The lengths, in syllables, of the prosodic word and prosodic phrase containing the syllable.
The lengths of the silent segments before and after the syllable.
The 12th-order Mel-frequency cepstral coefficients (MFCC) are selected as the feature parameters of the unit, and the Mahalanobis distance is used to compute the distance between units. The distance between units M and N is defined by Eq. (9):

[Equation (9)]

where $P_{ij}(M)$ is the j-th MFCC parameter of the i-th frame and $|M|$ is the number of frames of M. In the actual computation, the MFCCs of the transition segment between the initial and the final inside the syllable are also included in the unit's parameter vector, so that coarticulation between initial and final is modelled better and the classification result is more sensitive to the adjacent initial or final. The CART training tool wagon is used to generate one CART tree for each initial or final, with the number of samples on each leaf node kept between 50 and 100. Step 320 determines whether the current unit is an initial or a final. If it is an initial, step 330 clusters the samples on the leaf nodes of the initial's CART tree using the MFCC parameters of the initial samples; if it is a final, step 340 clusters the samples on the leaf nodes of the final's CART tree using the pitch contours of the final samples. Step 350 determines whether all initial/final units have been processed; if so, the sample clustering step 130 ends; if not, the procedure returns to step 300 to process the remaining units.
Step 140 stores the centroid samples on the CART leaf nodes of all initial/final units in the final compressed speech database and discards all other samples.
Step 150 determines whether all initial/final units have been processed. If not, the procedure returns to step 110 and repeats steps 110, 120, 130, 140 and 150 until all units have been processed; if so, the quantization and compression step 80 ends.
In step 90, the initial/final samples in the quantized and compressed speech database are compressed by a speech compression algorithm into speech segments that occupy less space, and the encoded waveform files and the information files generated in step 70 are organized into a single file. In an embodiment of the invention, the packaging method used in step 90 combines the encoded speech code words into a single file according to a fixed rule, and the index of the compressed speech database is built on the symbols that denote the different initials and finals. In an embodiment of the invention, the speech compression algorithm may be any algorithm that meets the resource constraints of the handheld device (storage space and computational complexity) and satisfies the auditory requirement (user satisfaction), for example a low-bit-rate speech compression algorithm widely used in communication systems such as G.723.1, or another speech coding/decoding algorithm with a high compression ratio and low distortion, provided that its computational complexity and storage requirements allow it to run on the handheld device. Step 90 generates the compressed speech database b, at which point the off-line part 1 of the system finishes its work.
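The specification says only that the code words are combined into one file "according to a fixed rule" and indexed by unit symbol. The sketch below shows one possible layout, chosen purely for illustration: a length-prefixed JSON index followed by the concatenated code words. The layout, the function names and the example symbols are all assumptions.

```python
import json
import struct
from typing import Dict, List


def pack_library(units: Dict[str, List[bytes]]) -> bytes:
    """Pack encoded samples into one blob, indexed by unit symbol."""
    index: Dict[str, List[List[int]]] = {}
    payload = bytearray()
    for symbol, codewords in units.items():
        entries = []
        for codeword in codewords:
            entries.append([len(payload), len(codeword)])  # offset, length
            payload += codeword
        index[symbol] = entries
    header = json.dumps(index).encode("utf-8")
    # 4-byte little-endian header length, then the index, then the payload.
    return struct.pack("<I", len(header)) + header + bytes(payload)


def read_sample(blob: bytes, symbol: str, k: int) -> bytes:
    """Return the k-th encoded sample of the unit named `symbol`."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    index = json.loads(blob[4 : 4 + header_len].decode("utf-8"))
    offset, length = index[symbol][k]
    start = 4 + header_len + offset
    return blob[start : start + length]
```

On a device with little memory, a fixed-width binary index would be more compact than JSON; JSON is used here only to keep the sketch short.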
As shown in Figure 1, the text input module 2 receives the input text. In an embodiment of the invention, the system provides a handwriting input interface: the user may enter the text to be synthesized with the stylus supplied with the palmtop computer, may synthesize a whole file by opening a text file, or may use the stylus to select several lines of the file to be synthesized separately.
The on-line part 3 of the speech synthesis system comprises, connected in sequence, the text analysis module 20, the prosody prediction module 30, the waveform concatenation module 40, the speech decoding module 60 and the compressed speech database b. The text analysis module 20 receives input in text form and, by analyzing the format and content of the input text, converts the input Chinese characters into the corresponding sequence of initials and finals, while attaching a series of related prosodic information to each initial and final. The prosody prediction module 30 receives the initial/final sequence carrying prosodic information, uses a statistical model to predict from that information the corresponding target prosody values, including the duration, pitch contour and average energy of each initial and final, and attaches them to the initial or final. The waveform concatenation module 40 receives the initial/final sequence carrying target prosody values, selects from the compressed speech database, according to the prosodic information carried by the sequence, the sample numbers closest to the target prosody values, restores the speech signals corresponding to those sample numbers using the decompression algorithm that matches the encoding algorithm, concatenates them, and smooths the joins.
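The selection performed by the waveform concatenation module 40 can be pictured as a nearest-neighbour search over the candidates stored for a unit. The sketch below is illustrative only: the cost function (a weighted sum of relative deviations from the target), the feature names and the function `select_sample` are assumptions, since the specification does not define how closeness to the target prosody is measured.

```python
from typing import Dict, List


def select_sample(
    target: Dict[str, float],
    candidates: List[Dict[str, float]],
    weights: Dict[str, float],
) -> int:
    """Return the index of the candidate closest to the target prosody.

    Target values are assumed to be strictly positive.
    """

    def cost(candidate: Dict[str, float]) -> float:
        # ASSUMED cost: weighted sum of relative deviations from the target.
        return sum(
            w * abs(candidate[name] - target[name]) / target[name]
            for name, w in weights.items()
        )

    return min(range(len(candidates)), key=lambda k: cost(candidates[k]))
```

Because only the index is returned, the caller can look up the corresponding code word in the compressed speech database and decode just that one sample.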
The digital speech signal output module 4 plays the concatenated digital speech signal.
The present invention relates to a speech synthesis method and system. The method raises the compression ratio of the speech database of a speech synthesis system on an embedded platform, thereby greatly reducing the system resources it occupies on that platform while keeping the synthesized result reasonably natural and intelligible.
The present invention is used on a palmtop computer, and all speech functions can be enabled or disabled on the handheld device at any time. When the speech functions are not enabled, the other functions of the original handheld device are unaffected.
The above embodiment is a preferred embodiment of the present invention. Application of the invention is not limited to palmtop computers and extends to a variety of hand-held mobile devices. Following the main concept of the invention, those of ordinary skill in the art can produce many similar or equivalent applications; the protection of the invention should therefore be determined by the scope of the claims.