CN1924994B - Embedded speech synthesis method and system - Google Patents

Embedded speech synthesis method and system

Info

Publication number
CN1924994B
CN1924994B CN2005100863115A CN200510086311A
Authority
CN
China
Prior art keywords
sound
sample
synthesis system
initial/final
speech synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2005100863115A
Other languages
Chinese (zh)
Other versions
CN1924994A (en)
Inventor
陶建华
张皖志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Extreme Element Hangzhou Intelligent Technology Co Ltd
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN2005100863115A priority Critical patent/CN1924994B/en
Publication of CN1924994A publication Critical patent/CN1924994A/en
Application granted granted Critical
Publication of CN1924994B publication Critical patent/CN1924994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention discloses an embedded speech synthesis method and system for the operating systems of handheld digital mobile devices, which converts received text strings into speech output. It uses the initials and finals of Chinese syllables as the basic units of the synthesis system. First, an original voice library is built; this library is then compressed according to the contextual attributes of the initial/final samples; finally, the compressed corpus is encoded with a speech compression algorithm to obtain the final compressed voice library.

Description

An embedded speech synthesis method and system
Technical field
The present invention relates generally to a speech synthesis method and speech synthesis system, and more particularly to a speech synthesis method and system for portable handheld digital mobile devices, including mobile phones and PDAs.
Background technology
A speech synthesis system, also called a text-to-speech (TTS) system, converts any input text string received by a computer into speech output. The functional modules of a speech synthesis system are conventionally divided into three main components: a text analysis module, a prosody generation module and an acoustic module. In recent years, synthesis methods based on large-scale corpora have become the mainstream technology in the field. Their basic idea is to select speech units from a large amount of natural speech according to specific rules, splice them together, and adjust and modify their prosodic features so as to obtain satisfactory synthetic speech. To guarantee rich prosodic variation in the synthesized result, the voice library often reaches hundreds of megabytes. Such a method poses no problem for the CPU power, memory and other resources of current desktop PCs, but the storage space and computing capability of current mainstream portable handheld digital mobile devices (including mobile phones and PDAs) fall far short of its practical requirements.
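For orientation, the three-module division can be read as a simple pipeline. A minimal sketch with toy stand-ins; none of these names or behaviors come from the patent:

```python
def text_analysis(text: str) -> list[str]:
    """Toy stand-in: a real module would normalize the text and map it
    to a phonetic unit sequence."""
    return text.split()

def prosody_generation(units: list[str]) -> list[dict]:
    """Attach placeholder prosodic targets (duration, F0, energy)."""
    return [{"unit": u, "dur": 0.2, "f0": 200.0, "energy": 1.0} for u in units]

def acoustic_module(targets: list[dict]) -> list[bytes]:
    """Select and concatenate waveform segments (silence placeholders here)."""
    return [b"\x00" * int(t["dur"] * 16000) for t in targets]

def synthesize(text: str) -> list[bytes]:
    return acoustic_module(prosody_generation(text_analysis(text)))
```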
Existing embedded speech synthesis systems all use the Chinese syllable as the basic unit of synthesis. Starting from an existing general-purpose voice library, they cluster all samples of each syllable according to the acoustic distance between samples, keep only the cluster centroid as the representative of each class, and discard the other samples inside the class, thereby compressing the voice library. A voice library obtained in this way still needs at least 1 MB of storage; if it is compressed further, the number of samples retained per syllable drops sharply, and the naturalness and sound quality of the synthesized speech degrade significantly. For current mainstream handheld devices, a synthesis system of this scale is still an expensive overhead: the resources it demands are out of proportion to its importance on the device, and users find the cost hard to accept. An improved method is therefore needed to realize a speech synthesis system that occupies fewer resources on an embedded platform.
Summary of the invention
The technical problem to be solved by this invention is to provide a Chinese speech synthesis method and system usable on portable handheld digital mobile devices, one that occupies few system resources while keeping the synthesized result reasonably natural and intelligible.
To achieve the above object, the invention provides an embedded speech synthesis method, used in the operating system of a handheld digital mobile device to convert any input text string received by the system into speech output, with the initials and finals of Chinese as the basic units of the synthesis system. The quantization-compression of the voice library is divided into the following three steps (a sketch of the pipeline follows the list):
A. Create the original voice library based on initials and finals.
B. Quantization-compress the original voice library according to the contextual attributes and acoustic features of the initial/final samples.
C. Waveform-encode the quantization-compressed corpus with a speech compression algorithm to obtain the final compressed voice library.
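A sketch of the three-step pipeline above, with each stage passed in as a callable; the stage names are illustrative, not the patent's API:

```python
from typing import Callable

def build_compressed_library(
    raw_units: dict[str, list[bytes]],                  # A: unit -> raw samples
    quantize: Callable[[list[bytes]], list[bytes]],     # B: context/acoustic pruning
    encode: Callable[[bytes], bytes],                   # C: waveform coding
) -> dict[str, list[bytes]]:
    compressed = {}
    for unit, samples in raw_units.items():
        kept = quantize(samples)                        # step B applied per unit
        compressed[unit] = [encode(s) for s in kept]    # step C, e.g. a low bit-rate codec
    return compressed
```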
In the above embedded speech synthesis method, the units of the original voice library are constructed as follows: each initial or final in the library is further classified according to the pronunciation characteristics of the final or initial adjacent to it inside the syllable.
In the above embedded speech synthesis method, the quantization-compression of the voice library built on initial/final primitives is divided into the following six steps (a naive sketch in code follows the list):
A. Create an empty voice library.
B. Read in all original samples of one initial or final from the original voice library at a time.
C. Coarse-filter the samples: reject the distorted samples of this unit that remain in the library because of artificial factors such as the recording speaker, the recording equipment and the library annotation.
D. Cluster the samples: further cluster the coarse-filtered samples according to their segmental and suprasegmental features, keep the centroid of each class as its representative, and discard the remaining samples.
E. Store all centroid samples in the newly created compressed voice library.
F. Judge whether all initial/final units have been processed; if so, the off-line procedure ends; if not, return to step B and repeat steps B, C, D and E until the whole original corpus has been processed.
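The naive, self-contained sketch promised above. The sample layout ('dur', 'f0' fields) is an assumption, and the outlier filter and "clustering" are crude placeholders for the criteria and CART/centroid steps detailed in the embodiment:

```python
import statistics

def quantize_compress(raw_library: dict[str, list[dict]]) -> dict[str, list[dict]]:
    compressed: dict[str, list[dict]] = {}                     # (A) empty library
    for unit, samples in raw_library.items():                  # (B) one unit at a time
        mean_dur = statistics.mean(s["dur"] for s in samples)
        clean = [s for s in samples                            # (C) coarse filter:
                 if abs(s["dur"] - mean_dur) < 0.5 * mean_dur]  #     drop duration outliers
        clean.sort(key=lambda s: s["f0"])
        step = max(1, len(clean) // 4)                         # (D) crude "clustering":
        compressed[unit] = clean[::step]                       # (E) keep a few spread samples
    return compressed                                          # (F) done when all units processed
```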
In the above embedded speech synthesis method, the coarse-filtering of initial/final samples comprises the following three steps:
A. Statistically analyze the average prosodic features of the original samples inside each unit and weed out the samples that deviate too far from the average; the prosodic features considered are the pitch contour, duration and average energy of the sample.
B. Examine the degree of coarticulation between each sample and its neighboring units in the original speech, and weed out the samples whose coarticulation is too strong.
C. Analyze the timbre-distortion degree of each sample and weed out the samples of poor sound quality.
In the above embedded speech synthesis method, the sample clustering comprises the following three steps (see the sketch after this list):
A. Pre-classify the initial/final units: pre-classify the samples according to their contextual attributes, using the classification and regression tree (CART) method to build one CART tree per initial or final.
B. Cluster the finals: cluster the samples on each leaf node of a final's CART tree, using the pitch contour of the final as the clustering feature; keep only the centroid of each class and discard the remaining samples.
C. Cluster the initials: cluster the samples on each leaf node of an initial's CART tree, using the 12th-order Mel-frequency cepstral coefficients (MFCC) of the initial as the clustering feature.
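The sketch referenced above: choosing the representative on one CART leaf, with finals compared by F0 contour and initials by MFCC means. The sample layout ('mfcc' frame matrix, 'f0' contour) is an assumption:

```python
import numpy as np

def leaf_centroid(samples: list[dict], is_initial: bool) -> dict:
    """Return the sample closest to the leaf's mean feature: MFCC mean
    vectors for initials, fixed-length-resampled F0 contours for finals."""
    if is_initial:
        feats = np.stack([np.asarray(s["mfcc"]).mean(axis=0) for s in samples])
    else:
        grid = np.linspace(0.0, 1.0, 20)
        feats = np.stack([np.interp(grid,
                                    np.linspace(0.0, 1.0, len(s["f0"])),
                                    s["f0"]) for s in samples])
    center = feats.mean(axis=0)
    best = int(np.argmin(((feats - center) ** 2).sum(axis=1)))
    return samples[best]   # "centroid" = nearest real sample to the class mean
```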
By adopting initials and finals as the primitive units, the above method greatly raises the compression ratio of the system: it reduces the acoustic redundancy in the voice library as far as possible and thereby achieves highly efficient compression while preserving the naturalness and intelligibility of the synthesized result. At an equal voice-library size, this method performs almost identically to syllable-based synthesis.
To better achieve the above object, the invention also provides an embedded speech synthesis system applied to the operating system of a handheld digital mobile device. It consists of a speech-synthesis off-line part, a text input module, a speech-synthesis on-line part and a digital speech signal output module; the output terminals of the off-line part and of the text input module are electrically connected to the on-line part, and the output terminal of the on-line part is electrically connected to the input terminal of the digital speech signal output module.
In the described embedded speech synthesis system, the off-line part is used only when the system works off-line, and serves only to generate the compressed voice library that the system uses when working on-line. The off-line part includes the original voice library, which contains the recorded, energy-normalized original speech.
In the described embedded speech synthesis system, the on-line part comprises the following modules:
A. A text analysis module, which analyzes the format and content of the input text, converts it into an initial/final sequence, and attaches a series of relevant prosodic information to each initial and final;
B. A prosody prediction module, which receives the initial/final sequence with attached prosodic information, uses a statistical model to predict from that information the corresponding prosodic targets, including each unit's duration, pitch contour and average energy, and attaches them to the unit;
C. A waveform concatenation module, which receives the initial/final sequence with attached prosodic targets, selects from the compressed voice library the sample numbers whose prosody best matches the targets carried by the sequence, restores the speech signals corresponding to those sample numbers with the decompression algorithm matching the encoding algorithm, splices them together, and smooths the joins;
D. A speech decoding module; and
E. The compressed voice library;
wherein the text input module is electrically connected in sequence with the text analysis module, the prosody prediction module and the waveform concatenation module; the off-line part is electrically connected in sequence with the compressed voice library, the speech decoding module and the waveform concatenation module; and the output terminal of the waveform concatenation module is electrically connected to the digital speech signal output module, which plays the spliced digital speech signal.
An embedded speech synthesis system built according to the above method can run entirely under the operating system of a handheld digital mobile device: neither the resources it occupies nor the computational complexity it requires exceeds what such handheld devices themselves provide.
The invention is described further below in conjunction with the drawings and embodiments; the detailed description of each component of the system will explain the steps of the invention and how they are realized.
Description of drawings
Figure 1 is a structural diagram of the initial/final-based embedded speech synthesis system;
Figure 2 is a diagram of the quantization-compression process for the initial/final voice library;
Figure 3 is a diagram of the coarse-filtering process for initial/final samples;
Figure 4 is a diagram of the clustering process for initial/final samples.
Embodiment
As shown in Figure 1, in a preferred embodiment of the invention the embedded speech synthesis system is installed in the operating system of a PDA and comprises, connected in turn: the speech-synthesis off-line part 1, the PDA text input module 2, the speech-synthesis on-line part 3 and the digital speech signal output module 4.
The off-line part 1 is used only when the speech synthesis system works off-line, solely to generate the compressed voice library b needed when the system works on-line. The original voice library a contains the recorded, energy-normalized original speech, and the off-line generation of the compressed voice library b from the original voice library a comprises: the initial/final library construction step 70, the initial/final library quantization-compression step 80 and the library encoding/packaging step 90.
In step 70, the HTK speech recognition toolkit is first used to segment the recorded original voice library automatically, obtaining the boundary positions of the initial/final segments within the original sentences; at the same time, a pitch detection tool marks the pitch-peak positions of the speech waveform, and the automatically obtained boundaries and peak positions are then proofread by hand. Next, in the segmented and labeled library, each initial or final is further classified according to the pronunciation characteristics of the adjacent final or initial inside its syllable. Initials are divided into four classes: followed by an open-mouth final, by an even-teeth final (beginning with i), by a closed-mouth final (beginning with u), or by a round-mouth final (beginning with ü). Finals are divided into nine classes: preceded by an unaspirated stop, an aspirated stop, an unaspirated affricate, an aspirated affricate, a voiceless fricative, a voiced fricative, a nasal, a lateral, or a zero initial. Mandarin has 21 initials and 43 finals in all, yielding 471 context-dependent initial/final units, and these classified units serve as the basic units of the voice library. At the same time, syntactic analysis of the original sentence texts yields the high-level prosodic information of each sample, including: the type and ID of the final/initial sharing the syllable with the current initial/final; the final type and ID of the preceding syllable; the initial type and ID of the following syllable; the tones of the host syllable and of the preceding and following syllables; the relative position of each lower prosodic level within the next higher level (the levels being prosodic word, prosodic phrase and sentence, and the positions head, middle and tail); the lengths, in syllables, of the prosodic word and prosodic phrase containing the host syllable; and the lengths of the silences before and after the host syllable. All of this information is saved in a file that serves as the unit's information file, and the original waveform files and information files of all initials and finals together form the initial/final voice library.
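The information file just described can be pictured as one record per sample. A minimal sketch; the field names and types are illustrative assumptions, not the patent's file format:

```python
from dataclasses import dataclass

@dataclass
class UnitInfo:
    """One per-sample context record mirroring the information file above."""
    unit_id: str          # which of the ~471 context-dependent units
    sibling: str          # type/ID of the final (for an initial) or initial (for a final) in the same syllable
    prev_final: str       # final type/ID of the preceding syllable
    next_initial: str     # initial type/ID of the following syllable
    tone: int             # tone of the host syllable (1-4, 0 = neutral)
    prev_tone: int        # tone of the preceding syllable
    next_tone: int        # tone of the following syllable
    pos_in_word: str      # 'head' / 'mid' / 'tail' within the prosodic word
    pos_in_phrase: str    # position within the prosodic phrase
    pos_in_sentence: str  # position within the sentence
    word_len: int         # prosodic-word length in syllables
    phrase_len: int       # prosodic-phrase length in syllables
    sil_before: float     # silence before the host syllable, in seconds
    sil_after: float      # silence after the host syllable, in seconds
```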
In step 80, as shown in Figure 2, the quantization-compression of the initial/final-based voice library is divided into the following six steps.
Step 100: the program creates an empty compressed voice library.
Step 110: all original samples of one initial or final are read in from the original voice library at a time.
Step 120 is the coarse-filtering step for initial/final samples, shown in Figure 3, which rejects the distorted samples of the current unit that remain in the library. Owing to artificial factors such as the recording speaker, the recording equipment and the library annotation, the library contains many samples whose acoustic features are abnormal. When the library is large, the probability of these sounds being selected is small and their effect on the synthesized result is minor; when the library is small, however, the remaining distorted samples are easily selected for synthesis, which markedly reduces the stability of the result and also wastes valuable storage space. This embodiment applies the following three filtering criteria in turn to pre-screen the library automatically and weed out these unstable samples. Step 200 reads in all samples of a given initial/final unit. Step 210 reads in one sample of the unit. Step 220 judges whether the sample read in step 210 satisfies the prosodic-abnormality criterion; the prosodic factors considered here are the sample's duration, pitch contour and energy. The prosodic salience (ProsodicSalience) of the $i$-th sample is defined as:
$$PS(i) = \frac{\omega_1 D_d(i) + \omega_2 D_p(i) + \omega_3 D_e(i)}{\omega_1 + \omega_2 + \omega_3} \qquad (1)$$
where the sub-abnormality degrees are:

$$D_d(i) = \left( \frac{d(i) - \bar{d}}{\bar{d}} \right)^2 \qquad (2)$$

$$D_p(i) = \left( \frac{p(i) - \bar{p}}{\bar{p}} \right)^2 \qquad (3)$$

$$D_e(i) = \left( \frac{e(i) - \bar{e}}{\bar{e}} \right)^2 \qquad (4)$$
Here $d(i)$, $p(i)$ and $e(i)$ are the duration, mean F0 and mean energy of the $i$-th sample, and $\bar{d}$, $\bar{p}$ and $\bar{e}$ are the means of the corresponding features over all samples of the unit. The weights $\omega_1$, $\omega_2$ and $\omega_3$ of the sub-abnormality degrees are determined experimentally. For any sample $i$ and any $x \in \{d, p, e\}$, if

$$D_x(i) > T_x \qquad (5)$$

or

$$PS(i) > T \qquad (6)$$
then the sample is deleted, where $T_x$ and $T$ are the thresholds on the individual sub-abnormality degrees and on the total prosodic abnormality, respectively. This criterion rejects samples whose duration or peak positions were mislabeled, as well as samples whose energy is too weak or too strong because of human factors during recording. Step 230 judges whether the sample read in step 210 satisfies the coarticulation-degree criterion, which examines how strongly the sample coarticulates with its neighboring units in the original speech. For a system based on a small voice library, the quality loss caused by spectral discontinuity at the joins is especially serious, so rejecting strongly coarticulated sounds as far as possible at library-construction time is a feasible remedy. The context dependency (Context Dependency) of the $i$-th sample is defined as:
$$CD(i) = \frac{\omega_l \bar{e}_l(i) + \omega_r \bar{e}_r(i)}{\omega_l + \omega_r} \qquad (7)$$

where $\bar{e}_l(i)$ and $\bar{e}_r(i)$ are the average energies at the left and right boundaries of the sample, and the weights are chosen according to the acoustic features of the unit; in this embodiment $\omega_l$ is set to 0 for plosives and affricates. As before, if $CD(i)$ of sample $i$ exceeds a certain threshold $T$, the sample is rejected. Step 240 judges whether the sample read in step 210 satisfies the timbre-distortion criterion. During long recording sessions, fatigue or other psychological factors may make some recorded samples abnormal, exhibiting breathiness, whisper or an obvious admixture of emotion; such sounds often appear at sentence endings, with weak energy and poor vowel periodicity. The timbre distortion (QualityDistortion) of sample $i$ is defined as:
$$QD(i) = \frac{n_{peak}(i)}{\bar{e}(i) \cdot dur(i)} \qquad (8)$$

where $n_{peak}(i)$ is the number of pitch-peak points in the sample, $\bar{e}(i)$ is its average energy and $dur(i)$ its duration. If $QD(i)$ of sample $i$ exceeds a certain threshold $T$, the sample is rejected. In step 250, a sample that satisfies all three criteria is retained in the compressed voice library. Step 260 judges whether all samples of this unit have been processed; if not, control returns to step 210 until all samples are done; if so, step 270 executes. Step 270 judges whether all initial/final units have been processed; if not, control returns to step 200; if so, the coarse-filtering step 120 ends. (A sketch of the three criteria in code follows.)
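The three criteria can be collected into one small filter. The sketch below implements Eqs. (1)-(8) directly; the sample layout (dur, f0, energy, e_left, e_right, n_peaks) and the threshold names are illustrative assumptions, and the thresholds themselves would be tuned experimentally as the text describes:

```python
def prosodic_salience(d, p, e, means, w=(1.0, 1.0, 1.0)):
    """Eqs. (1)-(4): weighted normalized squared deviation of duration d,
    mean F0 p and mean energy e from the unit-wide means."""
    dbar, pbar, ebar = means
    Dd = ((d - dbar) / dbar) ** 2
    Dp = ((p - pbar) / pbar) ** 2
    De = ((e - ebar) / ebar) ** 2
    return (w[0] * Dd + w[1] * Dp + w[2] * De) / sum(w), (Dd, Dp, De)

def context_dependency(e_left, e_right, wl=1.0, wr=1.0):
    """Eq. (7): weighted boundary energies; the text sets wl = 0 for
    plosives and affricates."""
    return (wl * e_left + wr * e_right) / (wl + wr)

def quality_distortion(n_peaks, mean_energy, duration):
    """Eq. (8): weak-energy tokens with poor periodicity get a large QD
    and are rejected above threshold."""
    return n_peaks / (mean_energy * duration)

def keep_sample(s, means, th):
    """Criteria (5)-(6) plus the CD and QD thresholds; th is a hypothetical
    dict {'Td', 'Tp', 'Te', 'T', 'Tcd', 'Tqd'} of experimentally tuned values."""
    ps, (Dd, Dp, De) = prosodic_salience(s["dur"], s["f0"], s["energy"], means)
    if Dd > th["Td"] or Dp > th["Tp"] or De > th["Te"] or ps > th["T"]:
        return False                                   # prosodic-abnormality reject
    if context_dependency(s["e_left"], s["e_right"]) > th["Tcd"]:
        return False                                   # coarticulation reject
    qd = quality_distortion(s["n_peaks"], s["energy"], s["dur"])
    return qd <= th["Tqd"]                             # timbre-distortion check
```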
Step 130 is the clustering step for initial/final samples, shown in Figure 4: the samples coarse-filtered in step 120 are further clustered by their segmental and suprasegmental features, the centroid of each class is kept as its representative, and the remaining samples are discarded. The units are first pre-classified; the pre-classified samples are then clustered and compressed further by their respective acoustic features, which preserves the diversity of segmental and suprasegmental features in the compressed library. Step 300 reads in all samples of a given unit. Step 310 pre-classifies the samples by their phonological context, adopting the CART method from the data-mining field as the classification tool; the decision attributes, built on the contextual description, comprise:
- the type and ID of the final/initial sharing the syllable with the current initial/final;
- the final type and ID of the preceding syllable;
- the initial type and ID of the following syllable;
- the tone of the host syllable, of the preceding syllable and of the following syllable (five kinds: high-level, rising, falling-rising, falling, neutral);
- the relative position of each lower prosodic level within the next higher level, the levels being prosodic word, prosodic phrase and sentence, and the positions head, middle and tail;
- the length, in syllables, of the prosodic word and of the prosodic phrase containing the host syllable;
- the lengths of the silences before and after the host syllable.
The characteristic parameters chosen for each unit are the 12th-order Mel-frequency cepstral coefficients (MFCC), and the Mahalanobis distance is chosen for computing inter-unit distances. The distance between units M and N is defined in Eq. (9):
$$dis(M, N) = \sum_{i=1}^{|M|} \sum_{j=1}^{12} \left[ P_{ij}(M) - P_{\left( i \cdot |N| / |M| \right) j}(N) \right]^2 \qquad (9)$$
where $P_{ij}(M)$ is the $j$-th MFCC parameter of the $i$-th frame and $|M|$ is the number of frames of M (see the code sketch below). In the actual computation, the MFCC of the transition segment between the initial and final inside a syllable is also included in the unit's parameter vector, the aim being to model the coarticulation between units better and to make the classification result more sensitive to a unit's neighbors. The CART training tool wagon is used to build one CART tree for each initial and final, with the number of samples per leaf node controlled between 50 and 100. Step 320 judges whether the current unit is an initial or a final: for an initial, step 330 clusters the samples on each leaf node of its CART tree using the samples' MFCC parameters; for a final, step 340 clusters the samples on each leaf node of its CART tree using the samples' pitch contours. Step 350 judges whether all units have been processed; if so, the clustering step 130 ends; if not, control returns to step 300 to handle the remaining units.
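A direct reading of Eq. (9), assuming each unit is a NumPy array of 12-dimensional MFCC frames; the proportional frame mapping $i \cdot |N|/|M|$ is reconstructed from the summation bounds:

```python
import numpy as np

def unit_distance(M: np.ndarray, N: np.ndarray) -> float:
    """Eq. (9): squared distance between two units' MFCC tracks
    M (|M| x 12) and N (|N| x 12); frame i of M is compared with the
    proportionally mapped frame of N."""
    lm, ln = len(M), len(N)
    total = 0.0
    for i in range(lm):
        j = min(int(i * ln / lm), ln - 1)   # proportional mapping i*|N|/|M|
        total += float(((M[i] - N[j]) ** 2).sum())
    return total
```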
Step 140 keeps the centroid samples on the CART leaf nodes of all units in the final compressed voice library and discards all other samples.
Step 150 judges whether all initial/final units have been processed; if not, control returns to step 110, and steps 110, 120, 130, 140 and 150 repeat until all units are done; if so, the quantization-compression step 80 ends.
In step 90, the samples in the quantization-compressed library are further compressed into smaller speech segments by a speech compression algorithm, and the encoded waveform files are organized together with the information files from step 70 into a single file. In this embodiment, the packaging method combines the encoded speech codewords into one file according to a fixed rule, and the index of the compressed library is built on the symbols representing the different initials and finals. The speech compression algorithm can be any algorithm whose resource demands (storage space and computational complexity) suit the handheld device and whose output meets the listening requirement (user satisfaction): for example, a low bit-rate algorithm widely adopted in communication systems such as G.723.1, or any other speech codec with high compression ratio and low distortion whose computational complexity and memory requirements allow it to run on the device. Step 90 thus generates the compressed voice library b, and the work of the off-line part (module 1) ends here.
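Step 90 fixes only two requirements: the codewords end up in one file, and the file is indexed by the symbols of the initials and finals. A minimal packaging sketch with an invented binary layout (the patent does not specify one):

```python
import struct

def pack_library(units: dict[str, list[bytes]], path: str) -> None:
    """Concatenate encoded codewords into one file preceded by an index
    of (unit symbol, offset, length) entries."""
    index: list[tuple[str, int, int]] = []
    body = bytearray()
    for unit, codewords in units.items():
        for cw in codewords:
            index.append((unit, len(body), len(cw)))
            body.extend(cw)
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(index)))          # number of index entries
        for unit, off, ln in index:
            name = unit.encode("utf-8")
            f.write(struct.pack("<B", len(name)) + name)
            f.write(struct.pack("<II", off, ln))        # where the codeword lives
        f.write(body)                                   # all codewords, back to back
```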
As shown in Figure 1, the text input module 2 receives the input text. In this embodiment the system provides a handwriting interface: the user may enter the text to be synthesized with the PDA's stylus, may open a text file and have the whole file synthesized, or may select a few lines of the file with the stylus and have them synthesized separately.
The on-line part 3 comprises, connected in turn, the text analysis module 20, the prosody prediction module 30, the waveform concatenation module 40, the speech decoding module 60 and the compressed voice library b. The text analysis module 20 receives the text input and, by analyzing the format and content of the input text, converts the input Chinese characters into the corresponding initial/final sequence, attaching a series of relevant prosodic information to each initial and final. The prosody prediction module 30 receives the initial/final sequence with its attached prosodic information and uses a statistical model to predict the corresponding prosodic targets, including each unit's duration, pitch contour and average energy, which it attaches to the unit. The waveform concatenation module 40 receives the sequence with its attached prosodic targets, selects from the compressed voice library the sample numbers whose prosody best matches the targets carried by the sequence, restores the corresponding speech signals with the decompression algorithm matching the encoding algorithm, splices them together, and smooths the joins.
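The on-line path of modules 20, 30, 40 and 60 reduces to a short loop. A sketch under the assumption that the modules are passed in as callables and that each library candidate carries 'dur', 'f0', 'energy' and 'code' fields (an invented layout):

```python
def synthesize_online(text, analyze, predict_prosody, library, decode, smooth):
    """Text analysis (20) -> prosody prediction (30) -> nearest-sample
    selection in the compressed library -> decoding (60) -> concatenation
    with smoothing at the joins (40)."""
    targets = predict_prosody(analyze(text))
    waves = []
    for t in targets:
        candidates = library[t["unit"]]
        best = min(candidates,
                   key=lambda c: (c["dur"] - t["dur"]) ** 2
                               + (c["f0"] - t["f0"]) ** 2
                               + (c["energy"] - t["energy"]) ** 2)
        waves.append(decode(best["code"]))
    return smooth(waves)
```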
The digital speech signal output module 4 plays the spliced digital speech signal.
The present invention relates to a speech synthesis method and system. The method raises the compression ratio of the voice library of a synthesis system on an embedded platform, greatly reducing the system resources the system occupies there while keeping the synthesized result reasonably natural and intelligible.
When the invention is used on a PDA, all speech functions can be enabled or disabled on the handheld device at any time; when the speech functions are not enabled, the original functions of the device are unaffected.
The above embodiment is a preferred embodiment of the invention, and the application of the invention is not limited to PDAs: it can also be applied to many kinds of handheld mobile devices. Following the main design of the invention, those of ordinary skill in the art can derive many similar or equivalent applications; the protection of the invention is therefore defined by the scope of the claims.

Claims (7)

1. An embedded speech synthesis method for the operating system of a handheld digital mobile device, converting any text string received by or input to the system into speech output, characterized in that: the initials and finals of Chinese serve as the basic units of the synthesis system and of the voice library, and the quantization-compression of the voice library is divided into the following three steps:
A. create the original voice library based on initials and finals;
B. quantization-compress the original voice library according to the contextual attributes and acoustic features of the initial/final samples;
C. waveform-encode the quantization-compressed library with a speech compression algorithm to obtain the final quantization-compressed voice library.
2. The embedded speech synthesis method of claim 1, characterized in that in step A the units of the original voice library are constructed as follows: each initial or final in the library is further classified according to the pronunciation characteristics of the final or initial adjacent to it inside the syllable.
3. The embedded speech synthesis method of claim 1, characterized in that in step B the quantization-compression of the voice library built on initial/final primitives is divided into the following six steps:
A. create an empty voice library;
B. read in all original samples of one initial or final from the original voice library at a time;
C. coarse-filter the samples, rejecting the distorted samples of this unit that remain in the library because of artificial factors such as the recording speaker, the recording equipment and the library annotation;
D. cluster the samples, further clustering the coarse-filtered samples according to their segmental and suprasegmental features, keeping the centroid of each class as its representative and discarding the remaining samples;
E. store all centroid samples in the newly created compressed voice library;
F. judge whether all initial/final units have been processed; if so, the quantization-compression of the voice library ends; if not, return to step B and repeat steps B, C, D and E until the whole original voice library has been processed.
4. The embedded speech synthesis method of claim 3, characterized in that in step C the coarse-filtering of initial/final samples comprises the following three steps:
A. statistically analyze the average prosodic features of the original samples inside each unit and weed out the samples that deviate too far from the average, the prosodic features considered being the pitch contour, duration and average energy of the sample;
B. examine the degree of coarticulation between each sample and its neighboring units in the original speech and weed out the samples whose coarticulation is too strong;
C. analyze the timbre-distortion degree of each sample and weed out the samples of poor sound quality.
5. The embedded speech synthesis method of claim 3, characterized in that in step D the sample clustering comprises the following three steps:
A. pre-classify the initial/final units: pre-classify the samples according to their contextual attributes, using the classification and regression tree (CART) method to build one CART tree per initial or final;
B. cluster the finals: cluster the samples on each leaf node of a final's CART tree, using the pitch contour of the final as the clustering feature; keep only the centroid of each class and discard the remaining samples;
C. cluster the initials: cluster the samples on each leaf node of an initial's CART tree, using the 12th-order Mel-frequency cepstral coefficients (MFCC) of the initial as the clustering feature.
6. An embedded speech synthesis system using the embedded speech synthesis method of claim 1 and usable under the operating system of a handheld digital mobile device, characterized in that:
it consists of a speech-synthesis off-line part, a text input module, a speech-synthesis on-line part and a digital speech signal output module, wherein the output terminals of the off-line part and of the text input module are electrically connected to the on-line part, and the output terminal of the on-line part is electrically connected to the input terminal of the digital speech signal output module;
in the described embedded speech synthesis system, the off-line part is used only when the system works off-line, solely to generate the compressed voice library used when the system works on-line; the off-line part includes the original voice library, which contains the recorded, energy-normalized original speech.
7. The embedded speech synthesis system of claim 6, characterized in that the on-line part comprises the following modules:
A. a text analysis module, which analyzes the format and content of the input text, converts it into an initial/final sequence, and attaches a series of relevant prosodic information to each initial and final;
B. a prosody prediction module, which receives the initial/final sequence with attached prosodic information, uses a statistical model to predict from that information the corresponding prosodic targets, including each unit's duration, pitch contour and average energy, and attaches them to the unit;
C. a waveform concatenation module, which receives the initial/final sequence with attached prosodic targets, selects from the compressed voice library the sample numbers whose prosody best matches the targets carried by the sequence, restores the speech signals corresponding to those sample numbers with the decompression algorithm matching the encoding algorithm, splices them together, and smooths the joins;
D. a speech decoding module; and
E. the compressed voice library;
wherein the text input module is electrically connected in sequence with the text analysis module, the prosody prediction module and the waveform concatenation module; the off-line part is electrically connected in sequence with the compressed voice library, the speech decoding module and the waveform concatenation module; and the output terminal of the waveform concatenation module is electrically connected to the digital speech signal output module, which plays the spliced digital speech signal.
CN2005100863115A 2005-08-31 2005-08-31 Embedded speech synthesis method and system Active CN1924994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2005100863115A CN1924994B (en) 2005-08-31 2005-08-31 Embedded speech synthesis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2005100863115A CN1924994B (en) 2005-08-31 2005-08-31 Embedded speech synthesis method and system

Publications (2)

Publication Number Publication Date
CN1924994A CN1924994A (en) 2007-03-07
CN1924994B true CN1924994B (en) 2010-11-03

Family

ID=37817603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2005100863115A Active CN1924994B (en) 2005-08-31 2005-08-31 Embedded speech synthesis method and system

Country Status (1)

Country Link
CN (1) CN1924994B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063897B (en) * 2010-12-09 2013-07-03 Voice library compression for an embedded speech synthesis system and method of using the same
CN103077704A (en) * 2010-12-09 2013-05-01 北京宇音天下科技有限公司 Voice library compression and use method for embedded voice synthesis system
CN102201232A (en) * 2011-06-01 2011-09-28 北京宇音天下科技有限公司 Voice database structure compression used for embedded voice synthesis system and use method thereof
CN102779508B (en) * 2012-03-31 2016-11-09 Voice library generation apparatus and method, and speech synthesis system and method
CN104575506A (en) * 2014-08-06 2015-04-29 闻冰 Speech coding method based on phonetic transcription
CN104916281B (en) * 2015-06-12 2018-09-21 Large-corpus voice library pruning method and system
CN105551481B (en) * 2015-12-21 2019-05-31 Prosodic labeling method and device for voice data
CN107578772A (en) * 2017-08-17 2018-01-12 Pronunciation evaluation method and system fusing acoustic features and articulatory movement features
CN108831437B (en) * 2018-06-15 2020-09-01 百度在线网络技术(北京)有限公司 Singing voice generation method, singing voice generation device, terminal and storage medium
CN108899009B (en) * 2018-08-17 2020-07-03 百卓网络科技有限公司 Chinese speech synthesis system based on phoneme

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1089045A * 1992-12-30 1994-07-06 Computer speech monitoring and proofreading system for Chinese-character text

Also Published As

Publication number Publication date
CN1924994A (en) 2007-03-07

Similar Documents

Publication Publication Date Title
CN1924994B (en) Embedded speech synthesis method and system
JP5768093B2 (en) Speech processing system
Lu et al. Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis
Abushariah et al. Natural speaker-independent Arabic speech recognition system based on Hidden Markov Models using Sphinx tools
CN107154260B (en) Domain-adaptive speech recognition method and device
Nwe et al. Speech based emotion classification
CN102779508B (en) Voice library generation apparatus and method, and speech synthesis system and method
Casale et al. Speech emotion classification using machine learning algorithms
CN106531150B (en) Emotion synthesis method based on deep neural network model
CN101064103B (en) Chinese speech synthesis method and system based on prosodic constraint relations between syllables
CN101490740B (en) Audio combining device
JPH06175696A (en) Device and method for coding speech and device and method for recognizing speech
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN113724718B (en) Target audio output method, device and system
CN106295717A (en) A western musical instrument classification method based on sparse representation and machine learning
CN102810311A (en) Speaker estimation method and speaker estimation equipment
Shahin et al. Talking condition recognition in stressful and emotional talking environments based on CSPHMM2s
CN106297766B (en) Speech synthesis method and system
CN102063897B (en) Voice library compression for an embedded speech synthesis system and method of using the same
US9484045B2 (en) System and method for automatic prediction of speech suitability for statistical modeling
CN117524259A (en) Audio processing method and system
Bharti et al. Automated speech to sign language conversion using Google API and NLP
Shaikh Nilofer et al. Automatic emotion recognition from speech signals: A Review
Ong et al. Malay language speech recogniser with hybrid hidden markov model and artificial neural network (HMM/ANN)
CN114120973B (en) Training method for voice corpus generation system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20170503

Address after: 100094 No. 405-346, 4th Floor, Building A, No. 1, Courtyard 2, Yongcheng North Road, Haidian District, Beijing

Patentee after: Beijing Rui Heng Heng Xun Technology Co., Ltd.

Address before: 100080 No. 95 Zhongguancun East Road, Haidian District, Beijing

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right

Effective date of registration: 20181217

Address after: 100190 Zhongguancun East Road, Haidian District, Beijing

Patentee after: Institute of Automation, Chinese Academy of Sciences

Address before: 100094 No. 405-346, 4th floor, Building A, No. 1, Courtyard 2, Yongcheng North Road, Haidian District, Beijing

Patentee before: Beijing Rui Heng Heng Xun Technology Co., Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190527

Address after: 310019 Room 1105, 11/F, Building 4, No. 9 Jiuhuan Road, Jianggan District, Hangzhou, Zhejiang

Patentee after: Limit element (Hangzhou) intelligent Polytron Technologies Inc

Address before: 100190 Zhongguancun East Road, Haidian District, Beijing

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right
CP01 Change in the name or title of a patent holder

Address after: 310019 Room 1105, 11/F, Building 4, No. 9 Jiuhuan Road, Jianggan District, Hangzhou, Zhejiang

Patentee after: Zhongke extreme element (Hangzhou) Intelligent Technology Co., Ltd

Address before: 310019 Room 1105, 11/F, Building 4, No. 9 Jiuhuan Road, Jianggan District, Hangzhou, Zhejiang

Patentee before: Limit element (Hangzhou) intelligent Polytron Technologies Inc.

CP01 Change in the name or title of a patent holder