Summary of the invention
In view of the deficiencies of the prior art, an object of the present invention is to provide a method and apparatus for compressing the speech database of a Chinese text-to-speech (TTS) conversion system to the order of a few megabytes.
A further object of the present invention is to provide an efficient method and apparatus for compressing syllable waveforms, reducing both the amount of data stored and the distortion of the syllable waveform.
According to the present invention, a method of compressing the speech database of a Chinese text-to-speech conversion system comprises: collecting a plurality of syllables from a pronunciation database and dividing the syllables into a plurality of syllable groups, wherein the syllables in each group have the same pinyin; dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of the prosodic features of all syllables in the group, so that syllables with similar characteristics fall into the same sub-group and these similarities can be exploited in further processing of the syllables in the sub-group; selecting, in each sub-group, a representative syllable to represent the other, ordinary syllables in that sub-group; compressing and storing the selected representative syllables to form a compressed speech database; and, in each sub-group, calculating the prosodic feature differences between each ordinary syllable and the representative syllable and storing these differences in the compressed speech database. By placing syllables with similar prosodic characteristics in one group and substituting a single representative syllable for the several similar syllables, the method effectively reduces the amount of data to be stored and thus saves storage space.
In the method of compressing the speech database according to the present invention, the prosodic features comprise: the lexical tone of the syllable, its pitch contour, duration, energy (RMS amplitude), and phonetic/co-articulatory environment parameters. The sub-group division methods comprise phonetic clustering and hybrid acoustic/phonetic clustering, wherein phonetic clustering is based on the similarity of the lexical tone and the phonetic/co-articulatory environment parameters of the syllables, and hybrid acoustic/phonetic clustering is performed by weighting the prosodic features of the syllables. The prosodic feature differences between syllables are calculated and analyzed, and the sub-groups are re-divided so as to reduce the number of representative syllables. In this step, limiting the number of sub-groups effectively reduces the number of representative syllables that must be stored. For an ordinary syllable, only the representative syllable that stands for it and the prosodic feature differences between the ordinary syllable and that representative are stored; the original ordinary syllable can then be recovered at synthesis time, so that the natural-sounding character of the synthesized speech is preserved with a small amount of stored data.
The method of compressing the speech database according to the present invention also comprises: performing a listening evaluation of the representative syllable; if the representative syllable is unsatisfactory, reselecting the representative syllable and/or re-dividing the syllable sub-groups; otherwise, storing the representative syllable. Where a poor representative syllable might otherwise be selected, the listening evaluation provides an effective compensating measure.
The method of compressing the speech database according to the present invention further comprises: dividing each representative syllable into an unvoiced portion and a voiced portion, based on the characteristics of the syllable waveform: the unvoiced portion of a syllable generally lies at the beginning of the waveform (this part is called the initial), while the voiced portion lies at the rear of the waveform (this part is called the final); storing the unvoiced portion directly as a waveform, since the unvoiced portion contains little data and direct storage preserves fidelity; and compressing the voiced portion with a parametric analyzer, since the voiced portion has a large amplitude, a long duration and a large amount of data, so that parametric compression effectively reduces the amount of data stored while preserving the quality of the speech synthesized from the parameters. The prior art compresses the entire waveform of a representative syllable directly with a parametric analyzer, and the unvoiced portion is prone to distortion when the compression ratio is high. The present invention divides the waveform of the representative syllable into a voiced portion and an unvoiced portion and processes them separately; at a comparable compression ratio, the quality of the syllable waveform resynthesized by this method, and of the unvoiced portion in particular, is markedly better than in the prior art.
The method of compressing the speech database according to the present invention still further comprises: resynthesizing the voiced portion and the representative syllable, and performing a listening evaluation of the resynthesized representative syllable; if the resynthesized representative syllable is unsatisfactory, revising the codebook used for the voiced portion; otherwise, storing the hybrid-compressed representative syllable. Converting a syllable waveform into parameters for storage and resynthesizing it when needed inevitably introduces some distortion. The present invention uses the listening evaluation as a compensating measure to reduce unnecessary distortion and obtain natural-sounding speech.
The present invention also provides an apparatus for compressing the speech database of a Chinese text-to-speech conversion system, comprising: a device for collecting a plurality of syllables from a pronunciation database; a grouping device for dividing the syllables into a plurality of syllable groups, wherein the syllables in each group have the same pinyin; a sub-group division device for dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of the prosodic features of all syllables in the group; a representative-syllable selection device for selecting, in each sub-group, a representative syllable to represent the other, ordinary syllables in that sub-group; a storage device for storing the selected representative syllables to form a compressed speech database; and a prosodic feature difference processing device for calculating, in each sub-group, the prosodic feature differences between each ordinary syllable and the representative syllable, and storing these differences in the compressed speech database.
The speech database compression apparatus of the present invention also comprises a weighting calculation device for weighting the prosodic features of the syllables. The weighting function is W = {Wt, Wp, Wd, We, Wy}, where Wt is the weight for the lexical tone of the syllable, Wp the weight for the pitch contour, Wd the weight for the duration, We the weight for the RMS amplitude, and Wy the weight for the phonetic/co-articulatory environment parameters. By calculating and analyzing the prosodic feature differences between syllables, this device re-divides the sub-groups so as to reduce the number of representative syllables.
The speech database compression apparatus of the present invention also comprises: a pronunciation division device for dividing each representative syllable into an unvoiced portion and a voiced portion; a waveform processing device for storing the unvoiced portion directly as a waveform and storing the voiced portion after compression with a parametric analyzer; a listening evaluation device for evaluating the representative syllable by listening and, if the representative syllable is unsatisfactory, using the representative-syllable selection device to reselect the representative syllable and/or using the sub-group division device to re-divide the sub-groups, and otherwise using the storage device to store the representative syllable; a synthesis device for resynthesizing the voiced portion and the representative syllable; and an evaluation device for performing a listening evaluation of the resynthesized representative syllable so as to obtain a satisfactory representative syllable.
With the method and apparatus of the present invention, the speech syllable database of a concatenative Chinese text-to-speech conversion system can be compressed to a few megabytes, and the concatenated synthetic speech produced from the resulting database sounds natural. The representative syllables are stored using hybrid syllable waveform compression, and the quality of the syllables synthesized from them is markedly better than that of syllables synthesized with the direct syllable compression schemes of the prior art.
The present invention is applicable to a variety of portable devices, such as mobile phones, portable electronic dictionaries, portable translation devices, personal digital assistants (PDAs), handheld personal computers, desktop PCs, and various devices that use an embedded functional module to perform Chinese text-to-speech conversion.
Chinese text-to-speech conversion is an important function of handheld devices. An embedded, natural-sounding Chinese text-to-speech conversion system can improve the competitiveness of a handheld device. The present invention provides a novel solution for selecting and generating the pronunciation database.
Embodiment
The present invention uses syllable grouping and hybrid syllable waveform compression to generate a speech database that requires very little memory and can be used in a high-quality embedded text-to-speech conversion system. Unlike the prior art, the present invention adopts a new scheme that reduces the number of stored speech units, compresses the data, and performs prosody modification at the syllable level. The technical scheme of the present invention consists mainly of two parts: reducing the number of speech units by syllable grouping, and compressing the speech database by hybrid syllable waveform compression.
The Chinese syllable grouping method of the present invention is described in detail below with reference to Fig. 1.
In a conventional TTS system, the speech database stores either all recorded syllable waveforms or the syllable waveform parameters obtained by compressing those waveforms directly with a parametric analyzer. The present invention uses syllable grouping and hybrid syllable waveform compression to reduce the size of this database, exploiting the characteristics shared between syllables and the spectral characteristics of each syllable.
The syllable waveforms needed for synthesis according to the present invention are generally selected from a very large pronunciation waveform database, whose size depends on the required synthesis quality. The higher the required voice quality, the more original recorded waveforms are needed. This pronunciation waveform database stores various Chinese sentences and phrases together with the corresponding pronunciation waveforms.
As shown in Fig. 1, in step 1.1 a plurality of syllables is collected from a pronunciation database and divided into syllable groups, wherein the syllables in each group have the same pinyin, forming N syllable groups; each syllable group contains M_n (n = 1, 2, ..., N) syllables. The tones of the syllables can be disregarded in this step.
For example, assigning syllables with the same pinyin to one group yields the following N syllable groups, each containing M_n (n = 1, 2, ..., N) syllables. The size of M_n differs from one syllable group to another. The i-th syllable of the n-th group is denoted S_{n,i}, where n = 1, 2, ..., N and i = 1, 2, ..., M_n.
n = 1: (a5), ah (a2), ..., (a5)
n = 2: like that (ai4), short (ai3), suffer (ai2), sigh (ai5), hinder (ai4), mugwort (ai3), friendly (ai3)
...
n = 10: (ba3), eight (ba1), (ba5), father (ba4), stop (ba4)
...
n = N (452): catch (zhuo1), table (zhuo1), peck (zhuo2), clumsy (zhuo2), ...
Thus, syllable S_{2,3} denotes the third syllable of the second syllable group, namely "suffer (ai2)".
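The grouping of step 1.1 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each syllable is labelled with its pinyin followed by a tone digit (as in "ai4"); the function names `base_pinyin` and `group_by_pinyin` are hypothetical and do not come from the specification.

```python
from collections import defaultdict

def base_pinyin(label):
    """Strip the trailing tone digit, e.g. 'ai4' -> 'ai' (tone is ignored in step 1.1)."""
    return label.rstrip("0123456789")

def group_by_pinyin(labels):
    """Return {pinyin: [labels...]}, keeping the input order inside each group."""
    groups = defaultdict(list)
    for label in labels:
        groups[base_pinyin(label)].append(label)
    return dict(groups)

# Toy corpus: syllable labels in the order they were collected.
corpus = ["a5", "ai4", "ai3", "ba3", "ai2", "a2", "ba1"]
groups = group_by_pinyin(corpus)

# With groups ordered as they first appear, the group "ai" is the second group,
# and its third member plays the role of S_{2,3} in the example above.
third_of_ai = groups["ai"][2]
```

In this toy corpus the group for "ai" holds "ai4", "ai3" and "ai2" in that order, mirroring the indexing convention S_{n,i}.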
In step 1.2, the prosodic feature vector (X) of each syllable in each syllable group is obtained. The prosodic feature vector (X) comprises the lexical tone (t_i), pitch contour (p_i), duration (d_i), energy (that is, the root mean square (RMS) of the amplitude, e_i), and phonetic/co-articulatory environment identity (y_i).
In this prosodic feature vector, the lexical tone (t_i) is the basic tone and denotes the theoretical pronunciation. A Chinese syllable has one of five tones: the first tone (high level), the second tone (rising), the third tone (falling-rising), the fourth tone (falling), and the neutral tone. For example, "ba3" denotes the pinyin "ba" with the third tone. The pitch contour is the acoustic realization of the tone; it is the fundamental frequency of the speech segment as a function of time, and is a vector. The actual pitch contour varies with the specific linguistic context, and the pitch contour of a neutral-tone syllable depends mainly on the lexical tone of the syllable that precedes it. The duration measures the length of a syllable's speech segment and is a scalar. The RMS amplitude of a syllable's speech segment measures the energy of the waveform and is also a scalar. The phonetic/co-articulatory environment parameter is a vector whose components include the position of the syllable in the sentence, phrase or word, and the type of the following syllable (that is, whether it begins with a voiced sound or an unvoiced sound).
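As a rough illustration of the prosodic feature vector X = (t, p, d, e, y), the following Python sketch stores the five components in a small record and computes the two scalars that follow directly from the waveform. The class name, the field names and the integer encoding of the environment parameter are assumptions made for this example only; the pitch contour is taken as given, since the specification does not say how it is extracted.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ProsodicFeatures:
    tone: int                     # t: lexical tone, 1-4, with 5 for the neutral tone
    pitch_contour: List[float]    # p: fundamental frequency over time (a vector)
    duration: float               # d: length of the speech segment in seconds (a scalar)
    rms: float                    # e: RMS amplitude of the waveform (a scalar)
    environment: Tuple[int, int]  # y: hypothetical (position code, next-syllable type)

def rms_amplitude(samples):
    """Root mean square of the sample amplitudes."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def extract_features(tone, pitch_contour, samples, sample_rate, environment):
    """Build the feature record; duration and RMS are derived from the samples."""
    return ProsodicFeatures(
        tone=tone,
        pitch_contour=list(pitch_contour),
        duration=len(samples) / sample_rate,
        rms=rms_amplitude(samples),
        environment=environment,
    )

# Toy syllable: four samples at 8000 Hz, third tone, hypothetical environment code (0, 1).
x = extract_features(3, [110.0, 95.0, 120.0], [3.0, -3.0, 3.0, -3.0], 8000, (0, 1))
```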
In step 1.3, the syllables in each syllable group are divided into a plurality of syllable sub-groups according to the similarity of the prosodic features of all syllables in the group, so that syllables with similar characteristics fall into the same sub-group and these similarities can be exploited in further processing of the syllables in the sub-group. In this step, the syllables in one syllable group are divided, according to their phonetic similarity, into K_1 first sub-groups (denoted H), wherein the syllables in the same first sub-group have similar lexical tone (t) and similar phonetic/co-articulatory environment parameters (y). This step is called phonetic clustering.
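A minimal sketch of the phonetic clustering of step 1.3 is given below. It makes a simplifying assumption that is not in the specification: "similar" is taken to mean "identical", so syllables are bucketed by the exact pair (lexical tone, environment label). The labels and the function name `phonetic_clusters` are hypothetical.

```python
from collections import defaultdict

def phonetic_clusters(syllables):
    """Bucket syllables of one syllable group by (tone, environment).

    syllables: list of (label, tone, environment) tuples.
    Returns {(tone, environment): [labels...]}; the number of keys is K_1.
    """
    clusters = defaultdict(list)
    for label, tone, environment in syllables:
        clusters[(tone, environment)].append(label)
    return dict(clusters)

# Toy syllable group for the pinyin "ba"; environment labels are illustrative only.
ba_group = [
    ("ba3_a", 3, "phrase-initial"),
    ("ba3_b", 3, "phrase-initial"),
    ("ba4_a", 4, "phrase-final"),
    ("ba3_c", 3, "phrase-final"),
]
first_subgroups = phonetic_clusters(ba_group)
k1 = len(first_subgroups)
```

A real implementation would presumably use a looser similarity criterion for the environment parameters; exact matching is used here only to keep the example short.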
In step 1.4, hybrid acoustic/phonetic clustering is performed by weighting the prosodic features of the syllables. The prosodic feature differences between syllables are calculated and analyzed, and the sub-groups are re-divided so as to reduce the number of representative syllables. Using a vector quantization (VQ) algorithm, the syllables in the K_1 first sub-groups are regrouped into K_2 second sub-groups (denoted L), with the goal K_2 < K_1. The size of K_2 depends on the number of syllables M_n in each syllable group, the number of first sub-groups K_1, and the target size of the speech database. Limiting the number of second sub-groups, that is, the number of target groups in the speech database, limits the number of speech units stored in the target speech database. The vector quantization algorithm uses the weighting function W = {Wt, Wp, Wd, We, Wy} to weight the prosodic features of the syllables, where Wt is the weight for the lexical tone of the syllable, Wp the weight for the pitch contour, Wd the weight for the duration, We the weight for the RMS amplitude, and Wy the weight for the phonetic/co-articulatory environment parameters. After the prosodic feature vectors have been weighted, the differences between the prosodic feature vectors X_{n,i} of different syllables are measured. According to these differences, the syllables in the first sub-groups are regrouped so that syllables whose weighted prosodic feature vectors are similar fall into the same sub-group, forming a plurality of second sub-groups. This step is called hybrid acoustic/phonetic clustering.
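The weighted regrouping of step 1.4 might be sketched as follows. The specification names vector quantization but not a particular training procedure, so this example assumes a Lloyd-style (k-means) iteration with a weighted squared Euclidean distance, and assumes each syllable has already been reduced to a fixed-length numeric vector [tone, mean pitch, duration, RMS, environment code]. The function names, the weights and the toy data are all hypothetical.

```python
def weighted_sq_dist(a, b, weights):
    """Weighted squared Euclidean distance between two feature vectors."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))

def vq_cluster(vectors, weights, k, iterations=20):
    """Lloyd-style vector quantization; returns one cluster index per vector."""
    # Deterministic start: take k evenly spaced vectors as the initial centroids.
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid under the weighted distance.
        assignment = [
            min(range(k), key=lambda c: weighted_sq_dist(v, centroids[c], weights))
            for v in vectors
        ]
        # Move each centroid to the mean of its members (unchanged if it has none).
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy vectors: [tone, mean pitch / 100 Hz, duration (s), RMS, environment code].
vectors = [
    [3, 1.10, 0.20, 0.30, 0],
    [3, 1.12, 0.21, 0.31, 0],
    [3, 1.80, 0.35, 0.60, 1],
    [3, 1.82, 0.34, 0.62, 1],
]
weights = [1.0, 2.0, 1.0, 1.0, 0.5]   # stand-ins for Wt, Wp, Wd, We, Wy
labels = vq_cluster(vectors, weights, k=2)
```

Choosing `k` plays the role of K_2: a smaller value yields fewer second sub-groups and therefore fewer stored speech units.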
In step 1.5, one of the ordinary syllables in each second sub-group L is selected as the representative syllable R of that sub-group. The difference prosody vector V between each ordinary syllable and the representative syllable is then calculated. The representative syllable can be selected either manually or automatically. With the automatic method, the mean of the prosodic feature vectors serves as the criterion for selecting the candidate representative. That is, the means of the pitch contour, duration, RMS amplitude, and phonetic/co-articulatory environment parameters of all ordinary syllables in the sub-group are taken as the criterion, and the prosodic feature vector of each ordinary syllable in the sub-group is compared with this mean. The ordinary syllable whose prosodic feature vector differs least from the mean is preferred as the representative syllable. In each sub-group, the prosodic feature vector difference between each ordinary syllable and the representative syllable of the sub-group is calculated. Each ordinary syllable in a sub-group can then be expressed by the representative syllable of the sub-group together with the corresponding prosodic feature vector difference.
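The automatic selection in step 1.5 can be sketched in Python as follows. The example assumes an unweighted squared Euclidean distance to the sub-group mean and fixed-length numeric feature vectors; neither detail is stated in the specification, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def select_representative(vectors):
    """Index of the vector closest (squared Euclidean) to the sub-group mean."""
    mean = mean_vector(vectors)
    return min(
        range(len(vectors)),
        key=lambda i: sum((x - m) ** 2 for x, m in zip(vectors[i], mean)),
    )

def difference_vectors(vectors, rep_index):
    """Difference vector V of every syllable relative to the representative."""
    rep = vectors[rep_index]
    return [[x - r for x, r in zip(v, rep)] for v in vectors]

def restore(rep, diff):
    """Recover an ordinary syllable's features from representative + difference."""
    return [r + d for r, d in zip(rep, diff)]

# Toy sub-group: [mean pitch (Hz), duration (ms), RMS amplitude].
sub_group = [
    [100.0, 200.0, 30.0],
    [110.0, 220.0, 32.0],
    [130.0, 300.0, 40.0],
]
rep_index = select_representative(sub_group)
diffs = difference_vectors(sub_group, rep_index)
recovered = restore(sub_group[rep_index], diffs[2])
```

Only the representative's waveform and the small difference vectors need to be kept; `restore` shows how the prosodic targets of an ordinary syllable are recovered at synthesis time.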
By placing syllables with similar prosodic characteristics in one group and substituting a single representative syllable for the several similar syllables, the method effectively reduces the amount of data to be stored and thus saves storage space. For an ordinary syllable, only the representative syllable that stands for it and the prosodic feature differences between the ordinary syllable and that representative are stored; the original ordinary syllable can then be recovered at synthesis time, so that the natural-sounding character of the synthesized speech is preserved with a small amount of stored data.
The above steps complete a preliminary grouping of the syllables in the pronunciation waveform database and a preliminary selection of representative syllables, and essentially achieve the goal of reducing the number of speech units.
In step 1.8 (comprising steps 1.3 to 1.7), the selected representative syllables and the division into sub-groups are revised iteratively in combination with a listening evaluation. Where a poor representative syllable might otherwise be selected, the listening evaluation provides an effective compensating measure. In step 1.6, a listening evaluation is performed on the selected representative syllables: the waveform of each sub-group's representative syllable is listened to and its tone pattern is checked. If the evaluation result is unsatisfactory, the method returns to the phonetic clustering of step 1.3 and revises the poor-quality grouping by re-dividing the sub-groups or reselecting the sub-group's representative syllable.
If the evaluation result is satisfactory, the resulting second sub-groups of syllables are output. The output comprises, for each sub-group, the representative syllable and the prosody vector difference between each ordinary syllable in the sub-group and that representative syllable.
The hybrid waveform compression of the present invention is based on the waveform characteristics of syllable pronunciation. In Chinese, the pronunciation waveform of a syllable generally comprises two parts, an unvoiced portion and a voiced portion. The unvoiced portion generally lies at the front of the waveform and the voiced portion at the rear; because the two portions occupy clearly distinct positions in the syllable, they can be processed separately. Syllables that have only a voiced portion can be processed directly by parametric analysis.
Moreover, the number of distinct unvoiced portions is limited, and they are generally shared by different syllables. Unlike the voiced portion, the waveform of the unvoiced portion resembles a noise signal, and its amplitude is much smaller than that of the voiced portion. Once a parametric analyzer has compressed such a noise-like signal, there is no guarantee that it can be resynthesized from the compressed data with very little distortion. In other words, if the unvoiced portion is compressed with a parametric analyzer, the resynthesized speech cannot be guaranteed to have natural (human-like) voice quality. Therefore, to ensure that natural-sounding synthetic speech is generated, the present invention compresses the voiced portion with a parametric analyzer and stores it as parameters, while storing the unvoiced portion as a waveform.
A further advantage is that the unvoiced portion is generally shorter than the voiced portion. Storing the unvoiced portion as a waveform, rather than storing the parameters obtained from parametric analysis, therefore does not greatly increase the amount of stored data, yet yields better synthetic speech quality: a good result is obtained at the cost of a small amount of storage. In addition, the differences between the prosodic feature vectors of syllables are mainly manifested in their voiced portions, so prosody modification is generally applied to the voiced portion. This factor both requires and permits the voiced and unvoiced portions to be stored in different ways.
The hybrid syllable (waveform) compression method of the present invention is described in detail below with reference to Fig. 2. Fig. 2 is a flow chart of hybrid waveform compression according to the present invention.
In step 2.1, the waveform of each representative syllable is divided, according to the waveform characteristics of the syllable, into two parts: a voiced portion (W_v) and an unvoiced portion (W_u). This step is called syllable splitting. Some syllables may have no unvoiced portion, in which case the voiced portion is processed directly. The unvoiced portion generally lies at the beginning of the syllable, has a small signal amplitude and, compared with the voiced portion, resembles a noise signal; the voiced portion generally lies at the rear of the syllable and has a larger signal amplitude than the unvoiced portion. The unvoiced portion is stored directly as a waveform, and the voiced portion is compressed by the following steps.
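The specification does not say how the split point is found. The sketch below assumes one simple possibility, a short-time energy threshold: the voiced portion is taken to start at the first frame whose RMS reaches a fixed fraction of the peak frame RMS. The frame length, the ratio and the function names are hypothetical choices for illustration, not the method of the invention.

```python
import math

def frame_rms(samples, frame_len):
    """RMS of each complete, non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def split_syllable(samples, frame_len=4, ratio=0.5):
    """Return (unvoiced, voiced), split at the first frame with RMS >= ratio * peak RMS."""
    energies = frame_rms(samples, frame_len)
    if not energies:                 # too short to analyse: treat as voiced only
        return [], list(samples)
    threshold = ratio * max(energies)
    onset_frame = next(i for i, e in enumerate(energies) if e >= threshold)
    cut = onset_frame * frame_len
    return samples[:cut], samples[cut:]

# Toy waveform: a quiet, noise-like start followed by a louder voiced part.
wave = [1, -1, 1, -1, 1, -1, 1, -1, 8, -8, 8, -8, 9, -9, 9, -9]
w_u, w_v = split_syllable(wave)

# A syllable with no unvoiced portion yields an empty unvoiced part.
no_u, all_v = split_syllable([8, -8, 8, -8, 9, -9, 9, -9])
```

A practical system would more likely combine energy with zero-crossing rate or pitch detection; the threshold rule is used here only because it mirrors the amplitude contrast described above.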
In step 2.6 (comprising steps 2.2 to 2.5), the waveform of the voiced portion is analyzed with the parametric analyzer, and the codebook used for the parametric analysis is designed at the same time. This codebook depends on the speaker who made the recordings and on the speech database. The codebook is then revised iteratively in combination with a listening evaluation. This step is called codebook design.
In step 2.2, the voiced portion is analyzed with the parametric analyzer to obtain the parameters of the voiced portion (W_v) and its codebook, and these parameters are stored in the speech database. This step is called voiced-portion waveform compression.
In step 2.3, the voiced portion is resynthesized from the voiced-portion parameters and the codebook obtained in step 2.2. The voiced portion can be resynthesized with a parametric synthesizer corresponding to the parametric analyzer. Concatenating the unvoiced portion and the voiced portion yields a complete syllable waveform.
In step 2.4, the voiced portion resynthesized from its parameters is combined with the corresponding unvoiced portion to form the waveform of a complete syllable, on which a listening evaluation is performed.
If the evaluation result is unsatisfactory, step 2.5 is performed to revise the codebook of the voiced portion, and step 2.2 is then performed again, the parametric analyzer analyzing the voiced portion anew with the revised codebook. If the evaluation result is satisfactory, the voiced-portion parameters obtained by the parametric analysis are output, giving the hybrid-compressed syllable parameters.
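The analyze, resynthesize, evaluate and revise loop of steps 2.2 to 2.5 can be illustrated with a toy stand-in. In the sketch below a nearest-neighbour scalar quantizer replaces the parametric analyzer, a uniform codebook replaces the designed codebook, and a maximum-error threshold replaces the human listening evaluation. All three substitutions are assumptions made so that the loop can run; none of them is the method described in the specification.

```python
def quantize(values, codebook):
    """Stand-in for step 2.2: index of the nearest codebook entry for each value."""
    return [min(range(len(codebook)), key=lambda j: abs(v - codebook[j])) for v in values]

def reconstruct(indices, codebook):
    """Stand-in for step 2.3: rebuild the values from codebook indices."""
    return [codebook[j] for j in indices]

def uniform_codebook(lo, hi, size):
    """Evenly spaced codebook entries between lo and hi (size >= 2)."""
    step = (hi - lo) / (size - 1)
    return [lo + j * step for j in range(size)]

def design_codebook(values, max_error, size=2, max_size=256):
    """Iterate until the reconstruction passes the (numeric) evaluation."""
    lo, hi = min(values), max(values)
    while True:
        codebook = uniform_codebook(lo, hi, size)
        indices = quantize(values, codebook)
        rebuilt = reconstruct(indices, codebook)
        # Stand-in for the listening evaluation of step 2.4.
        worst = max(abs(a - b) for a, b in zip(values, rebuilt))
        if worst <= max_error or size >= max_size:
            return codebook, indices
        size *= 2                    # stand-in for step 2.5: revise the codebook

values = [0.0, 1.0, 2.0, 3.0, 4.0]   # toy "voiced-portion parameters"
codebook, indices = design_codebook(values, max_error=0.5)
```

The stored result is the list of indices plus the codebook, which is the sense in which the voiced portion is kept "as parameters" rather than as a waveform.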
With the above method and apparatus, the speech syllable database of a concatenative Chinese text-to-speech conversion system can be compressed to a few megabytes. Syllables synthesized with the hybrid syllable waveform compression scheme are of better quality than syllables synthesized with the direct syllable compression schemes of the prior art, and sound natural.
Those skilled in the art will appreciate that the hybrid waveform compression method of the present invention can be used in combination with the syllable grouping method described above, to compress the waveform of the representative syllable of each sub-group, or can be used on its own to compress syllable waveforms as required.
The apparatus for compressing the speech database according to the present invention is briefly described below with reference to Fig. 3 and Fig. 4. The apparatus for compressing the speech database is shown in Fig. 3. It is an apparatus for compressing the speech database of a Chinese text-to-speech conversion system and comprises a device (not shown) for collecting a plurality of syllables from a pronunciation database. The apparatus further comprises: a syllable grouping device 31 for dividing the syllables into a plurality of syllable groups, wherein the syllables in each group have the same pinyin (phonetic spelling); a sub-group division device, comprising a phonetic clustering device 33 and a hybrid clustering device 34, for dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of the prosodic features of all syllables in the group, wherein phonetic clustering is based on the similarity of the lexical tone and the phonetic/co-articulatory environment parameters of the syllables, and hybrid acoustic/phonetic clustering is performed by weighting the prosodic features of the syllables with the weighting function W = {Wt, Wp, Wd, We, Wy}, where Wt is the weight for the lexical tone of the syllable, Wp the weight for the pitch contour, Wd the weight for the duration, We the weight for the RMS amplitude, and Wy the weight for the phonetic/co-articulatory environment parameters; a representative-syllable selection device 35 for selecting, in each sub-group, a representative syllable to represent the other, ordinary syllables in that sub-group; and a storage device (not shown) for storing the selected representative syllables to form a compressed speech database.
The apparatus for compressing the speech database also comprises: a prosodic feature difference processing device (not shown) for calculating, in each sub-group, the prosodic feature differences between each ordinary syllable and the representative syllable, and storing these differences in the compressed speech database; a representative-syllable listening evaluation device 36 for performing a listening evaluation of the selected representative syllable; a representative-syllable judgment device 39 for deciding according to the evaluation result, such that the representative syllable is output if the result is satisfactory and the modification device is invoked otherwise; and a modification device for supplying modification information for the grouping and for the selection of the representative syllable.
The apparatus for compressing the waveform of a representative syllable according to the present invention is shown in Fig. 4. This apparatus comprises: a pronunciation division device 51 for dividing the representative syllable into an unvoiced portion and a voiced portion; and a waveform processing device 56 for storing the unvoiced portion directly as a waveform and storing the voiced portion after compression with a parametric analyzer.
The waveform processing device 56 comprises: a pronunciation synthesis device 53 for resynthesizing the voiced portion and concatenating it with the unvoiced portion to obtain a synthesized representative syllable; a synthesized-syllable listening evaluation device 54 for performing a listening evaluation of the voiced and unvoiced portions of the synthesized representative syllable, so as to obtain a satisfactory hybrid-compressed representative syllable; and a codebook modification device for revising the codebook that the parametric analyzer 52 uses to compress the voiced-portion waveform when the evaluation result obtained by the evaluation device 54 is unsatisfactory.
The scope of protection of the present invention is set out in the appended claims. However, any obvious modification within the spirit of the present invention also falls within the scope of protection of the present invention.