CN1238805C - Method and apparatus for compressing voice library - Google Patents

Method and apparatus for compressing voice library

Info

Publication number
CN1238805C
Authority
CN
China
Prior art keywords
syllable
representative
group
syllables
voice library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CN 02127004
Other languages
Chinese (zh)
Other versions
CN1471027A (en
Inventor
俞振利
岳东剑
黄建成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Serenes Operations
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc
Priority to CN 02127004
Publication of CN1471027A
Application granted
Publication of CN1238805C
Anticipated expiration
Expired - Lifetime

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention relates to a method for compressing the voice library of a Chinese text-to-speech conversion system, comprising the following steps: collecting a plurality of syllables from a pronunciation library and dividing them into a plurality of syllable groups, the syllables in each group sharing the same pinyin; dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of their prosodic features; selecting from each sub-group a representative syllable to stand for the other, ordinary syllables in that sub-group; and storing the selected representative syllables to form a compressed voice library. The method and apparatus of the invention can compress the syllable library of a concatenative Chinese text-to-speech system to a few megabytes. The representative syllables are compressed by hybrid syllable waveform compression, in which each syllable is split into a voiced portion and an unvoiced portion that are processed separately. Syllables synthesized from the resulting voice library retain the character of natural speech.

Description

Method and apparatus for compressing a voice library
Technical field
The present invention relates to a method and apparatus for compressing a voice library, and in particular to a method and apparatus for compressing the voice library of a Chinese text-to-speech conversion system.
Background
In computers, and especially in handheld electronic devices, converting Chinese text to speech output makes human-machine communication more convenient for the user. Current Chinese text-to-speech conversion can produce either machine-like speech or natural speech (imitating a real human voice). Chinese text-to-speech conversion generally requires a voice library: the input text is matched against the data in the library, and the pronunciation data in the library is then used to synthesize speech. Machine-like speech needs only a relatively simple voice library with little data, but it sounds stiff. Imitating a human voice yields high-quality natural speech, but the required voice library is complex and large. Because of size and cost constraints, handheld electronic devices such as mobile phones and personal digital assistants (PDAs) have little storage, generally a few megabytes. This storage limit restricts the use of Chinese text-to-speech conversion in such devices.
At present, high-quality natural speech is generally obtained by matching the input text against a voice library to find the waveforms of the corresponding syllables (the basic pronunciation units) and then concatenating those waveforms. A system that uses this method is called a concatenative Chinese text-to-speech conversion system. To support syllable waveform concatenation, such a system generally uses a voice library, or syllable library, that stores syllable waveforms recorded in various phonetic and prosodic contexts. To generate natural-sounding synthetic speech, the library must store a large number of syllable waveforms, and it often needs hundreds of megabytes of storage. For example, a medium-scale voice library of 100,000 syllables needs roughly several hundred megabytes of memory just to hold the waveforms.
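A back-of-the-envelope estimate shows how a library of 100,000 syllable waveforms can reach hundreds of megabytes. The sampling rate, sample width, and average syllable duration below are illustrative assumptions, not figures taken from the patent.

```python
# Rough size of an uncompressed syllable waveform library.
# All three audio parameters are assumed for illustration only.
SAMPLE_RATE_HZ = 16_000      # assumed sampling rate
BYTES_PER_SAMPLE = 2         # assumed 16-bit PCM
AVG_SYLLABLE_SECONDS = 0.25  # assumed average syllable duration


def library_size_mb(num_syllables: int) -> float:
    """Return the raw PCM size of the library in megabytes (10**6 bytes)."""
    bytes_per_syllable = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * AVG_SYLLABLE_SECONDS
    return num_syllables * bytes_per_syllable / 1e6


print(library_size_mb(100_000))  # size in MB under the assumptions above
```

Under these assumptions each syllable takes about 8 kB, so the total scales linearly with the number of stored syllables; this is the quantity that syllable grouping is meant to reduce.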
To save storage, the prior art compresses the speech waveforms with a parametric analyzer, which can bring the waveforms above down to tens of megabytes. Raising the compression ratio further comes at the cost of synthetic speech quality. The prior art offers no method that compresses such a voice library to a few megabytes while still generating high-quality, natural-sounding synthetic speech from the input text.
Because of size and cost constraints, handheld electronic devices such as mobile phones, portable electronic dictionaries, and personal digital assistants (PDAs) have little storage, generally a few megabytes. Since the prior art cannot compress the voice library to a few megabytes, this storage limit restricts the use of Chinese text-to-speech conversion in such devices.
Summary of the invention
In view of the shortcomings of the prior art, an object of the present invention is to provide a method and apparatus for compressing the voice library of a Chinese text-to-speech conversion system to a few megabytes.
A further object of the invention is to provide an efficient method and apparatus for compressing syllable waveforms that reduces both the amount of data stored and the distortion of the stored syllable waveforms.
A method according to the invention for compressing the voice library of a Chinese text-to-speech conversion system comprises: collecting a plurality of syllables from a pronunciation library and dividing them into a plurality of syllable groups, the syllables in each group having the same pinyin; dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of the prosodic features of all syllables in the group, so that syllables with similar characteristics fall into the same sub-group and these similarities can be exploited in further processing of the sub-group; selecting in each sub-group a representative syllable to stand for the other, ordinary syllables in that sub-group; compressing and storing the selected representative syllables to form a compressed voice library; and, in each sub-group, calculating the prosodic feature differences between each ordinary syllable and the representative syllable and storing those differences in the compressed voice library. By placing syllables with similar prosodic characteristics in one group and substituting a single representative syllable for many similar ones, the method effectively reduces the amount of data to be stored and thus saves storage space.
In the method according to the invention, the prosodic features comprise the syllable's lexical tone, pitch contour, duration, energy (RMS amplitude), and phonetic/co-articulatory environment parameters. The sub-grouping comprises phonetic clustering and hybrid acoustic/phonetic clustering: phonetic clustering is based on the similarity of the syllables' lexical tone and phonetic/co-articulatory environment parameters, while hybrid acoustic/phonetic clustering is performed by weighting the syllables' prosodic features. The prosodic feature differences between syllables are calculated and analyzed, and the sub-groups are re-divided so as to reduce the number of representative syllables. Limiting the number of sub-groups in this step effectively reduces the number of representative syllables that must be stored. For each ordinary syllable, only the representative syllable that stands for it and the prosodic feature differences between the two are stored; the original ordinary syllable can then be recovered at synthesis time. This keeps the stored data small while preserving the natural character of the synthetic speech.
The method according to the invention further comprises performing auditory evaluation on the representative syllable: if the representative syllable is unsatisfactory, a new representative syllable is selected and/or the syllable sub-groups are re-divided; otherwise the representative syllable is stored. Auditory evaluation provides an effective remedy for cases where a poor representative syllable has been chosen.
The method according to the invention further comprises dividing the representative syllable into an unvoiced portion and a voiced portion. This follows from the characteristics of syllable waveforms: the unvoiced portion generally lies at the beginning of the waveform (this part is called the initial), while the voiced portion lies at the rear (this part is called the final). The unvoiced portion is stored directly as a waveform; it contains little data, and direct storage preserves fidelity. The voiced portion is compressed with a parametric analyzer; it has large amplitude, lasts longer, and contains more data, so parametric compression effectively reduces the stored data while maintaining the quality of speech synthesized from the parameters. The prior art applies a parametric analyzer directly to the entire waveform of the representative syllable, and at high compression ratios the unvoiced portion is easily distorted. The present invention splits the waveform of the representative syllable into a voiced portion and an unvoiced portion and processes them separately. At high compression ratios, the syllable waveform resynthesized in this way is of significantly better quality than in the prior art, particularly in the unvoiced portion.
The method according to the invention further comprises resynthesizing the voiced portion and the representative syllable and performing auditory evaluation on the synthesized representative syllable: if it is unsatisfactory, the codebook used for the voiced portion is modified; otherwise the hybrid-compressed representative syllable is stored. Converting a syllable waveform to parameters for storage and resynthesizing it when needed inevitably introduces some distortion. The invention uses auditory evaluation as a corrective measure to reduce unnecessary distortion and achieve natural-sounding speech.
The invention also provides an apparatus for compressing the voice library of a Chinese text-to-speech conversion system, comprising: means for collecting a plurality of syllables from a pronunciation library; grouping means for dividing the syllables into a plurality of syllable groups, the syllables in each group having the same pinyin; sub-group dividing means for dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of the prosodic features of all syllables in the group; representative syllable selecting means for selecting in each sub-group a representative syllable to stand for the other, ordinary syllables in that sub-group; storage means for storing the selected representative syllables to form a compressed voice library; and prosodic feature difference processing means for calculating, in each sub-group, the prosodic feature differences between each ordinary syllable and the representative syllable and storing those differences in the compressed voice library.
The voice library compression apparatus of the invention further comprises weighting means for weighting the prosodic features of the syllables with the weighting function W = {Wt, Wp, Wd, We, Wy}, where Wt is the weight on the syllable's lexical tone, Wp the weight on its pitch contour, Wd the weight on its duration, We the weight on its RMS amplitude, and Wy the weight on its phonetic/co-articulatory environment parameters. By calculating and analyzing the prosodic feature differences between syllables, the apparatus re-divides the sub-groups so as to reduce the number of representative syllables.
The voice library compression apparatus of the invention further comprises: splitting means for dividing the representative syllable into an unvoiced portion and a voiced portion; waveform processing means for storing the unvoiced portion directly as a waveform and storing the voiced portion after compressing it with a parametric analyzer; auditory evaluation means for performing auditory evaluation on the representative syllable, such that if the representative syllable is unsatisfactory the representative syllable selecting means selects a new representative syllable and/or the sub-group dividing means re-divides the syllable sub-groups, and otherwise the storage means stores the representative syllable; synthesizing means for resynthesizing the voiced portion and the representative syllable; and evaluation means for performing auditory evaluation on the synthesized representative syllable so as to obtain a satisfactory representative syllable.
With the method and apparatus of the invention, the syllable library of a concatenative Chinese text-to-speech conversion system can be compressed to a few megabytes, and speech concatenated from the resulting voice library has the character of natural speech. The representative syllables are compressed by hybrid syllable waveform compression, and the quality of the synthesized syllables is significantly better than that obtained with the direct syllable compression schemes of the prior art.
The invention is applicable to various portable devices, such as mobile phones, portable electronic dictionaries, portable translation devices, personal digital assistants (PDAs), and handheld personal computers, as well as to desktop personal computers and any device that uses an embedded functional module to perform Chinese text-to-speech conversion.
Chinese text-to-speech conversion is an important function of handheld devices. An embedded, natural-sounding Chinese text-to-speech conversion system can make a handheld device more competitive. The invention provides a novel solution for selecting and generating the pronunciation database.
Brief description of the drawings
Fig. 1 shows the flow of the Chinese syllable grouping method according to the present invention.
Fig. 2 shows the flow of the method for compressing the waveforms of Chinese representative syllables according to the present invention.
Fig. 3 is a block diagram of the apparatus for compressing a voice library according to the present invention.
Fig. 4 shows the apparatus for compressing the waveforms of representative syllables according to the present invention.
Embodiment
The invention uses syllable grouping and hybrid syllable waveform compression to generate a voice library that needs very little memory and can be used in a high-quality embedded text-to-speech conversion system. Unlike the prior art, the invention adopts a new scheme that reduces the number of stored speech units, compresses the data, and performs prosody modification at the syllable level. The technical solution has two main parts: reducing the number of speech units by syllable grouping, and compressing the voice library by hybrid syllable waveform compression.
The Chinese syllable grouping method of the invention is described in detail below with reference to Fig. 1.
In a conventional TTS system, the voice library stores either all of the recorded syllable waveforms or the syllable waveform parameters obtained by compressing the waveforms directly with a parametric analyzer. The present invention uses syllable grouping and hybrid syllable waveform compression to reduce the size of the library, exploiting the characteristics that syllables share and the spectral characteristics of the individual syllables.
The syllable pronunciation waveforms needed for synthesis are generally selected from a very large pronunciation waveform library, whose size depends on the required synthesis quality: the higher the quality required, the more original recorded waveforms are needed. The pronunciation waveform library stores a variety of Chinese sentences and phrases together with their pronunciation waveforms.
As shown in Figure 1, in step 1.1, collect a plurality of syllables from a pronunciation storehouse, described syllable is divided into a plurality of syllable groups, wherein the syllable in each group has identical phonetic, forms N syllable group; Each syllable group comprises M n(n=1,2 ..., N) individual syllable.In this step, can not consider the tone of these syllables.
For example, assigning syllables with the same pinyin to one group yields the following N syllable groups, where group n contains M_n (n = 1, 2, ..., N) syllables. M_n differs from group to group. The i-th syllable of the n-th group is denoted S_{n,i}, n = 1, 2, ..., N, i = 1, 2, ..., M_n.
n = 1: (a5), Ah (a2), ..., (a5)
n = 2: like that (ai4), short (ai3), suffers (ai2), sound of sighing (ai5), hinder (ai4), Chinese mugwort (ai3), friendly (ai3)
n = 10: (ba3), eight (ba1), (ba5), father (ba4), is stopped (ba4)
...
n = N (452): catch (zhuo1), table (zhuo1), peck (zhuo2), clumsy (zhuo2), ...
Thus syllable S_{2,3} denotes the third syllable of the second syllable group, namely "suffers (ai2)".
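The grouping of step 1.1 can be sketched as follows. This is a minimal illustration that assumes each syllable is labelled with its pinyin plus a trailing tone digit (1 to 5), as in the example above; the function name and the sample labels are hypothetical.

```python
from collections import defaultdict


def group_by_pinyin(syllables):
    """Group (label, pinyin_with_tone) pairs by toneless pinyin.

    The trailing digit (1-5) encodes the tone and is ignored when grouping.
    """
    groups = defaultdict(list)
    for label, pinyin in syllables:
        base = pinyin.rstrip("12345")  # strip the tone digit
        groups[base].append((label, pinyin))
    return dict(groups)


# Hypothetical recorded syllables: (recording id, pinyin with tone digit).
corpus = [("s1", "ai4"), ("s2", "ai3"), ("s3", "ai2"), ("s4", "ba3"), ("s5", "ba1")]
groups = group_by_pinyin(corpus)

# The third member of the "ai" group plays the role of S_{n,3} for that group.
print(groups["ai"][2])
```

Each key of the resulting dictionary corresponds to one syllable group, and the length of its list corresponds to M_n.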
In step 1.2, the prosodic feature vector X of each syllable in each syllable group is obtained. The prosodic feature vector X comprises the lexical tone t_i, the pitch contour p_i, the duration d_i, the energy e_i (that is, the root mean square (RMS) of the amplitude), and the phonetic/co-articulatory environment identity y_i.
In this prosodic feature vector, the lexical tone t_i is the underlying tone and expresses the theoretical pronunciation. Chinese syllables have five tones: the first tone (high level), the second tone (rising), the third tone (falling-rising), the fourth tone (falling), and the neutral tone. For example, "ba3" denotes the pinyin "ba" with the third tone. The pitch contour is the acoustic realization of the tone; it is the fundamental frequency of the speech segment as a function of time, and is a vector. The actual pitch contour varies with the specific linguistic context, and the pitch contour of a neutral-tone syllable depends mainly on the lexical tone of the syllable preceding it. The duration measures the length of a syllable's speech segment and is a scalar. The RMS amplitude of a syllable's speech segment measures the energy of the waveform and is also a scalar. The phonetic/co-articulatory environment parameter is a vector whose components include the position of the syllable in the sentence, phrase, or word, and the type of the following syllable (that is, whether the following syllable begins with a voiced or an unvoiced sound).
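As a rough sketch, the prosodic feature vector could be held in a small record type like the one below. The field names, units, and the integer coding of the environment vector are assumptions made for illustration; the patent does not prescribe a data layout. The RMS helper follows the usual definition, the square root of the mean squared sample value.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ProsodicFeatures:
    tone: int                         # t: lexical tone, 1-4, with 5 = neutral tone
    pitch_contour: Tuple[float, ...]  # p: fundamental frequency over time (vector)
    duration: float                   # d: length of the speech segment (scalar)
    rms: float                        # e: RMS amplitude (scalar)
    environment: Tuple[int, ...]      # y: hypothetical codes for position and next-syllable type


def rms_amplitude(samples: Sequence[float]) -> float:
    """Root mean square of the sample amplitudes."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


samples = [1.0, -1.0, 1.0, -1.0]
x = ProsodicFeatures(
    tone=3,
    pitch_contour=(110.0, 95.0, 120.0),
    duration=0.30,
    rms=rms_amplitude(samples),
    environment=(0, 1),
)
print(x.rms)
```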
In step 1.3, the syllables in each syllable group are divided into a plurality of syllable sub-groups according to the similarity of their prosodic features, so that syllables with similar characteristics fall into the same sub-group and these similarities can be exploited in further processing. Specifically, the syllables in a syllable group are divided, according to their phonetic similarity, into K_1 first sub-groups (denoted H), such that the syllables in any one first sub-group have similar lexical tone (t) and phonetic/co-articulatory environment parameters (y). This step is called phonetic clustering.
In step 1.4, hybrid acoustic/phonetic clustering is performed by weighting the prosodic features of the syllables. The prosodic feature differences between syllables are calculated and analyzed, and the sub-groups are re-divided so as to reduce the number of representative syllables. A vector quantization (VQ) algorithm regroups the syllables of the K_1 first sub-groups into K_2 second sub-groups (denoted L), with the goal K_2 < K_1. The value of K_2 depends on the number of syllables M_n in each syllable group, the number of first sub-groups K_1, and the target size of the voice library. Limiting the number of second sub-groups, that is, the number of target sub-groups in the voice library, limits the number of speech units stored in the target voice library. The VQ algorithm weights the prosodic features with the weighting function W = {Wt, Wp, Wd, We, Wy}, where Wt is the weight on the syllable's lexical tone, Wp the weight on its pitch contour, Wd the weight on its duration, We the weight on its RMS amplitude, and Wy the weight on its phonetic/co-articulatory environment parameters. After the prosodic feature vectors are weighted, the differences between the vectors X_{n,i} of different syllables are measured. Based on these differences, the syllables of the first sub-groups are regrouped so that syllables whose weighted prosodic feature vectors are similar fall into the same sub-group, forming the second sub-groups. This step is called hybrid acoustic/phonetic clustering.
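The weighted regrouping of step 1.4 might look like the sketch below. The patent names vector quantization but does not specify the training procedure or the distance measure; this example assumes a weighted Euclidean distance and plain Lloyd (k-means style) iterations, and it reduces the pitch contour to a single mean value for brevity. The feature layout, weights, and data are all hypothetical.

```python
import math


def weighted_distance(x, y, w):
    """Weighted Euclidean distance; w[k] scales the k-th feature."""
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(x, y, w)))


def vq_cluster(vectors, k, weights, iterations=20):
    """Cluster feature vectors into k sub-groups with Lloyd iterations."""
    centroids = [list(v) for v in vectors[:k]]  # naive initialisation
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid under the weighted distance.
        assignment = [
            min(range(k), key=lambda c: weighted_distance(v, centroids[c], weights))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Assumed feature layout: [tone, mean pitch (Hz), duration (s), RMS, environment code].
vectors = [
    [3, 110.0, 0.30, 0.20, 0],
    [3, 180.0, 0.15, 0.50, 1],
    [3, 112.0, 0.31, 0.21, 0],
    [3, 178.0, 0.16, 0.52, 1],
]
weights = [1.0, 0.01, 10.0, 10.0, 1.0]  # stand-ins for Wt, Wp, Wd, We, Wy
assignment, centroids = vq_cluster(vectors, 2, weights)
print(assignment)
```

Here k plays the role of K_2: choosing a smaller k yields fewer second sub-groups and therefore fewer representative syllables to store.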
In step 1.5, one of the ordinary syllables in each second sub-group L is selected as the representative syllable R of that sub-group, and the prosodic difference vector V between each ordinary syllable and the representative syllable is calculated. The representative syllable may be selected manually or automatically. In automatic selection, the mean of the prosodic feature vectors serves as the criterion: the mean pitch contour, duration, RMS amplitude, and phonetic/co-articulatory environment parameters of all ordinary syllables in the sub-group are computed, and the prosodic feature vector of each ordinary syllable is compared with this mean. The ordinary syllable whose prosodic feature vector differs least from the mean is preferred as the representative syllable. In every sub-group, the prosodic feature vector difference between each ordinary syllable and the representative syllable is then calculated, so that each ordinary syllable can be represented by the representative syllable of its sub-group together with the corresponding difference vector.
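The automatic selection described in step 1.5 can be sketched as picking the member closest to the sub-group mean and then storing each member as a difference from it. The example assumes purely numeric feature vectors and an unweighted squared distance, so the feature with the largest numeric range (pitch here) dominates; a real system would presumably apply weights or normalisation. All names and numbers are hypothetical.

```python
def select_representative(subgroup):
    """Return the index of the member closest to the sub-group mean."""
    n = len(subgroup)
    mean = [sum(col) / n for col in zip(*subgroup)]

    def sq_dist(v):
        return sum((a - m) ** 2 for a, m in zip(v, mean))

    return min(range(n), key=lambda i: sq_dist(subgroup[i]))


def difference_vectors(subgroup, rep_index):
    """Prosodic difference of every member relative to the representative."""
    rep = subgroup[rep_index]
    return [[a - r for a, r in zip(v, rep)] for v in subgroup]


def recover(rep, diff):
    """Rebuild an ordinary syllable's features from representative + difference."""
    return [r + d for r, d in zip(rep, diff)]


# Assumed feature layout: [mean pitch (Hz), duration (s), RMS amplitude].
subgroup = [
    [100.0, 0.30, 0.20],
    [110.0, 0.32, 0.22],
    [125.0, 0.28, 0.21],
]
rep_index = select_representative(subgroup)
diffs = difference_vectors(subgroup, rep_index)
restored = recover(subgroup[rep_index], diffs[0])
print(rep_index, restored)
```

Only the representative's waveform and the small difference vectors need to be kept; the ordinary syllables' prosody is rebuilt from them at synthesis time.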
By placing syllables with similar prosodic characteristics in one group and substituting a single representative syllable for many similar ones, this method effectively reduces the amount of data to be stored and thus saves storage space. For each ordinary syllable, only the representative syllable that stands for it and the prosodic feature differences between the two are stored; the original ordinary syllable can then be recovered at synthesis time. This keeps the stored data small while preserving the natural character of the synthetic speech.
The steps above complete the preliminary grouping of the syllables in the pronunciation waveform library and the selection of representative syllables, and largely achieve the goal of reducing the number of speech units.
In step 1.8 (which comprises steps 1.3 to 1.7), the selected representative syllables and the sub-group division are iteratively revised with the help of auditory evaluation, which provides an effective remedy when a poor representative syllable has been chosen. In step 1.6, the selected representative syllables are auditorily evaluated: the waveform of each sub-group's representative syllable is listened to and its tone pattern is checked. If the result is unsatisfactory, the process returns to the phonetic clustering of step 1.3, and the poor groupings are revised by re-dividing the sub-groups or selecting new representative syllables.
If the result is satisfactory, the resulting second sub-groups are output. The output comprises, for each sub-group, the representative syllable and the prosodic vector differences between each ordinary syllable in the sub-group and that representative syllable.
The hybrid waveform compression of the invention is based on the waveform characteristics of syllable pronunciation. In Chinese, the pronunciation waveform of a syllable generally comprises two parts, an unvoiced portion and a voiced portion. The unvoiced portion generally lies at the front of the waveform and the voiced portion at the rear; because the two occupy clearly different positions in the syllable, they can be processed separately. A syllable that has only a voiced portion can be passed directly to parametric analysis.
Moreover, the number of distinct unvoiced portions is limited, and they are generally shared by different syllables. Unlike the voiced portion, the unvoiced portion has a noise-like waveform, and its amplitude is much smaller than that of the voiced portion. When a parametric analyzer compresses such a noise-like signal, there is no guarantee that the signal resynthesized from the compressed data has low distortion. In other words, compressing the unvoiced portion with a parametric analyzer cannot guarantee that the resynthesized speech has natural (human-voice) quality. Therefore, to ensure natural-sounding synthetic speech, the invention compresses the voiced portion with a parametric analyzer and stores it as parameters, and stores the unvoiced portion as a waveform.
A further advantage is that the unvoiced portion is generally shorter than the voiced portion. Storing the unvoiced portion as a waveform, rather than as parameters after parametric analysis, therefore does not greatly increase the amount of stored data, yet gives better synthetic speech quality: a better result is obtained at a small cost in storage. In addition, the differences between the prosodic feature vectors of different syllables appear mainly in the voiced portion, so prosody modification is generally applied to the voiced portion. This, too, both requires and permits storing the voiced and unvoiced portions in different ways.
The hybrid syllable (waveform) compression method of the invention is described in detail below with reference to Fig. 2, which is a flowchart of the hybrid waveform compression.
In step 2.1, the waveform of each representative syllable is divided, according to the waveform characteristics of the syllable, into two parts: a voiced portion W_v and an unvoiced portion W_u. This step is called syllable splitting. Some syllables have no unvoiced portion, in which case the voiced portion is processed directly. The unvoiced portion generally lies at the beginning of the syllable, has small amplitude, and resembles noise more than the voiced portion does; the voiced portion generally lies at the rear of the syllable and has larger amplitude. The unvoiced portion is stored directly as a waveform, and the voiced portion is compressed by the following steps.
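The patent does not say how the boundary between the two portions is located. The sketch below assumes a simple short-time energy threshold, relying only on the stated property that the unvoiced portion comes first and has much smaller amplitude; the frame length and threshold are arbitrary illustrative values, and a practical system would likely use a more robust voicing detector.

```python
import math


def split_syllable(samples, frame_len=80, threshold=0.1):
    """Split a syllable into (unvoiced, voiced) at the first high-energy frame.

    Assumption: the first frame whose RMS reaches `threshold` starts the
    voiced portion. If the very first frame qualifies, the unvoiced part
    is empty, which models syllables without an unvoiced portion.
    """
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            return samples[:start], samples[start:]
    return samples, []  # no frame reached the threshold


# Synthetic syllable: 160 low-amplitude samples followed by 240 louder ones.
syllable = [0.01] * 160 + [0.5] * 240
unvoiced, voiced = split_syllable(syllable)
print(len(unvoiced), len(voiced))
```

The unvoiced slice would be stored as raw samples, while the voiced slice is handed to the parametric analysis described next.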
In step 2.6 (comprising step 2.2-2.5), utilize the Parametric Analysis device that the waveform of above-mentioned voiced sound part is carried out Parametric Analysis, be designed for the code book of Parametric Analysis simultaneously.This code book depends on the people and the sound bank of recording.In conjunction with Auditory estimating, code book is carried out iterative modifications then, this step is called the code book design.
In step 2.2, utilize the Parametric Analysis device that described voiced sound is partly analyzed, obtain described voiced sound part (W v) parameter and code book thereof, these parameters are stored in the sound bank, this step is called voiced sound portion waveshape compression.
In step 2.3, according to voiced sound parameter and the code book partly that step 2.2 obtains, synthetic again voiced sound part.The voiced sound part can be used and come again synthetic with described Parametric Analysis device relevant parameters compositor.Voiceless sound part and voiced sound partly are stitched together, just can obtain a complete syllable waveform.
In step 2.4, the parameters of the voiced portion are combined with the corresponding unvoiced portion to synthesize the waveform of a complete syllable, which is then subjected to auditory evaluation.
If the evaluation result is unsatisfactory, step 2.5 is carried out to modify the codebook of the voiced portion, and step 2.2 is then repeated, with the parametric analyzer re-analyzing the voiced portion using the modified codebook. If the evaluation result is satisfactory, the voiced-portion parameters produced by the parametric analysis are output, giving the hybrid-compressed syllable parameters.
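The loop formed by steps 2.2 to 2.5 can be sketched as below. In the source the evaluation is an auditory (listening) judgement, so the `evaluate` callback here is only a stand-in for it; `analyze`, `synthesize`, and `revise` are likewise hypothetical callbacks, and the `max_rounds` limit is an added safeguard that the source does not mention.

```python
def design_codebook(voiced, unvoiced, codebook,
                    analyze, synthesize, evaluate, revise, max_rounds=10):
    """Iterate analysis, re-synthesis, and evaluation until the result is accepted.

    Returns (voiced_parameters, final_codebook, rounds_used).
    """
    for round_no in range(1, max_rounds + 1):
        params = analyze(voiced, codebook)         # step 2.2: parametric analysis
        resynth = synthesize(params, codebook)     # step 2.3: re-synthesize voiced part
        syllable = list(unvoiced) + list(resynth)  # splice into a full syllable
        if evaluate(syllable):                     # step 2.4: stand-in for listening test
            return params, codebook, round_no
        codebook = revise(codebook)                # step 2.5: modify the codebook
    raise RuntimeError("no satisfactory codebook within max_rounds")
```

As a toy usage, the codebook can be a scalar quantizer step size that `revise` halves each round, with `evaluate` accepting the syllable once the reconstruction error falls below a chosen bound.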
With the above method and apparatus, the syllable voice library of a concatenative Chinese text-to-speech system can be compressed to a few megabytes. Syllables synthesized with the hybrid syllable waveform compression scheme are of higher quality than those obtained with the direct syllable compression schemes of the prior art, and retain the characteristics of natural speech.
Those skilled in the art will appreciate that the hybrid waveform compression method of the present invention can either be combined with the syllable grouping method described above, to compress the waveform of the syllable representative of each syllable group, or be used on its own to compress syllable waveforms as specific requirements dictate.
The apparatus for compressing a voice library according to the present invention is briefly described below with reference to Fig. 3 and Fig. 4. Fig. 3 shows the apparatus for compressing a voice library according to the present invention. This apparatus compresses the voice library of a Chinese text-to-speech system and includes a device (not shown) for collecting a plurality of syllables from a pronunciation library. The apparatus further comprises: a syllable grouping device 31 for dividing the syllables into a plurality of syllable groups, wherein the syllables in each group have the same pinyin (phonetic spelling); a sub-group dividing device, comprising a phonetic grouping device 33 and a hybrid grouping device 34, for dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of the prosodic features of all the syllables in that group, wherein phonetic grouping is based on the similarity of the base tone and the phonetic/coarticulation context parameters of the syllables, and hybrid acoustic/phonetic grouping is performed by weighting the prosodic features of the syllables with the weighting function W={Wt, Wp, Wd, We, Wy}, where Wt is the weight for the base tone of the syllable, Wp is the weight for the pitch contour, Wd is the weight for the duration, We is the weight for the RMS amplitude, and Wy is the weight for the phonetic/coarticulation context parameters; a syllable representative selection device 35 for selecting, in each sub-group, a syllable representative that represents the other, ordinary syllables in that sub-group; and a storage device (not shown) for storing the selected syllable representatives to form the compressed voice library.
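The grouping, sub-grouping, and representative selection performed by devices 31, 33/34, and 35 can be sketched as follows. The source names the weights W={Wt, Wp, Wd, We, Wy} but gives neither a distance formula nor a clustering rule, so this Python sketch assumes a weighted sum of per-feature differences, a greedy threshold rule for forming sub-groups, and the medoid (smallest total distance to the others) as the representative. The `Syllable` fields, function names, and threshold are hypothetical, and pitch contours are assumed to be sampled at the same number of points.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    pinyin: str
    tone: int        # base tone
    pitch: tuple     # pitch contour samples in Hz
    duration: float  # seconds
    rms: float       # RMS amplitude
    context: str     # phonetic/coarticulation context label


def distance(a, b, w):
    """Weighted prosodic distance (assumed form) using Wt, Wp, Wd, We, Wy."""
    pitch_diff = sum(abs(x - y) for x, y in zip(a.pitch, b.pitch)) / len(a.pitch)
    return (w["Wt"] * (a.tone != b.tone)
            + w["Wp"] * pitch_diff
            + w["Wd"] * abs(a.duration - b.duration)
            + w["We"] * abs(a.rms - b.rms)
            + w["Wy"] * (a.context != b.context))


def group_by_pinyin(syllables):
    """Device 31: one group per pinyin spelling."""
    groups = defaultdict(list)
    for s in syllables:
        groups[s.pinyin].append(s)
    return dict(groups)


def split_into_subgroups(group, w, threshold):
    """Devices 33/34 (assumed rule): join the first sub-group whose seed is close."""
    subgroups = []
    for s in group:
        for sub in subgroups:
            if distance(s, sub[0], w) <= threshold:
                sub.append(s)
                break
        else:
            subgroups.append([s])
    return subgroups


def pick_representative(subgroup, w):
    """Device 35 (assumed rule): the member with the smallest total distance."""
    return min(subgroup, key=lambda c: sum(distance(c, o, w) for o in subgroup))
```

Choosing the medoid keeps the representative an actual recorded syllable rather than an averaged one, which fits the requirement that its waveform be stored.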
The apparatus for compressing a voice library further comprises: a prosodic feature difference processing device (not shown) for calculating, in each sub-group, the prosodic feature differences between each ordinary syllable and the syllable representative, and for storing these differences in the compressed voice library; a syllable representative auditory evaluation device 36 for performing auditory evaluation on the selected syllable representative; a syllable representative judgment device 39 for judging the evaluation result, such that the syllable representative is output if the result is satisfactory and the modification device is invoked otherwise; and a modification device for supplying modification information to the grouping and to the selection of the syllable representative.
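The prosodic feature differences handled by the difference processing device can be illustrated with a short sketch. The source states only that the differences between each ordinary syllable and its representative are calculated and stored; the choice of features (pitch contour, duration, RMS amplitude), the additive form, and the reconstruction step shown here are assumptions, and the function names are hypothetical.

```python
def prosody_difference(common, representative):
    """Difference of an ordinary syllable's prosody from its representative's."""
    return {
        "pitch": [c - r for c, r in zip(common["pitch"], representative["pitch"])],
        "duration": common["duration"] - representative["duration"],
        "rms": common["rms"] - representative["rms"],
    }


def apply_difference(representative, diff):
    """Recover the ordinary syllable's target prosody from the stored difference."""
    return {
        "pitch": [r + d for r, d in zip(representative["pitch"], diff["pitch"])],
        "duration": representative["duration"] + diff["duration"],
        "rms": representative["rms"] + diff["rms"],
    }
```

Under these assumptions, an ordinary syllable costs only a handful of difference values in the compressed library, while its waveform is replaced by that of the representative.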
Fig. 4 shows the apparatus for compressing the waveform of a syllable representative according to the present invention. This apparatus comprises: a pronunciation dividing device 51 for dividing the syllable representative into an unvoiced portion and a voiced portion; and a waveform processing device 56 for storing the unvoiced portion directly as a waveform and for storing the voiced portion after compressing it with a parametric analyzer.
The waveform processing device 56 comprises: a pronunciation synthesizer 53 for re-synthesizing the voiced portion and splicing the voiced and unvoiced portions to obtain a synthesized syllable representative; a synthesized-syllable auditory evaluation device 54 for performing auditory evaluation on the voiced and unvoiced portions of the synthesized syllable representative, so as to obtain a satisfactory hybrid-compressed syllable representative; and a codebook modification device for modifying the codebook that the parametric analyzer 52 uses to compress the voiced-portion waveform when the evaluation result from the evaluation device 54 is unsatisfactory.
The scope of protection of the present invention is set out in the appended claims. Any obvious modification within the spirit of the present invention nevertheless also falls within the scope of protection of the present invention.

Claims (14)

1. A method for compressing the voice library of a Chinese text-to-speech system, comprising collecting a plurality of syllables from a pronunciation library, characterized in that the method comprises:
dividing the syllables into a plurality of syllable groups, wherein the syllables in each group have the same pinyin;
dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of the prosodic features of all the syllables in that syllable group;
selecting, in each sub-group, a syllable representative that represents the other, ordinary syllables in that sub-group;
compressing and storing the selected syllable representatives to form a compressed voice library; and
calculating, in each sub-group, the prosodic feature differences between the ordinary syllables and the syllable representative, and storing the prosodic feature differences in the compressed voice library.
2. The method for compressing a voice library as claimed in claim 1, characterized in that the prosodic features comprise: the base tone of the syllable, the pitch contour, the duration, the RMS amplitude, and the phonetic/coarticulation context parameters.
3. The method for compressing a voice library as claimed in claim 1, characterized in that the method of dividing the syllables into sub-groups comprises phonetic grouping and hybrid acoustic/phonetic grouping, wherein phonetic grouping is based on the similarity of the base tone and the phonetic/coarticulation context parameters of the syllables, and hybrid acoustic/phonetic grouping is performed by weighting the prosodic features of the syllables with the weighting function W={Wt, Wp, Wd, We, Wy}, where Wt is the weight for the base tone of the syllable, Wp is the weight for the pitch contour, Wd is the weight for the duration, We is the weight for the RMS amplitude, and Wy is the weight for the phonetic/coarticulation context parameters.
4. The method for compressing a voice library as claimed in claim 1, characterized in that the method further comprises: compressing the syllable representative with a parametric analyzer.
5. The method for compressing a voice library as claimed in claim 1, characterized in that the method further comprises: performing auditory evaluation on the syllable representative and, if the syllable representative is unsatisfactory, re-selecting the syllable representative and/or re-dividing the syllable sub-groups according to the similarity of the prosodic features, and repeating the above steps until a satisfactory syllable representative is obtained.
6. The method for compressing a voice library as claimed in claim 1, characterized in that the method further comprises:
dividing the syllable representative into an unvoiced portion and a voiced portion;
storing the unvoiced portion directly as a waveform; and
compressing the voiced portion with a parametric analyzer, in conjunction with the parametric analyzer codebook for the syllable representative.
7. The method for compressing a voice library as claimed in claim 6, characterized in that the method further comprises: re-synthesizing the voiced portion and, after splicing the voiced and unvoiced portions, performing auditory evaluation on the voiced and unvoiced portions of the syllable representative.
8. The method for compressing a voice library as claimed in claim 7, characterized in that the method further comprises: if the synthesized syllable representative is unsatisfactory, modifying the codebook used for the voiced portion; otherwise, storing the hybrid-compressed syllable representative in the voice library.
9. An apparatus for compressing the voice library of a Chinese text-to-speech system, comprising a device for collecting a plurality of syllables from a pronunciation library, characterized in that the apparatus further comprises:
a grouping device for dividing the syllables into a plurality of syllable groups, wherein the syllables in each group have the same pinyin;
a sub-group dividing device for dividing the syllables in each syllable group into a plurality of syllable sub-groups according to the similarity of the prosodic features of all the syllables in that syllable group;
a syllable representative selection device for selecting, in each sub-group, a syllable representative that represents the other, ordinary syllables in that sub-group;
a storage device for compressing and storing the selected syllable representatives to form a compressed voice library; and
a prosodic feature difference processing device for calculating, in each sub-group, the prosodic feature differences between the ordinary syllables and the syllable representative, and for storing the prosodic feature differences in the compressed voice library.
10. The apparatus for compressing a voice library as claimed in claim 9, characterized in that the apparatus further comprises: a weighting calculation device for weighting the prosodic features of the syllables with the weighting function W={Wt, Wp, Wd, We, Wy}, where Wt is the weight for the base tone of the syllable, Wp is the weight for the pitch contour, Wd is the weight for the duration, We is the weight for the RMS amplitude, and Wy is the weight for the phonetic/coarticulation context parameters.
11. The apparatus for compressing a voice library as claimed in claim 9, characterized in that the apparatus further comprises: a parametric analyzer for compressing the syllable representative.
12. The apparatus for compressing a voice library as claimed in claim 9, characterized in that the apparatus further comprises: a pronunciation dividing device for dividing the syllable representative into an unvoiced portion and a voiced portion; and a waveform processing device for storing the unvoiced portion directly as a waveform and for storing the voiced portion after compressing it with a parametric analyzer.
13. The apparatus for compressing a voice library as claimed in claim 12, characterized in that the apparatus further comprises: a synthesizer for re-synthesizing the voiced portion and splicing the voiced and unvoiced portions to obtain a synthesized syllable representative; and an auditory evaluation device for performing auditory evaluation on the voiced and unvoiced portions of the synthesized syllable representative, so as to obtain a satisfactory hybrid-compressed syllable representative.
14. The apparatus for compressing a voice library as claimed in claim 12, characterized in that the apparatus further comprises: a parameter synthesizer for splicing the voiced portion and the unvoiced portion of the syllable representative.
CN 02127004 2002-07-25 2002-07-25 Method and apparatus for compressing voice library Expired - Lifetime CN1238805C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 02127004 CN1238805C (en) 2002-07-25 2002-07-25 Method and apparatus for compressing voice library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 02127004 CN1238805C (en) 2002-07-25 2002-07-25 Method and apparatus for compressing voice library

Publications (2)

Publication Number Publication Date
CN1471027A CN1471027A (en) 2004-01-28
CN1238805C true CN1238805C (en) 2006-01-25

Family

ID=34143446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 02127004 Expired - Lifetime CN1238805C (en) 2002-07-25 2002-07-25 Method and apparatus for compressing voice library

Country Status (1)

Country Link
CN (1) CN1238805C (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787072B (en) * 2004-12-07 2010-06-16 北京捷通华声语音技术有限公司 Method for synthesizing pronunciation based on rhythm model and parameter selecting voice
CN102929495A (en) * 2012-09-07 2013-02-13 深圳市朵唯志远科技有限公司 Method and device for implementing dynamic wallpaper, and mobile terminal
CN103700367B (en) * 2013-11-29 2016-08-31 科大讯飞股份有限公司 Realize the method and system that agglutinative language text prosodic phrase divides
CN104916281B (en) * 2015-06-12 2018-09-21 科大讯飞股份有限公司 Big language material sound library method of cutting out and system

Also Published As

Publication number Publication date
CN1471027A (en) 2004-01-28

Similar Documents

Publication Publication Date Title
US9761219B2 (en) System and method for distributed text-to-speech synthesis and intelligibility
CN1169115C (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
CN100347741C (en) Mobile speech synthesis method
Bulyko et al. Joint prosody prediction and unit selection for concatenative speech synthesis
CN1167307A (en) Audio-frequency unit selecting method and system for phoneme synthesis
CN1356687A (en) Speech synthesis device and method
MXPA06003431A (en) Method for synthesizing speech.
CN1645478A (en) Segmental tonal modeling for tonal languages
CN1259631C (en) Chinese test to voice joint synthesis system and method using rhythm control
CN1819017A (en) Method for extracting feature vectors for speech recognition
CN1924994B (en) Embedded language synthetic method and system
CN101901598A (en) Humming synthesis method and system
CN106295717A (en) A kind of western musical instrument sorting technique based on rarefaction representation and machine learning
CN1212601C (en) Imbedded voice synthesis method and system
US20070073542A1 (en) Method and system for configurable allocation of sound segments for use in concatenative text-to-speech voice synthesis
AU2015411306A1 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN1238805C (en) Method and apparatus for compressing voice library
CN1811912A (en) Minor sound base phonetic synthesis method
CN1956057A (en) Voice time premeauring device and method based on decision tree
CN102063897B (en) Sound library compression for embedded type voice synthesis system and use method thereof
CN1032391C (en) Chinese character-phonetics transfer method and system edited based on waveform
CN1534595A (en) Speech sound change over synthesis device and its method
CN100337104C (en) Voice operation device, method and recording medium for recording voice operation program
CN114944146A (en) Voice synthesis method and device

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NUANCE COMMUNICATIONS INC

Free format text: FORMER OWNER: MOTOROLA INC.

Effective date: 20100916

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: ILLINOIS, USA TO: MASSACHUSETTS, USA

TR01 Transfer of patent right

Effective date of registration: 20100916

Address after: Massachusetts, USA

Patentee after: Nuance Communications, Inc.

Address before: Illinois, USA

Patentee before: Motorola, Inc.

TR01 Transfer of patent right

Effective date of registration: 20200925

Address after: Massachusetts, USA

Patentee after: Serenes operations

Address before: Massachusetts, USA

Patentee before: Nuance Communications, Inc.

CX01 Expiry of patent term

Granted publication date: 20060125