CN102314873A - Coding and synthesizing system for voice elements

Info

Publication number: CN102314873A
Application number: CN2010102151351A
Authority: CN
Inventors: 孟智平
Original assignee: SHANGHAI SHIJIA INFORMATION TECHNOLOGY CO LTD
Current assignee: SHANGHAI SHIJIA INFORMATION TECHNOLOGY CO LTD
Priority date: 2010-06-30
Filing date: 2010-06-30
Publication date: 2012-01-11

Abstract

The invention discloses a coding and synthesizing system for voice elements, which can be used for low-bandwidth, high-quality voice transmission. Building on digital voice transmission, the system takes constructed voice elements as the objects to be coded: a voice element model base is built, the voice elements are represented uniformly by codes, and voice is then synthesized from those codes. In the method, everyday speech is analyzed and a clustering algorithm is used to build the voice element model base. An automatic voice element splitting algorithm then segments the incoming continuous voice stream into voice elements, the Mel-frequency cepstral coefficient (MFCC) features of each voice element are extracted, the code corresponding to each voice element is obtained by matching it against the voice elements in the model base, and that code is used in place of the voice element for coding. During synthesis, the voice elements corresponding to the received codes are retrieved from the model base, their spectral envelopes are processed mathematically (for example by interpolation and fitting), and smoothly transitioning voice is formed.

Description

A coding and synthesis system for speech primitives

Technical field

The present invention relates to fields such as voice coding, voice transmission and voice calls, and in particular to a coding and synthesis system for speech primitives.

Background technology

With the development of modern network technology, voice signals are increasingly carried over the Internet; in particular, the rapid spread of online chat tools has made Internet telephony a popular means of communication. Most Internet telephony today uses general-purpose coding techniques such as G.711, G.723, G.726 and G.729, and voice carried over networks mostly uses medium- and low-rate coding with a relatively high compression ratio. Low-rate voice compression makes channel transmission easier and saves storage space, but because most voice coding is lossy, voice quality inevitably suffers. What these techniques have in common is that they use prior knowledge of human auditory perception to compress voice lossily. Patent No. 00126112.6 discloses a low-rate voice compression coding method that uses single frames, variable frame length and adaptive bit allocation within the frame, which further improves compression capability and hence data transmission efficiency. All of these coding schemes are lossy compression schemes designed around what the human ear can tolerate, with the aim of reducing the coding rate. In fact, if only human speech is to be coded, and other material such as music is not involved, the compression ratio can be improved further.

Phonetic research shows that the phoneme is the smallest unit of speech distinguished by sound quality. In terms of articulation, all speech that people produce is made up of phonemes; one phoneme, or a combination of several, forms a syllable, and in Chinese the pronunciation of each character is one syllable. Statistical analysis shows that the number of phonemes in human speech is in fact limited, and that some phonemes are combinations of others, so for each language the basic phonemes that characterize its pronunciation can be enumerated. According to the results published by the International Phonetic Association in 2005, the known speech sounds of the world comprise 59 pulmonic sounds, 14 non-pulmonic sounds, 12 other consonants and 28 simple vowels; all other sounds are combinations of these.

In network voice transmission or voice calls, the listener usually cares only about the speech of the speaker. If the transmitted content consists only of human speech, with no other sounds or with other sounds filtered out, voice transmission can be compressed further than existing methods allow.

In addition, analysis of the waveforms and spectral envelopes of continuous speech streams shows that many waveforms are identical or very similar, whether within a single continuous stream or across different streams. If these waveforms are processed before coding (waveform segments with common features are analyzed, a waveform model library is built, and each distinct waveform is given a number), the existing approach of sampling and coding frame by frame can be improved: only the number of the corresponding waveform needs to be coded, which greatly improves coding efficiency.

The present invention takes the speech primitive as the coding unit and provides an improved voice coding scheme. From the continuous speech data obtained, the scheme extracts speech primitives and builds a speech primitive model library; an incoming continuous speech stream is segmented, the resulting speech primitives are matched against those in the model library, and the speech primitive number of the current speech is obtained. A speech signal that previously had to be described by a spectrum of hundreds of dimensions, or a cepstrum of tens of dimensions, can thus be described by a single integer. On decoding, the actual spectral signal is retrieved from the library by that integer and the speech is reconstructed, which greatly increases the compression ratio.

Summary of the invention

To compress and encode voice stream data so that it can be transmitted effectively at low bandwidth or under poor network conditions, the present invention first discloses a method for generating a speech primitive model library, comprising the following steps:

Obtain a large amount of voice stream sample data and segment it to obtain a corpus whose units are distinct phonemes or distinct waveforms; the basic units of the corpus are called speech primitives;

Extract the features of the speech primitives to form feature vectors;

Perform fuzzy clustering on the speech primitive feature vector samples, dividing all data samples into N classes and obtaining the corresponding cluster centres and membership functions;

Analyze the characteristics of each class of speech primitive and determine the minimum set of speech primitives required to build the speech primitive model library;

Analyze the speech characteristics of each class of speech primitive to obtain the spectral envelope features of each class, and store them to form the speech primitive model library;

Segmenting the voice stream data means segmenting the continuous speech stream in units of phonemes or frames;

Segmenting in units of phonemes means using an automatic phoneme segmentation algorithm to cut the continuous voice stream into a set of distinct phonemes;

Segmenting in units of frames means cutting the continuous voice stream, frame by frame at a given frame duration, into a set of distinct speech waveforms;

The speech primitive model library is the minimum library of phoneme samples, or of speech waveform samples, required to form intelligible speech;

The automatic phoneme segmentation algorithm comprises the following steps:

Automatically segment the continuous speech stream into a sequence of syllables;

Analyze the phoneme composition of each syllable;

If the syllable consists of a single phoneme, cut the syllable as that phoneme;

If the syllable consists of several phonemes, segment it further until it is cut into separate single phonemes;

Use any existing fundamental-frequency extraction algorithm, such as AMDF, AC, CC or SHS, to extract the fundamental frequency F0 of each phoneme;

Use Mel-frequency cepstral coefficients (MFCC) as the speech signal feature parameters and extract the spectral envelope of each phoneme;

Train and test a hidden Markov model on the phoneme feature parameter sample set to determine the model parameters; the trained and tested hidden Markov model can then be used to automatically segment the phonemes contained in a continuous speech stream.

The method of segmenting the voice stream to obtain distinct waveforms further comprises:

Cut the waveform of the continuous speech stream at equal time-frame boundaries to obtain the set of distinct speech waveforms under the equal-frame condition;

Or cut the waveform of the continuous speech stream at variable time-frame boundaries to obtain the set of distinct speech waveforms under the variable-frame condition;

Use any existing fundamental-frequency extraction algorithm, such as AMDF, AC, CC or SHS, to extract the fundamental frequency F0 of each waveform segment after cutting;

Use Mel-frequency cepstral coefficients (MFCC) as the speech signal feature parameters and extract the spectral envelope of each waveform segment.
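The fundamental-frequency step can be illustrated with the autocorrelation (AC) family of methods that the text lists. The following is only a minimal sketch, not the patent's implementation: the function name, the 60 to 400 Hz search range and the peak-picking rule are all assumptions made for illustration.

```python
import math


def estimate_f0_autocorr(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from the lag with the largest autocorrelation.

    The search range f0_min..f0_max is a hypothetical choice, not a value
    taken from the patent. Returns 0.0 for silent or aperiodic frames.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]            # remove any DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return 0.0
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(n - 1, int(sample_rate / f0_min))
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        acc = sum(x[i] * x[i + lag] for i in range(n - lag))
        score = acc / energy                 # normalized by the frame energy
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Usage: a 25 ms frame of a 200 Hz sine sampled at 16 kHz.
frame = [math.sin(2.0 * math.pi * 200.0 * i / 16000.0) for i in range(400)]
f0 = estimate_f0_autocorr(frame, 16000)
```

A production pitch tracker would normally add voicing decisions and octave-error handling, which this sketch leaves out.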

The process of generating the speech primitive model library further comprises the following steps:

Use fuzzy clustering to perform cluster analysis on the phoneme set or waveform set, dividing the phonemes or waveforms into N classes;

Analyze the speech features of each class of phonemes or waveforms, and take the cluster centre point, or a suitable combination of other points, as the object that stands in for that class's phoneme set or waveform set; that is, one phoneme or waveform is extracted from each class to represent it, giving N phonemes or N waveforms in total;

Determine the fundamental frequency F0 and spectral envelope of the N extracted phonemes or N waveforms;

Assign numbers to the N phonemes or N waveforms according to a defined rule, and store their information in number order to form the speech primitive model library.

The invention also discloses a voice coding method based on the speech primitive model library, comprising the following steps:

Automatically segment the continuous voice stream to obtain the speech primitives and their fundamental frequency F0, and extract the spectral envelope of each speech primitive; a speech primitive here is a phoneme, an equal-frame speech waveform or a variable-frame speech waveform;

Match each extracted speech primitive against the speech primitives in the model library; if the match succeeds, return the number of the corresponding speech primitive in the model library;

Encode the returned speech primitive number, the fundamental frequency F0 of the speech primitive and other related information in a defined format;

Further compress the coded data with a compression algorithm, and transmit the compressed voice packets to the destination over an IP network or telephone system in packet-switched or circuit-switched form;

The speech primitive matching comprises the following steps:

Collect the continuous voice stream;

Analyze the continuous speech stream and divide it with the automatic speech primitive segmentation algorithm into a speech primitive sequence, that is, a phoneme sequence or a waveform sequence;

Pattern-match each segmented speech primitive against the speech primitives in the model library, either directly or after transformation or error processing;

If the match succeeds, return the number of the matching speech primitive and other related information;

If the match fails, apply the corresponding fault-tolerance method;

The speech primitive transformation means analyzing and handling abnormal speech primitives by curve fitting and noise-error processing, so that they can be matched against the speech primitives in the model library;

The curve fitting of a speech primitive means fitting the waveform curve of a speech primitive with incomplete information, by least squares, B-spline or cubic-spline interpolation, to restore the primitive's original waveform;

The speech primitive error processing means applying a speech enhancement algorithm to the speech primitive to remove noise, increase intelligibility and improve naturalness;

The fault-tolerance method means applying a fault-tolerance algorithm to speech primitives that failed to match, making the speech coding more robust.

The encoding process comprises the following steps:

Obtain the speech primitive number, the fundamental frequency F0 of the speech primitive and other related information;

Analyze the speech primitive number, the fundamental frequency F0 and the other related information to determine a suitable coding method;

Encode this information with one of the coding methods such as LZW, Huffman, Manchester or unipolar coding;

The resulting character string is called the speech primitive coded string.
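As an illustration of what a speech primitive coded string could look like, the sketch below packs each primitive into two bytes. This is a hypothetical fixed-width layout, not one of the codes named above (LZW, Huffman, Manchester, unipolar): the one-byte primitive number, the one-byte F0 code and the 2 Hz quantization step `F0_STEP_HZ` are all assumptions.

```python
import struct

F0_STEP_HZ = 2.0  # hypothetical quantization step: one byte covers 0..510 Hz


def encode_primitive(number, f0_hz):
    """Pack one primitive as two unsigned bytes: (number, quantized F0)."""
    if not 0 <= number <= 255:
        raise ValueError("primitive number must fit in one byte")
    f0_code = max(0, min(255, round(f0_hz / F0_STEP_HZ)))
    return struct.pack("BB", number, f0_code)


def decode_primitive(chunk):
    """Inverse of encode_primitive; F0 comes back at quantizer resolution."""
    number, f0_code = struct.unpack("BB", chunk)
    return number, f0_code * F0_STEP_HZ


def encode_stream(primitives):
    """Concatenate the two-byte codes of a sequence of (number, f0_hz) pairs."""
    return b"".join(encode_primitive(n, f) for n, f in primitives)


def decode_stream(data):
    """Split a coded string back into (number, f0_hz) pairs."""
    return [decode_primitive(data[i:i + 2]) for i in range(0, len(data), 2)]


# Usage: three primitives become a six-byte coded string.
coded = encode_stream([(17, 120.0), (3, 118.0), (200, 0.0)])
```

Any "other related information" mentioned in the text would need extra fields, which this sketch does not attempt to define.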

The further compression of the coded data comprises the following steps:

Receive the speech primitive coded string;

Analyze the coded string with a compressibility analysis algorithm; if it can be compressed further, compress it with a compression algorithm and then package and transmit the compressed speech primitive packet;

If the coded string cannot be compressed further, do not compress it, and package and transmit the uncompressed speech primitive packet directly;

Packaging and transmission means using the IP network protocols or the relevant circuit-switching protocols to send the data packet over an IP network or telephone system, in packet-switched or circuit-switched form, to the destination.
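The "compress only when it helps" decision can be sketched as follows. The patent does not name its compressibility analysis or its compression algorithm, so this sketch simply tries DEFLATE through Python's `zlib` module as a stand-in and keeps the result only if it is shorter; the one-byte `RAW`/`DEFLATE` flag is a hypothetical framing convention.

```python
import zlib

RAW, DEFLATE = 0, 1  # hypothetical one-byte flags telling the receiver what follows


def pack_for_transmission(coded):
    """Compress the coded string only if that actually makes it shorter."""
    compressed = zlib.compress(coded, 9)
    if len(compressed) < len(coded):
        return bytes([DEFLATE]) + compressed
    return bytes([RAW]) + coded          # incompressible: send as-is


def unpack_received(packet):
    """Receiver side: undo pack_for_transmission."""
    if not packet:
        raise ValueError("empty packet")
    flag, body = packet[0], packet[1:]
    if flag == DEFLATE:
        return zlib.decompress(body)
    if flag == RAW:
        return body
    raise ValueError("unknown packet flag")


# Usage: a long, repetitive coded string is compressed; a two-byte one is not.
long_packet = pack_for_transmission(bytes([5, 60]) * 200)
short_packet = pack_for_transmission(bytes([5, 60]))
```

The flag byte is what lets the receiver choose "the decompression algorithm corresponding to the compression algorithm" without out-of-band signalling.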

The present invention also provides a speech decoding method based on the speech primitive model library, comprising the following steps:

The receiver receives the compressed speech primitive data packet;

Decompress the packet with the decompression algorithm corresponding to the compression algorithm;

Obtain the speech primitive coded string from the decompressed packet;

Reverse the speech primitive coding algorithm on the coded string to obtain the original speech primitive data string;

Obtain the speech primitive number, the speech primitive fundamental frequency F0 and other related information from the data string;

Look up the speech primitive model library by speech primitive number, retrieve the speech features of the corresponding speech primitive, and synthesize speech from them;

Using the speech synthesis method, restore the transmitted speech primitives to intelligible, clear speech;

The speech synthesis method further comprises the following steps:

Check the received speech primitive number; if the value is valid, query the speech primitive model library with it; otherwise apply fault-tolerance processing or ignore the speech primitive;

Using the speech primitive number as the search key, retrieve the corresponding speech primitive (a phoneme or a waveform) from the model library;

Synthesize speech from the speech features of the retrieved speech primitive together with the received fundamental frequency F0 and other related information.
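The receiver-side lookup, including the choice between fault-tolerant handling and ignoring an invalid number, might be sketched like this. The dictionary-based library and the `on_error` policy names are assumptions for illustration; the patent does not specify what its fault-tolerance processing does.

```python
def lookup_primitives(numbers, model_library, on_error="skip"):
    """Resolve received primitive numbers against the model library.

    model_library: dict mapping primitive number -> stored primitive record.
    on_error: "skip" ignores an unknown number; "repeat" reuses the previous
    valid record instead (both policies are hypothetical examples).
    """
    resolved = []
    previous = None
    for number in numbers:
        entry = model_library.get(number)
        if entry is None:
            if on_error == "repeat" and previous is not None:
                entry = previous
            else:
                continue                 # ignore this speech primitive
        resolved.append(entry)
        previous = entry
    return resolved


# Usage: number 7 is not in this toy two-entry library.
library = {0: "primitive-a", 1: "primitive-b"}
skipped = lookup_primitives([0, 7, 1], library)
repeated = lookup_primitives([0, 7, 1], library, on_error="repeat")
```

Repeating the previous primitive keeps the frame timing intact, whereas skipping shortens the output; which is preferable depends on the application.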

The present invention also provides a voice coding and synthesis method based on speech primitives, comprising the following steps:

Obtain a large amount of voice stream sample data and process it to build the speech primitive model library;

Segment the acquired continuous speech stream to obtain the speech primitives and their fundamental frequency F0; match each speech primitive against the speech primitives in the model library to obtain the corresponding speech primitive number; encode the speech primitive number, the speech primitive fundamental frequency F0 and other auxiliary speech feature information in a defined format; further compress the coded packet; and transmit the compressed voice packet to the destination over an IP network or telephone network;

After the receiver receives the compressed voice packet, decompress it with the corresponding decompression algorithm, look up the speech primitive model library by speech primitive number, retrieve the speech features of that speech primitive, and restore the speech using the fundamental frequency F0 and the other auxiliary information.

The invention also discloses a voice coding and synthesis system based on speech primitives, comprising the following modules: a preprocessing module, a speech coding module and a speech decoding module;

The preprocessing module collects and analyzes continuous speech streams, segments them into speech primitive sequences, performs cluster analysis on a large number of speech primitives with a clustering algorithm, and on that basis builds the speech primitive model library for use by the speech coding module and the speech decoding module;

The speech coding module segments the received voice stream to obtain the speech primitives and their fundamental frequency F0, obtains the number of each speech primitive from the model library with the speech primitive matching algorithm, encodes the speech primitive number, the fundamental frequency F0 and other auxiliary information with the corresponding coding algorithm, further compresses the result with a compression algorithm, and then packages and sends it;

The speech decoding module receives the voice packets sent by the speech coding module, decompresses them, obtains the speech primitive number, queries the speech primitive model library with that number as the search key, extracts the corresponding speech primitive information, and finally restores the speech with the speech synthesis algorithm.

The voice coding and synthesis system based on speech primitives comprises a voice transmitting end and a voice receiving end;

The voice transmitting end comprises the speech primitive model library and the speech coding module;

The voice receiving end comprises the speech primitive model library and the speech decoding module.

With the method provided by the invention, voice transmission only requires the speech primitive's number in the speech primitive model library, the fundamental frequency signal and the phoneme tone code. For example, if 256 clusters are used to describe human speech and the fundamental frequency is recorded in one byte, each frame of speech (typically 25 ms, which takes 800 bytes in 16 kHz, 16-bit PCM) needs only 2 bytes.

After the voice packets reach the destination, the speech decoding module decodes the received speech data and the speech synthesis method completes the synthesis.
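The figures in this example can be checked with a few lines of arithmetic. The sketch below uses only the values stated in the text (16 kHz, 16-bit PCM, 25 ms frames, 256 clusters, one byte of F0) and assumes non-overlapping frames; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2      # 16-bit PCM
FRAME_MS = 25             # frame length in milliseconds
NUM_CLUSTERS = 256        # size of the speech primitive model library
F0_BYTES = 1              # fundamental frequency recorded in one byte


def pcm_bytes_per_frame():
    """Bytes needed to store one frame as raw PCM."""
    samples = SAMPLE_RATE_HZ * FRAME_MS // 1000
    return samples * BYTES_PER_SAMPLE


def coded_bytes_per_frame():
    """Bytes needed for the primitive number plus the F0 byte."""
    index_bits = (NUM_CLUSTERS - 1).bit_length()   # 8 bits index 256 entries
    index_bytes = (index_bits + 7) // 8
    return index_bytes + F0_BYTES


def bitrate_bps(bytes_per_frame):
    """Bit rate for back-to-back (non-overlapping) frames."""
    return bytes_per_frame * 8 * 1000 / FRAME_MS


ratio = pcm_bytes_per_frame() / coded_bytes_per_frame()
```

Under these assumptions the raw PCM stream runs at 256 kbit/s and the coded stream at 640 bit/s, a 400:1 reduction before any further compression.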

Speech synthesis begins by retrieving the spectral envelope features from the speech primitive model library according to the speech primitive number. Because template matching may produce classification errors, the retrieved features must be smoothed: if adjacent templates are too far apart, the listener hears irritating noise. Mapping from template number to features is therefore not as simple as taking out the template mean. The template library also stores the first-order and second-order differences of each feature, and at decoding time least squares is used to solve for the dynamic spectral envelope that minimizes the matching error together with the first-order and second-order difference errors.

Finally, an excitation source with a flat spectral envelope is generated at the fundamental frequency F0, and this signal is filtered by the spectral envelope to synthesize the speech.
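One way to read the least-squares smoothing is as a small linear problem per feature dimension: find the track whose values stay close to the template means while its first and second differences stay close to the stored differences. The sketch below is an interpretation under stated assumptions, not the patent's formulation: the backward first difference, the centred second difference, the weights `w1` and `w2`, and the plain Gaussian-elimination solver are all illustrative choices.

```python
def solve_linear(a, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def smooth_track(means, deltas, delta2s, w1=1.0, w2=1.0):
    """Least-squares track c for one feature dimension.

    Minimizes sum (c[t] - means[t])^2
            + w1 * sum ((c[t] - c[t-1]) - deltas[t])^2
            + w2 * sum ((c[t+1] - 2*c[t] + c[t-1]) - delta2s[t])^2.
    """
    t_len = len(means)
    rows = []  # each row: (list of (index, coefficient), target, weight)
    for t in range(t_len):
        rows.append(([(t, 1.0)], means[t], 1.0))
        if t >= 1:
            rows.append(([(t, 1.0), (t - 1, -1.0)], deltas[t], w1))
        if 1 <= t < t_len - 1:
            rows.append(([(t + 1, 1.0), (t, -2.0), (t - 1, 1.0)], delta2s[t], w2))
    # Accumulate the normal equations (A^T W A) c = A^T W b.
    ata = [[0.0] * t_len for _ in range(t_len)]
    atb = [0.0] * t_len
    for terms, target, weight in rows:
        for i, ci in terms:
            atb[i] += weight * ci * target
            for j, cj in terms:
                ata[i][j] += weight * ci * cj
    return solve_linear(ata, atb)


# Usage: an isolated outlier in the template means is pulled toward its neighbours.
track = smooth_track([0.0, 0.0, 10.0, 0.0, 0.0], [0.0] * 5, [0.0] * 5)
```

When the means and stored differences are mutually consistent, the solution reproduces the means exactly; a mismatched template, such as the outlier in the usage example, is attenuated instead of being played back verbatim.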
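For a voiced frame, a flat-spectrum excitation at F0 is a pulse train whose harmonics all have equal amplitude, so filtering it with a zero-phase filter is equivalent to summing harmonics weighted by the envelope. The sketch below uses that harmonic-summation shortcut in place of an explicit filter; the zero-phase assumption, the callable `envelope` and the handling of unvoiced frames are illustrative choices, not details from the patent.

```python
import math


def synthesize_frame(f0_hz, envelope, sample_rate, num_samples):
    """Synthesize one voiced frame by summing envelope-weighted harmonics of F0.

    envelope: callable mapping a frequency in Hz to a linear amplitude
    (a hypothetical interface for the decoded spectral envelope).
    """
    if f0_hz <= 0.0:
        return [0.0] * num_samples       # unvoiced frames are outside this sketch
    nyquist = sample_rate / 2.0
    harmonics = []
    k = 1
    while k * f0_hz < nyquist:
        harmonics.append((k * f0_hz, envelope(k * f0_hz)))
        k += 1
    return [
        sum(a * math.cos(2.0 * math.pi * f * t / sample_rate) for f, a in harmonics)
        for t in range(num_samples)
    ]


# Usage: a toy envelope that emphasizes low frequencies.
frame = synthesize_frame(120.0, lambda f: 1.0 / (1.0 + f / 500.0), 16000, 400)
```

A fuller synthesizer would also need a noise excitation for unvoiced sounds and overlap-add between frames to avoid discontinuities.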

The beneficial effects of the present invention mainly include:

(1) Compared with earlier methods that sample and code the speech frame by frame, the present invention codes in units of speech primitives; because the number of speech primitives in any language is limited, coding in units of speech primitives reduces the coding space;

(2) By building a speech primitive model library, the present invention replaces the sample points used in earlier coding methods with the number of the corresponding speech primitive model, that is, it replaces many values with a single value, which shortens the coded string and improves coding efficiency;

(3) On top of coding with speech primitive numbers, the present invention uses a compression algorithm to analyze compressibility and compress further, so that voice communication can be carried out reliably even when network performance is poor and bandwidth is limited;

(4) The present invention is a method for voice coding, transmission and synthesis when network performance is at its limit, and can be used to meet the demand for voice communication in certain special circumstances.

Description of drawings

Fig. 1 is the overall framework diagram of the system;

Fig. 2 is a diagram of MFCC feature extraction;

Fig. 3 is a flowchart of phoneme segmentation.

Embodiment

A speech primitive in the present invention can be a phoneme, or a waveform cut with equal or variable frame lengths; different kinds of speech primitive give rise to different speech primitive model libraries. In a specific implementation, one of these model libraries can be used as the basis for coding and decoding the transmitted speech, or several model libraries can be combined to code more complex speech in special circumstances.

The basic idea of the invention is as follows. A large number of voice stream samples are collected; the continuous voice streams are automatically segmented into speech primitives to form a speech primitive set; the features of the speech primitives are extracted; and the set is clustered by fuzzy clustering to build the speech primitive model library. With this model library as the basis, each incoming continuous speech stream is automatically segmented into speech primitives, the model closest to the current speech primitive is found in the library, and the number of that model, together with other related information, is voice-coded and transmitted to the receiver. On receiving the voice packet, the receiver's speech decoding module looks up the speech primitive model library by the received speech primitive number, re-estimates the speech envelope from context, and synthesizes speech in combination with the fundamental frequency.

Fig. 1 is the overall framework diagram of the system of the invention.

First, at 101, a hidden Markov model (HMM) is used to automatically segment the continuous speech stream samples into speech primitives, forming the corpus;

At 102, MFCC features are extracted from each speech primitive using the Mel-frequency cepstral coefficient (MFCC) method of Fig. 2;

MFCC is defined as the real cepstrum of the windowed short-time signal obtained from the speech signal after an FFT. It differs from the ordinary real cepstrum in that it uses a non-linear frequency scale that approximates the human auditory system.

Once the features of the speech primitives have been extracted with the MFCC algorithm, each speech primitive can be expressed as a feature vector, and the corpus becomes a library of speech primitive feature vectors.

At 103, the speech primitive set is clustered by the fuzzy clustering method of Fig. 3 according to the MFCC features of the speech primitives; depending on the characteristics of the language used, it is grouped into N classes, from which a model library containing N classes of speech primitives is constructed. The clustering proceeds as follows:

First, prepare the speech primitive feature set. Let $X = \{x_i,\ i = 1, 2, \ldots, n\}$ be the sample set of $n$ speech primitive samples, $c$ the predetermined number of classes, $m_j$ ($j = 1, 2, \ldots, c$) the centre of each cluster, and $\mu_j(x_i)$ the membership of the $i$-th sample in the $j$-th class. The clustering loss function defined in terms of the membership function is given by formula (1):

$$J = \sum_{j=1}^{c} \sum_{i=1}^{n} \left[\mu_j(x_i)\right]^{b} \, \lVert x_i - m_j \rVert^{2} \tag{1}$$

where $b > 1$ is the fuzzy index that controls the clustering result.

Under the chosen definition of membership, the loss function of formula (1) is minimized subject to the requirement that each sample's memberships over all clusters sum to 1, that is:

$$\sum_{j=1}^{c} \mu_j(x_i) = 1, \quad i = 1, 2, \ldots, n \tag{2}$$

Minimizing formula (1) under constraint (2), by setting the partial derivatives of $J$ with respect to $m_j$ and $\mu_j(x_i)$ to zero, gives the necessary conditions:

$$m_j = \frac{\sum_{i=1}^{n} \left[\mu_j(x_i)\right]^{b} x_i}{\sum_{i=1}^{n} \left[\mu_j(x_i)\right]^{b}}, \quad j = 1, 2, \ldots, c \tag{3}$$

$$\mu_j(x_i) = \frac{\left(1 / \lVert x_i - m_j \rVert^{2}\right)^{\frac{1}{b-1}}}{\sum_{k=1}^{c} \left(1 / \lVert x_i - m_k \rVert^{2}\right)^{\frac{1}{b-1}}} \tag{4}$$

Formulas (3) and (4) are solved iteratively. When the algorithm converges, it yields the cluster centre of each class of phonemes and the membership of each sample in each class, completing the fuzzy clustering partition. Each class of speech primitives is then processed further to extract a speech primitive that can represent the class, and the speech primitive model library is built from these.
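The alternating iteration over formulas (3) and (4) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the patent's implementation: the initialization from randomly chosen samples (or from `init_centres`), the stopping tolerance and the handling of a sample that coincides with a centre are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def membership_row(x, centres, b):
    """Formula (4): memberships of one sample in every cluster."""
    dists = [sq_dist(x, m) for m in centres]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):                       # sample sits exactly on a centre
        count = sum(zeros)
        return [1.0 / count if z else 0.0 for z in zeros]
    inv = [(1.0 / d) ** (1.0 / (b - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means(samples, c, b=2.0, max_iter=100, tol=1e-6,
                  init_centres=None, seed=0):
    """Alternate formulas (4) and (3) until the centres stop moving."""
    rng = random.Random(seed)
    centres = [list(m) for m in (init_centres or rng.sample(samples, c))]
    dim = len(samples[0])
    memberships = []
    for _ in range(max_iter):
        memberships = [membership_row(x, centres, b) for x in samples]
        new_centres = []
        for j in range(c):               # formula (3): weighted mean per cluster
            weights = [row[j] ** b for row in memberships]
            total = sum(weights)
            new_centres.append([
                sum(w * x[d] for w, x in zip(weights, samples)) / total
                for d in range(dim)
            ])
        shift = max(sq_dist(p, q) for p, q in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships


# Usage: two well-separated groups of 2-D "feature vectors".
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, memberships = fuzzy_c_means(data, c=2, init_centres=[data[0], data[3]])
```

In the setting described here, each sample would be an MFCC feature vector, and the converged centres would serve as candidates for the representative speech primitives.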

Once the speech primitive model library has been built, incoming continuous speech streams can be analyzed against it. At 104, the acquired voice stream is automatically segmented into speech primitives and, with Mel-frequency cepstral coefficients as the speech signal feature parameters, the features of each speech primitive are extracted:

$$c_n = \sum_{m=0}^{M-1} S_2[m] \cos\left(\frac{2 \pi m n}{2M}\right), \quad n = 0, 1, \ldots, N-1 \tag{5}$$

$$m = \frac{1000 \cdot \ln\left(1 + \frac{f}{700}\right)}{\ln\left(1 + \frac{1000}{700}\right)} \approx 1127 \ln\left(1 + \frac{f}{700}\right) \tag{6}$$

At 105, the best model for the current MFCC features is determined with the following formulas:

$$P(M_i \mid X) = \frac{P(X \mid M_i) \, P(M_i)}{\sum_{j} P(X \mid M_j) \, P(M_j)} \tag{7}$$

$$P(X \mid M_i) = \frac{1}{\sqrt{2\pi} \, |\Sigma|} \exp\left\{ -\frac{1}{2} (X - \mu)^{T} \Sigma^{-1} (X - \mu) \right\} \tag{8}$$

The best model number finally obtained is $n = \arg\max_i \{ P(M_i \mid X) \}$.
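The model selection can be sketched as a maximum-a-posteriori choice among Gaussian models. To keep the example short it assumes diagonal covariances and uses the standard Gaussian log-density, which is an assumption about how formula (8) would be evaluated in practice; the `(mean, variance, prior)` tuple layout is hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
        for xi, mi, v in zip(x, mean, var)
    )


def log_scores(x, models):
    """log P(X | M_i) + log P(M_i) for every model (mean, var, prior)."""
    return [log_gaussian_diag(x, mean, var) + math.log(prior)
            for mean, var, prior in models]


def best_model(x, models):
    """n = argmax_i P(M_i | X); the denominator of formula (7) is common to all i."""
    scores = log_scores(x, models)
    return max(range(len(models)), key=scores.__getitem__)


def posteriors(x, models):
    """Formula (7), computed stably in the log domain."""
    scores = log_scores(x, models)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Usage: two toy models in a 2-D feature space with equal priors.
models = [([0.0, 0.0], [1.0, 1.0], 0.5), ([5.0, 5.0], [1.0, 1.0], 0.5)]
n = best_model([4.5, 5.2], models)
```

Because the denominator in formula (7) does not depend on i, the argmax can be taken over the numerator alone, which is what `best_model` does.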

At 106, the sequence number n of the matching phoneme model, the fundamental frequency and other related information are encoded in a defined format;

At 107, the coded information from 106 is further compressed with a compression algorithm, then packaged and transmitted according to the network protocol;

At 108, the mean, first-order difference and second-order difference of the corresponding model are retrieved according to the best model number n; combined with the information from the preceding N frames, least squares is used to obtain the best spectral envelope features, on the principle of minimizing the total error.

At 109, an excitation source signal with a uniform spectrum is generated according to the fundamental frequency F0 and filtered so that its spectral envelope is the envelope extracted at 104; the resulting speech is the recovered output.

Taking the phoneme as an example, the automatic segmentation, clustering, model library construction and coding and decoding processes are described further below:

After a continuous speech stream has been obtained, it can be analyzed. As shown in Figure 2, the continuous voice stream is first segmented in units of syllables; in Chinese speech, for example, each character is one syllable, so this step in effect cuts out the pronunciation of each character in the continuous speech stream;

After the syllables have been cut out, each syllable is analyzed; if it consists of a single phoneme, that phoneme is stored in the corpus;

If this syllable is not to be made up of single phoneme, then to its further cutting, it is cut into by a plurality of single phonemes constitutes, and deposit these phonemes in corpus;

" based on the automatic segmentation of phoneme in the continuous flow of the mandarin of HMM " with reference to Zheng Hong; If regard the speech data that occurs in the continuous speech stream as a stochastic process; Then voice sequence can be regarded a random series as, and then sets up Markov chain and hidden Markov model (HMM);

For the HMM model distributes integrating instrument, and with the integrating instrument zero clearing;

Acquisition contains the corpus of a large amount of phonemes, voice sequence sample corresponding descriptor number corresponding HMM is coupled together form a combination HMM then;

The forward and backward probabilities of the composite HMM are calculated;

The computed forward and backward probabilities are used to calculate the state occupation probability at each time frame, and the corresponding accumulators are updated;

The above process is carried out on all speech data samples, completing the training on the speech samples;

The accumulator values are used to calculate new parameter estimates for the HMMs;
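The forward, backward, and state occupation calculations in the training steps above can be sketched as follows. This is a minimal illustration for a single short utterance using plain (unscaled) probabilities; a practical implementation would work with scaled or logarithmic values. The function name and the argument layout are assumptions made for the example.

```python
def state_occupation(a, b, init):
    """Return (gamma, total) for one observation sequence.

    a[i][j]: transition probability from state i to state j.
    b[t][j]: output probability of frame t in state j (assumed precomputed).
    init[j]: initial probability of state j.
    gamma[t][j] is the probability of occupying state j at frame t;
    total is the likelihood of the whole sequence.
    """
    T, n = len(b), len(init)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[0.0] * n for _ in range(T)]
    for j in range(n):
        alpha[0][j] = init[j] * b[0][j]
        beta[T - 1][j] = 1.0
    # Forward pass.
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in range(n)) * b[t][j]
    # Backward pass.
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(a[i][j] * b[t + 1][j] * beta[t + 1][j] for j in range(n))
    total = sum(alpha[T - 1])
    gamma = [[alpha[t][j] * beta[t][j] / total for j in range(n)] for t in range(T)]
    return gamma, total
```

The accumulators described above would then be updated by adding, for each state, the gamma-weighted statistics of every frame, and the new parameter estimates would be computed from those sums after all samples are processed.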

A copy of each token held by each HMM state θ_{i} is passed to every adjacent state θ_{j}, and the log probability of the copied token is incremented by log{a_{ij}} + log{b_{j}(O_{i})};

Each successor state examines all tokens passed to it from the preceding states, retains the token with the highest probability, and discards the rest;
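The token-passing procedure described in the two steps above can be sketched as follows. The sketch assumes that the transition and output log probabilities are already available as nested lists, and that each token carries its log probability together with the state path it has followed. The names `token_passing` and `safe_log` are hypothetical.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Logarithm that maps a zero probability to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF

def token_passing(log_a, log_b, log_init):
    """Return (best_log_prob, best_state_path).

    log_a[i][j]: log transition probability from state i to state j.
    log_b[t][j]: log output probability of frame t in state j.
    log_init[j]: log initial probability of state j.
    """
    n = len(log_init)
    # One token per state: (log probability, path history).
    tokens = [(log_init[j] + log_b[0][j], [j]) for j in range(n)]
    for t in range(1, len(log_b)):
        new_tokens = []
        for j in range(n):
            best = (NEG_INF, [])
            for i in range(n):
                score, path = tokens[i]
                # Copy the token from state i to state j and add the increment.
                candidate = score + log_a[i][j] + log_b[t][j]
                if candidate > best[0]:
                    best = (candidate, path + [j])  # keep only the best token
            new_tokens.append(best)
        tokens = new_tokens
    return max(tokens, key=lambda token: token[0])
```

The frames at which the best path moves from one state to another mark candidate segmentation boundaries.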

After the above process, the continuous speech stream can be automatically recognized and segmented, yielding a continuous phoneme sequence.

After the automatic phoneme segmentation above is complete, fuzzy clustering can be performed on the phoneme set, and the number of clusters can be set according to the phoneme makeup of the language concerned. Chinese speech, for example, can be composed of 29 basic phonemes (see "Analysis of Basic Phonemes in Mandarin Speech Recognition" by Huang Zhongwei et al.); therefore, in this embodiment, when the phonemes are clustered, the number of clusters is set to 30 and the fuzzy index b is set to 2. After clustering is complete, the center of each class is taken as the characteristic phoneme of that class:

m_{j} = \frac{\sum_{i=1}^{n} [\mu_{j}(x_{i})]^{b} x_{i}}{\sum_{i=1}^{n} [\mu_{j}(x_{i})]^{b}}, \quad j = 1, 2, \ldots, c
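The class-center formula can be illustrated with a short sketch. Here x_{i} is taken to be the feature vector of the i-th phoneme sample and μ_{j}(x_{i}) its membership in class j, which is the usual reading in fuzzy c-means clustering; the function name and the list-based data layout are assumptions made for the example, and the membership update step is not shown.

```python
def fcm_centers(points, memberships, b=2.0):
    """Compute fuzzy class centers m_j.

    points: list of n feature vectors x_i (lists of floats).
    memberships: memberships[j][i] is mu_j(x_i) for class j and sample i.
    b: fuzzy index (2 in the embodiment).
    """
    dim = len(points[0])
    centers = []
    for mu_j in memberships:
        weights = [m ** b for m in mu_j]   # [mu_j(x_i)]^b
        total = sum(weights)               # denominator of the formula
        center = [sum(w * x[d] for w, x in zip(weights, points)) / total
                  for d in range(dim)]     # weighted mean of the samples
        centers.append(center)
    return centers
```

For instance, two one-dimensional samples at 0.0 and 2.0 with equal membership in a single class give a center at 1.0.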

A speech primitive model base consisting of 30 phonemes can thus be generated; its structure is as follows:

Speech primitive number | Speech primitive | Speech primitive fundamental frequency | Speech primitive waveform

Mel-frequency cepstral coefficients are used to extract the spectral envelope features of each phoneme in the received continuous speech stream, and these features are matched against the waveforms of the speech primitives in the speech primitive model base, thereby obtaining the number of the current phoneme.
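The matching step can be sketched as a nearest-template search. The sketch assumes that MFCC feature vectors have already been extracted, that the model base holds one feature template per primitive number, and that squared Euclidean distance is the matching measure; none of these details is fixed by the description, and the function name is hypothetical.

```python
def match_primitive(features, model_bank):
    """Return the number of the model-base primitive closest to `features`.

    features: feature vector of the current phoneme (assumed precomputed MFCCs).
    model_bank: dict mapping primitive number -> template feature vector.
    """
    best_number, best_dist = None, float("inf")
    for number, template in model_bank.items():
        # Squared Euclidean distance between the phoneme and the template.
        dist = sum((f - t) ** 2 for f, t in zip(features, template))
        if dist < best_dist:
            best_number, best_dist = number, dist
    return best_number
```

The returned number is the code that stands in for the phoneme in the transmitted stream.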

The successively obtained phoneme numbers and phoneme fundamental frequencies are encoded and may be further compressed by a compression algorithm such as the LZW data compression algorithm; the compressed data packet is then transmitted to the destination over a network or a telephone communication network.
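The encoding and compression step can be sketched as follows. The packet layout (one byte for the phoneme number followed by two bytes for F0 in Hz) is a hypothetical choice made for illustration, and the LZW routines are a textbook variant with an unbounded dictionary that returns a list of integer codes rather than a packed bit stream.

```python
import struct

def pack_frames(frames):
    """Pack (phoneme_number, f0_hz) pairs into bytes (hypothetical layout)."""
    return b"".join(struct.pack(">BH", number, f0_hz) for number, f0_hz in frames)

def lzw_compress(data):
    """Compress bytes into a list of LZW codes."""
    table = {bytes([i]): i for i in range(256)}
    current, codes = b"", []
    for byte in data:
        extended = current + bytes([byte])
        if extended in table:
            current = extended
        else:
            codes.append(table[current])
            table[extended] = len(table)   # learn the new sequence
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes

def lzw_decompress(codes):
    """Invert lzw_compress."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):           # code refers to the entry being built
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)
```

Because speech contains many repeated phoneme and pitch values, the packed stream tends to be repetitive, which is the situation in which dictionary-based compression is effective.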

After the receiving end receives and decompresses the data packet, it takes out the phoneme number sequence from the packet and, according to the best-model number n, retrieves the mean, first-order difference, and second-order difference of the corresponding model; combined with information from the preceding N frames, the least squares method is applied, with minimizing the sum of deviations as the criterion, to obtain the best spectral envelope features.

Finally, an excitation source signal with a flat spectrum is generated according to the fundamental frequency F0, and this signal is filtered so that its spectral envelope becomes the envelope extracted at 104, thereby restoring the speech.

The above discloses only one specific embodiment of the present invention; however, the present invention is not limited thereto, and any variation designed according to the method described in the summary of the invention of this patent shall fall within the scope of protection of the present invention.

Claims

1. A coding and synthesis system for speech primitives, characterized by comprising the following modules: a pre-processing module, a speech coding module, and a speech decoding module;

2. The speech coding and synthesis system based on speech primitives according to claim 1, characterized by comprising a speech transmitting end and a speech receiving end;