CN101167125A

CN101167125A - Method and apparatus for phase matching frames in vocoders

Info

Publication number: CN101167125A
Application number: CNA2006800144603A
Authority: CN
Inventors: Rohit Kapoor; Serafin Diaz Spindola
Original assignee: Qualcomm Inc
Current assignee: Qualcomm Inc
Priority date: 2005-03-11
Filing date: 2006-03-13
Publication date: 2008-04-23
Anticipated expiration: 2026-03-13
Also published as: CN101171626A; UA90506C2; CN101171626B; CN101167125B

Abstract

In one embodiment, the present invention comprises a vocoder having at least one input and at least one output, an encoder comprising a filter having at least one input operably connected to the input of the vocoder and at least one output, a decoder comprising a synthesizer having at least one input operably connected to the at least one output of the encoder, and at least one output operably connected to the at least one output of the vocoder, wherein the encoder comprises a memory and the encoder is adapted to execute instructions stored in the memory comprising classifying speech segments and encoding speech segments, and the decoder comprises a memory and the decoder is adapted to execute instructions stored in the memory comprising time-warping a residual speech signal to an expanded or compressed version of the residual speech signal.

Description

Method and apparatus for phase matching frames in vocoders

Claim of priority under 35 U.S.C. § 119

The present application claims the benefit of U.S. Provisional Application No. 60/662,736, entitled "Method and Apparatus for Phase Matching Frames in Vocoders," filed March 16, 2005, and U.S. Provisional Application No. 60/660,824, entitled "Time Warping Frames Inside the Vocoder by Modifying the Residual," filed March 11, 2005, the entire disclosures of which are considered part of the disclosure of this application and are incorporated herein by reference.

Technical field

In general, the present invention relates to a method of correcting artifacts induced in voice decoders. In a packet-switched system, a de-jitter buffer is used to store frames and subsequently deliver them in sequence. The de-jitter buffer method may at times insert erasures between two frames with consecutive sequence numbers. In some cases this causes one or more erasures to be inserted between two consecutive frames, and in other cases it causes some frames to be skipped, leaving the encoder and decoder out of sync in phase. As a result, artifacts may be introduced into the decoder output signal.

Background art

The present invention comprises an apparatus and method for preventing or minimizing artifacts in decoded speech when a frame is decoded after the decoding of one or more erasures.

Summary of the invention

In view of the above, the described features of the present invention generally relate to one or more improved systems, methods and/or apparatuses for communicating speech.

In one embodiment, the present invention comprises a method of minimizing artifacts in speech, comprising the step of phase matching a frame.

In another embodiment, the step of phase matching a frame comprises changing the number of speech samples of the frame to match the phase of the encoder and the decoder.

In another embodiment, the present invention comprises the step of time-warping a frame to increase the number of speech samples of the frame if the step of phase matching has decreased the number of speech samples.

In another embodiment, the speech is encoded using code-excited linear prediction, and the step of time-warping comprises: estimating the pitch delay; dividing a speech frame into pitch periods, wherein the boundaries of the pitch periods are determined using the pitch delay at various points in the speech frame; and adding pitch periods if the speech residual signal is to be expanded.

In another embodiment, the speech is encoded using prototype pitch period coding, and the step of time-warping comprises: estimating at least one pitch period; interpolating the at least one pitch period; and adding the at least one pitch period when expanding the residual speech signal.

In another embodiment, the present invention comprises: a vocoder having at least one input and at least one output; an encoder comprising a filter having at least one input operably connected to the input of the vocoder and at least one output; and a decoder comprising a synthesizer having at least one input operably connected to the at least one output of the encoder and at least one output operably connected to the at least one output of the vocoder, wherein the decoder comprises a memory and the decoder is adapted to execute instructions stored in the memory, the instructions comprising phase matching and time-warping a speech frame.

Further scope of applicability of the present invention will become apparent from the following description, claims, and drawings. It should be understood, however, that the description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will be readily apparent to those skilled in the art.

Description of drawings

The present invention will be more fully understood from the detailed description given below, the appended claims, and the accompanying drawings, in which:

Fig. 1 is a plot of three consecutive voice frames showing continuity in the signal;

Fig. 2A illustrates a frame being repeated after its erasure;

Fig. 2B illustrates a discontinuity in phase, shown as point D, caused by the repetition of a frame after its erasure;

Fig. 3 illustrates combining ACB and FCB information to create a CELP-decoded frame;

Fig. 4A depicts FCB impulses inserted at the correct phase;

Fig. 4B depicts FCB impulses inserted at an incorrect phase because the frame is repeated after its erasure;

Fig. 4C illustrates shifting the FCB impulses so that they are inserted at the correct phase;

Fig. 5A illustrates how PPP extends the signal of the previous frame to create 160 more samples;

Fig. 5B illustrates that the finishing phase of the current frame is incorrect because of an erased frame;

Fig. 5C depicts an embodiment in which a smaller number of samples is generated from the current frame so that the current frame finishes at phase ph2=ph1;

Fig. 6 illustrates warping frame 6 to fill the erasure of frame 5;

Fig. 7 illustrates the phase difference between the end of frame 4 and the beginning of frame 6;

Fig. 8 illustrates an embodiment in which the decoder plays an erasure after decoding frame 4 and is then ready to decode frame 5;

Fig. 9 illustrates an embodiment in which the decoder plays one erasure after decoding frame 4 and is then ready to decode frame 6;

Fig. 10 illustrates an embodiment in which the decoder decodes two erasures after decoding frame 4 and is then ready to decode frame 5;

Fig. 11 illustrates an embodiment in which the decoder decodes two erasures after decoding frame 4 and is then ready to decode frame 6;

Fig. 12 illustrates an embodiment in which the decoder decodes two erasures after decoding frame 4 and is then ready to decode frame 7;

Fig. 13 illustrates warping frame 7 to fill the erasure of frame 6;

Fig. 14 illustrates converting a double erasure for missing packets 5 and 6 into a single erasure;

Fig. 15 is a block diagram of one embodiment of a linear predictive coding (LPC) vocoder used by the present method and apparatus;

Fig. 16A is a speech signal containing voiced speech;

Fig. 16B is a speech signal containing unvoiced speech;

Fig. 16C is a speech signal containing transient speech;

Fig. 17 is a block diagram illustrating LPC filtering of speech followed by encoding of the residual;

Fig. 18A is a plot of original speech;

Fig. 18B is a plot of the residual speech signal after LPC filtering;

Fig. 19 illustrates the generation of waveforms using interpolation between the previous and current prototype pitch periods;

Fig. 20A depicts determining pitch delays through interpolation;

Fig. 20B depicts identifying pitch periods;

Fig. 21A represents an original speech signal in the form of pitch periods;

Fig. 21B represents a speech signal expanded using overlap-add;

Fig. 21C represents a speech signal compressed using overlap-add;

Fig. 21D represents how weighting is used to compress the residual signal;

Fig. 21E represents a speech signal compressed without using overlap-add;

Fig. 21F represents how weighting is used to expand the residual signal;

Fig. 22 contains two equations used in the overlap-add method; and

Fig. 23 is a logic block diagram of a means 213 for phase matching and a means 214 for time warping.

Detailed description

Section I: Removing artifacts

The word "illustrative" is used herein to mean "serving as an example, instance, or illustration." Any embodiment described herein as "illustrative" is not necessarily to be construed as preferred or advantageous over other embodiments.

The present method and apparatus use phase matching to correct discontinuities in the decoded signal when the encoder and decoder may be out of sync in signal phase. The method and apparatus also use phase-matched future frames to conceal erasures. The benefit is particularly significant in the case of double erasures, which are known to cause appreciable degradation in voice quality.

Speech artifact caused by repeating a frame after its erased version

It is desirable to maintain the phase continuity of the signal from one voice frame 20 to the next voice frame 20. To maintain this continuity, voice decoders 206 generally receive frames in sequence. Fig. 1 shows an example of this.

In a packet-switched system, the voice decoder 206 uses a de-jitter buffer 209 to store voice frames and subsequently deliver them in sequence. If a frame is not received by its playback time, the de-jitter buffer 209 may insert an erasure 240 in place of the missing frame 20 between two frames 20 with consecutive sequence numbers. Thus, the receiver 202 can substitute an erasure 240 when a frame 20 is expected but not received.

An example of this is shown in Fig. 2A. In Fig. 2A, the previous frame 20 sent to the voice decoder 206 was frame 4. Frame 5 was the next frame to be sent to the decoder 206, but it was not present in the de-jitter buffer 209. Consequently, an erasure 240 was sent to the decoder 206 in place of frame 5. Thus, because no frame 20 was available after frame 4, an erasure 240 is played. Afterwards, the de-jitter buffer 209 received frame 5 and sent it to the decoder 206 as the next frame 20.
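The delivery behavior in Fig. 2A can be illustrated with a simplified playout model. The sketch below is only an illustration under stated assumptions: the class `DeJitterBuffer`, its methods, and the `ERASURE` marker are hypothetical names, and a real de-jitter buffer 209 would also manage timing and buffer depth.

```python
ERASURE = "erasure"  # hypothetical marker for a concealment frame


class DeJitterBuffer:
    """Simplified playout model: one delivery per playback tick."""

    def __init__(self):
        self.frames = set()      # sequence numbers currently buffered
        self.last_played = None  # last real frame handed to the decoder

    def receive(self, seq):
        # Ignore frames older than what has already been played.
        if self.last_played is None or seq > self.last_played:
            self.frames.add(seq)

    def play(self):
        # Deliver the oldest buffered frame, or an erasure if none is there.
        if not self.frames:
            return ERASURE
        seq = min(self.frames)
        self.frames.remove(seq)
        self.last_played = seq
        return seq


buf = DeJitterBuffer()
buf.receive(4)
played = [buf.play()]      # frame 4 is available
played.append(buf.play())  # frame 5 has not arrived: an erasure is played
buf.receive(5)             # frame 5 arrives late
played.append(buf.play())  # frame 5 is decoded after its own erasure
print(played)              # [4, 'erasure', 5]
```

In this model the decoder sees frame 4, an erasure, and then frame 5, which is the sequence that produces the phase discontinuity discussed next.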

However, the phase at the end of the erasure 240 is in general different from the phase at the end of frame 4. Consequently, decoding frame 5 after the erasure 240, as opposed to after frame 4, can cause a discontinuity in phase, shown as point D in Fig. 2B. In essence, when the decoder 206 constructs the erasure 240 (after frame 4), it extends the waveform by 160 pulse code modulation (PCM) samples, assuming in this embodiment that there are 160 PCM samples per speech frame. Each speech frame 20 therefore changes the phase by 160 PCM samples divided by the pitch period, where pitch is the fundamental frequency of the speaker's voice. The pitch period 100 may vary from approximately 30 PCM samples for a high-pitched female voice to 120 PCM samples for a male voice. In one example, if the phase at the end of frame 4 is labeled phase1 and the pitch period 100 is labeled PP (the pitch period is assumed not to change by much; if it is changing, the pitch period in Equation 1 can be replaced by the average pitch period), then the phase in radians at the end of the erasure 240, phase2, is equal to:

phase2 = phase1 + (160 / PP) × 2π    (Equation 1)

where phase1 is in radians and speech frames have 160 PCM samples. If 160 is a multiple of the pitch period 100, then the phase at the end of the erasure 240, phase2, is equal to phase1.

However, if 160 is not a multiple of PP, phase2 is not equal to phase1. This means that the encoder 204 and the decoder 206 may be out of sync in phase.

Another way to describe this phase relationship is through the modulo arithmetic shown in the following equation, where "mod" denotes the modulo operation. Modulo arithmetic is a system of arithmetic for integers in which numbers wrap around after they reach a certain value (the modulus). Using modulo arithmetic, the phase in radians at the end of the erasure 240, phase2, is equal to:

phase2 = (phase1 + ((160 mod PP) / PP) × 2π) mod 2π    (Equation 2)

For example, when the pitch period 100 is PP = 50 PCM samples and the frame has 160 PCM samples, phase2 = phase1 + ((160 mod 50) / 50) × 2π = phase1 + (10 / 50) × 2π. (160 mod 50 = 10 because 10 is the remainder after dividing 160 by the modulus 50. That is, every time a multiple of 50 is reached, the number wraps around, leaving a remainder of 10.) This means that the phase difference between the end of frame 4 and the beginning of frame 5 is 0.4π radians.
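The modulo relationship of Equation 2 can be checked numerically. The following is a minimal sketch, assuming a constant pitch period over the erasure; the function name `phase_after_erasure` is hypothetical and chosen only for illustration.

```python
import math

FRAME_SAMPLES = 160  # PCM samples per speech frame, as assumed in the text


def phase_after_erasure(phase1, pitch_period, frame_samples=FRAME_SAMPLES):
    """Phase in radians at the end of one erasure, following Equation 2."""
    advance = (frame_samples % pitch_period) / pitch_period * 2 * math.pi
    return (phase1 + advance) % (2 * math.pi)


# Worked example from the text: PP = 50, so 160 mod 50 = 10.
phase2 = phase_after_erasure(0.0, 50)
print(phase2 / math.pi)  # close to 0.4, i.e. a phase difference of 0.4*pi
```

When the frame length is an exact multiple of the pitch period (for example PP = 40), the advance is zero and phase2 equals phase1.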

Turn back to Fig. 2 B, frame 5 is encoded, supposes that its phase place begins in frame 4 phase place ends, that is, have the beginning phase place of phase place 1.But shown in Fig. 2 B, demoder 206 will can not come decoded frame 5 (to note that encoder/decoder has the storer that is used for compressed voice signal with the beginning phase place of phase place 2; The phase place of described encoder/decoder is the phase place of described encoder/decoder place storer).This can cause such as illusions such as click sound, detonans in voice signal.The character of described illusion depends on the type of institute's use vocoder 70.For example, phase place is interrupted and can be introduced sound as the metal a little in described interruptions.

In Fig. 2 B, there is the people perhaps can say, wiping 240 after structure is with replacement frame 5, de-jitter buffer 209 just can need not transmit frame 5 to demoder 206.The numbering of the traceable frame 20 of described de-jitter buffer also guarantees that frame 20 sends according to correct order.Yet this type of frame 20 is sent to demoder 206 two advantages.Usually, wipe 240 reconstruct in demoder 206 and imperfect.Voiced frame 20 can comprise one section and fail by the voice segments of wiping 240 perfect reconstructions.Therefore, play frame 5 and can guarantee that voice segments 110 can not lose.In addition, if this frame 20 is not sent to demoder 206, then might there be next frame 20 in the de-jitter buffer 209.This can cause another to wipe 240 and cause two wiping 240 (that is, two continuous wipe 24).Because a plurality ofly wipe 240 and wipe 240 and cause bigger degrading quality, so this is a problem than single.

As mentioned above, frame 20 can be decoded after it wipes the version decoding immediately, thereby causes scrambler 204 asynchronous with demoder 206 phase places.The inventive method and equipment are attempted to correct owing to the scrambler 204 asynchronous little illusions of introducing in the voice decoder 206 with demoder 206 phase places.

Phase matching

The phase matching technique described in this section can be used to bring the decoder memory 207 into sync with the encoder memory 205. As representative examples, the present method and apparatus may be used with either a code-excited linear prediction (CELP) vocoder 70 or a prototype pitch period (PPP) vocoder 70. Note that the use of phase matching in the context of CELP or PPP vocoders is presented only as an example; phase matching may similarly be applied to other vocoders. Before the solution is presented in the context of specific CELP or PPP vocoder 70 embodiments, the phase matching method itself is described. The discontinuity caused by the erasure 240, as shown in Fig. 2B, can be fixed by decoding the frame 20 that follows the erasure 240 (i.e., frame 5 in Fig. 2B) not at its beginning but at a certain offset from the beginning of the frame 20. That is, the first few samples of the frame 20 (i.e., the information for those samples) are discarded, so that the first sample after discarding has the same phase offset 136 as the end of the erasure 240 that follows the preceding frame 20 (i.e., frame 4 in Fig. 2B). This method is applied in slightly different ways to CELP and PPP decoders 206, as described further below.

CELP vocoder

A CELP-encoded voice frame 20 contains two different kinds of information that are combined to create the decoded PCM samples: voiced (the periodic part) and unvoiced (the non-periodic part). The voiced part consists of an adaptive codebook (ACB) 210 and its gain. This part, combined with the pitch period 100, can be used to extend the ACB memory of the previous frame 20 with the appropriate ACB 210 gain applied. The unvoiced part consists of a fixed codebook (FCB) 220, which is information about impulses to be applied to the signal 10 at various points. Fig. 3 shows how an ACB 210 and an FCB 220 can be combined to create the CELP-decoded frame. The ACB memory 212 is plotted to the left of the dotted line in Fig. 3. To the right of the dotted line, the ACB component of the signal, extended using the ACB memory 212, is plotted together with the FCB impulses 222 for the current decoded frame 22.

If the phase of the last sample of the previous frame 20 differs from the phase of the first sample of the current frame 20 (as in the case under consideration), the ACB 210 and the FCB 220 will be mismatched (i.e., there is a phase discontinuity), where the previous frame 24 is frame 4 and the current frame 22 is frame 5. This is shown in Fig. 4B, where at point B the FCB impulses 222 are inserted at incorrect phases. The mismatch between the FCB 220 and the ACB 210 means that the FCB 220 impulses 222 are applied to the signal 10 at the wrong phases. This leads to a metallic kind of sound (i.e., an artifact) when the signal 10 is decoded. Note that Fig. 4A shows the case in which the FCB 220 and the ACB 210 match, that is, the phase of the last sample of the previous frame 24 is the same as the phase of the first sample of the current frame 22.

Solution

To solve this problem, the present phase matching method matches the FCB 220 with the appropriate phase in the signal 10. The steps of the method comprise:

finding the number of samples, ΔN, in the current frame 22 after which the phase is similar to the phase at which the previous frame 24 ended; and

shifting the FCB indices by ΔN samples so that the ACB 210 and the FCB 220 are matched.

The result of these two steps is shown at point C in Fig. 4C, where the FCB impulses 222 are shifted and inserted at the correct phases.
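The two steps above can be sketched as follows. This is a simplified illustration rather than the codec's actual procedure: it assumes a constant pitch period within the frame, represents the FCB as a list of (position, amplitude) pairs, and uses the hypothetical names `find_delta_n` and `shift_fcb_pulses`.

```python
import math


def find_delta_n(frame_start_phase, target_phase, pitch_period):
    """Step 1: sample offset whose phase is closest to target_phase (radians)."""
    def circular_distance(a, b):
        d = abs(a - b) % (2 * math.pi)
        return min(d, 2 * math.pi - d)

    return min(
        range(pitch_period),
        key=lambda n: circular_distance(
            frame_start_phase + 2 * math.pi * n / pitch_period, target_phase
        ),
    )


def shift_fcb_pulses(pulses, delta_n):
    """Step 2: discard pulses before delta_n and shift the rest to the frame start."""
    return [(pos - delta_n, amp) for pos, amp in pulses if pos >= delta_n]


# Frame 5 was encoded to start at phase 0; the decoder sits at 0.4*pi after the erasure.
delta_n = find_delta_n(0.0, 0.4 * math.pi, pitch_period=50)
pulses = [(3, 1.0), (27, -1.0), (81, 1.0), (140, -1.0)]  # made-up FCB impulses
shifted = shift_fcb_pulses(pulses, delta_n)
print(delta_n, shifted, 160 - delta_n)
```

With these made-up values, ten samples are discarded, the remaining impulses move ten positions earlier, and the frame yields 150 samples instead of 160.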

The above method may yield fewer than 160 samples for the frame 20, because the first few FCB 220 indices have been discarded. The samples can then be time-warped to create a larger number of samples, that is, expanded either outside the decoder 206 or inside the decoder 206 using the methods disclosed in the provisional patent application "Time Warping Frames inside the Vocoder by Modifying the Residual," filed March 11, 2005.

Prototype pitch period (PPP) vocoder

A PPP-encoded frame 20 contains information for extending the signal of the previous frame 20 by 160 samples by interpolating between the previous frame 24 and the current frame 22. The main difference between CELP and PPP is that PPP encodes only periodic information.

Fig. 5A shows how PPP extends the signal of the previous frame 24 to create 160 more samples. In Fig. 5A, the current frame 22 finishes at phase ph1. As shown in Fig. 5B, the previous frame 24 is followed by an erasure 240 and then by the current frame 22. If the starting phase of the current frame 22 is incorrect (as in the case shown in Fig. 5B), the current frame 22 will end at a phase different from the one shown in Fig. 5A. In Fig. 5B, because the frame 20 is played after its erasure 240, the current frame 22 finishes at phase ph2≠ph1. This causes a discontinuity with the frame 20 that follows the current frame 22, because that next frame 20 was encoded on the assumption that the finishing phase of the current frame 22 is equal to ph1, as in Fig. 5A.

Solution

This problem can be corrected by generating N=160-x samples from the current frame 22, so that the phase at the end of the current frame 22 matches the phase at the end of the previous erasure-reconstructed frame 240. This is shown in Fig. 5C (assuming a frame length of 160 PCM samples), where a smaller number of samples is generated from the current frame 22 so that the current frame 22 finishes at phase ph2=ph1. In effect, x samples are removed from the end of the current frame 22.

If it is desirable to prevent the number of samples from being less than 160, N=160-x+PP samples can be generated from the current frame 22, again assuming 160 PCM samples per frame. Generating a variable number of samples from a PPP decoder 206 is straightforward, because the synthesis process simply extends or interpolates the previous signal 10.
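The sample counts N=160-x and N=160-x+PP can be checked with a short sketch. It assumes a constant pitch period, expresses phase as a fraction of a pitch cycle, and takes x to be the number of samples by which the decoder is ahead after one erasure; the function names are hypothetical.

```python
FRAME_SAMPLES = 160  # PCM samples per frame, as assumed in the text


def ppp_sample_count(x, pitch_period, keep_full_length=False):
    """Number of samples N to synthesize from the current PPP frame."""
    n = FRAME_SAMPLES - x
    if keep_full_length and n < FRAME_SAMPLES:
        n += pitch_period  # a whole extra pitch period leaves the end phase unchanged
    return n


def end_phase(start_phase, n_samples, pitch_period):
    """Phase (fraction of a pitch cycle) after n_samples at a constant pitch."""
    return (start_phase + n_samples / pitch_period) % 1.0


pp = 50        # hypothetical pitch period in PCM samples
x = 160 % pp   # decoder is 10 samples ahead after one erasure
intended = end_phase(0.0, FRAME_SAMPLES, pp)                  # end phase the encoder assumed
short = end_phase(x / pp, ppp_sample_count(x, pp), pp)        # N = 160 - x
full = end_phase(x / pp, ppp_sample_count(x, pp, True), pp)   # N = 160 - x + PP
print(ppp_sample_count(x, pp), ppp_sample_count(x, pp, True))  # 150 200
```

In both cases the frame ends at the phase the encoder assumed, so the frame that follows can be decoded without a discontinuity.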

Concealing erasures using phase matching and warping

In data networks such as EV-DO, voice frames 20 may at times be dropped (at the physical layer) or severely delayed, causing the de-jitter buffer 209 to feed erasures 240 to the decoder 206. Even though vocoders 70 typically use erasure concealment methods, the degradation in voice quality, particularly at high erasure rates, can be quite noticeable. Significant degradation is observed especially when multiple consecutive erasures 240 occur, because vocoder 70 erasure 240 concealment methods typically tend to "fade" the voice signal 10 in that situation.

The de-jitter buffer 209 is used in data networks such as EV-DO to remove jitter from the arrival times of voice frames 20 and to present a streamlined input to the decoder 206. The de-jitter buffer 209 works by buffering some frames 20 and then providing them to the decoder 206 in a jitter-free manner. This presents an opportunity to enhance the erasure 240 concealment method at the decoder 206, because at times some "future" frames 26 (relative to the "current" frame 22 being decoded) are present in the de-jitter buffer 209. Thus, if a frame 20 needs to be erased (because it was dropped at the physical layer or arrived too late), the decoder 206 can use a future frame 26 to perform better erasure 240 concealment.

Information from the future frame 26 can be used to conceal erasures 240. In one embodiment, the present method and apparatus comprise time-warping (expanding) the future frame 26 to fill the "hole" created by the erased frame 20, and phase matching the future frame 26 to ensure a continuous signal 10. Consider the situation shown in Fig. 6, in which voice frame 4 has been decoded. The current voice frame 5 is not available at the de-jitter buffer 209, but the next voice frame 6 is present. Instead of playing an erasure 240, the decoder 206 can warp voice frame 6 to conceal frame 5. That is, frame 6 is decoded and time-warped to fill the space of frame 5. This is shown as reference numeral 28 in Fig. 6.

This involves the following two steps:

1) Matching the phase: The end of a voice frame 20 leaves the voice signal 10 at a particular phase. As shown in Fig. 7, the phase at the end of frame 4 is ph1. Voice frame 6 has been encoded with a beginning phase of ph2, which is essentially the phase at the end of voice frame 5; in general, ph1≠ph2. Thus, the decoding of frame 6 needs to start at an offset such that the beginning phase becomes equal to ph1.

To match the beginning phase ph2 of frame 6 to the finishing phase ph1 of frame 4, the first few samples of frame 6 are discarded so that the first sample after discarding has the same phase offset 136 as the end of frame 4. The method for performing this phase matching was described earlier, together with how phase matching is applied to CELP and PPP vocoders 70.

2) Time-warping (expanding) the frame: Once frame 6 has been phase matched with frame 4, frame 6 is warped to produce the samples that fill the "hole" of frame 5 (i.e., to produce close to 320 PCM samples). The time-warping methods for CELP and PPP vocoders 70 described later may be used to time-warp the frames 20.

In one embodiment of phase matching, the de-jitter buffer 209 keeps track of two variables: phase offset 136 and run length 138. The phase offset 136 is equal to the difference between the number of frames the decoder 206 has decoded and the number of frames the encoder 204 has encoded, counted from the last frame that was not decoded as an erasure. The run length 138 is defined as the number of consecutive erasures 240 the decoder 206 has decoded immediately before decoding the current frame 22. These two variables are passed as inputs to the decoder 206.
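The bookkeeping for these two variables can be sketched as follows. This is an illustration under the definitions above, not the de-jitter buffer's actual implementation: the class `PhaseTracker` and the helper `scenario` are hypothetical, and the number of frames the encoder has encoded is inferred from the gap in sequence numbers.

```python
class PhaseTracker:
    """Tracks phase offset 136 and run length 138 between real frames."""

    def __init__(self):
        self.last_frame = None  # sequence number of the last real frame decoded
        self.run_length = 0     # consecutive erasures since that frame

    def on_erasure(self):
        self.run_length += 1

    def on_frame(self, seq):
        """Return (phase_offset, run_length) to pass to the decoder for frame seq."""
        if self.last_frame is None:
            result = (0, 0)
        else:
            skipped = seq - self.last_frame - 1  # frames encoded but never decoded
            result = (self.run_length - skipped, self.run_length)
        self.last_frame = seq
        self.run_length = 0
        return result


def scenario(erasures, next_frame, last_good=4):
    tracker = PhaseTracker()
    tracker.on_frame(last_good)
    for _ in range(erasures):
        tracker.on_erasure()
    return tracker.on_frame(next_frame)


print(scenario(1, 5))  # (1, 1): one erasure, then frame 5
print(scenario(1, 7))  # (-1, 1): one erasure standing in for frames 5 and 6
```

A positive phase offset means the decoder has produced more frames than the encoder encoded, and a negative one means a single erasure replaced more than one encoded frame.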

Fig. 8 illustrates an embodiment in which the decoder 206 plays an erasure 240 after decoding packet 4. After the erasure 240, it is ready to decode packet 5. Assume that the phases of the encoder 204 and the decoder 206 were in sync at the end of packet 4, with phase equal to Phase_Start. In addition, throughout the rest of this document it is assumed that the vocoder produces 160 samples per frame (including erased frames).

The states of the encoder 204 and the decoder 206 are shown in Fig. 8. The phase of the encoder 204 at the beginning of packet 5 is Enc_Phase = Phase_Start. The phase of the decoder 206 at the beginning of packet 5 is Dec_Phase = Phase_Start + (160 mod Delay(4)) / Delay(4), where there are 160 samples per frame, Delay(4) is the pitch delay of frame 4 (in PCM samples), and the erasure 240 is assumed to have a pitch delay equal to the pitch delay of frame 4. The phase offset (136) = 1, and the run length (138) = 1.

In another embodiment, shown in Fig. 9, the decoder 206 plays an erasure 240 after decoding frame 4. After the erasure 240, it is ready to decode frame 6. Assume that the phases of the encoder 204 and the decoder 206 were in sync at the end of frame 4, with phase equal to Phase_Start. The states of the encoder 204 and the decoder 206 are shown in Fig. 9. In this embodiment, the phase of the encoder 204 at the beginning of packet 6 is Enc_Phase = Phase_Start + (160 mod Delay(5)) / Delay(5).

The phase of the decoder at the beginning of packet 6 is Dec_Phase = Phase_Start + (160 mod Delay(4)) / Delay(4), where there are 160 samples per frame, Delay(4) is the pitch delay of frame 4 (in PCM samples), and the erasure 240 is assumed to have a pitch delay equal to the pitch delay of frame 4. In this case, the phase offset (136) = 0, and the run length (138) = 1.

In another embodiment, shown in Fig. 10, the decoder 206 decodes two erasures 240 after decoding frame 4. After the erasures 240, it is ready to decode frame 5. Assume that the phases of the encoder 204 and the decoder 206 were in sync at the end of frame 4, with phase equal to Phase_Start.

The states of the encoder 204 and the decoder 206 are shown in Fig. 10. In this case, the phase of the encoder 204 at the beginning of frame 5 is Enc_Phase = Phase_Start. The phase of the decoder 206 at the beginning of frame 5 is Dec_Phase = Phase_Start + ((160 mod Delay(4)) × 2) / Delay(4), where each erasure 240 is assumed to have the same delay as frame 4. In this case, the phase offset (136) = 2, and the run length (138) = 2.

In another embodiment, shown in Fig. 11, the decoder 206 decodes two erasures 240 after decoding frame 4. After the erasures 240, it is ready to decode frame 6. Assume that the phases of the encoder 204 and the decoder 206 were in sync at the end of frame 4, with phase equal to Phase_Start. The states of the encoder 204 and the decoder 206 are shown in Fig. 11.

In this case, the phase of the encoder 204 at the beginning of frame 6 is Enc_Phase = Phase_Start + (160 mod Delay(5)) / Delay(5).

The phase of the decoder 206 at the beginning of frame 6 is Dec_Phase = Phase_Start + ((160 mod Delay(4)) × 2) / Delay(4), where each erasure 240 is assumed to have the same delay as frame 4. Thus, the total delay caused by the two erasures 240 (one for missing frame 4 and another for missing frame 5) equals 2 times Delay(4). In this case, the phase offset (136) = 1, and the run length (138) = 2.

In another embodiment, shown in Fig. 12, the decoder 206 decodes two erasures 240 after decoding frame 4. After the erasures 240, it is ready to decode frame 7. Assume that the phases of the encoder 204 and the decoder 206 were in sync at the end of frame 4, with phase equal to Phase_Start. The states of the encoder 204 and the decoder 206 are shown in Fig. 12.

In this case, the phase of the encoder 204 at the beginning of frame 7 is Enc_Phase = Phase_Start + ((160 mod Delay(5)) / Delay(5) + (160 mod Delay(6)) / Delay(6)).

The phase of the decoder 206 at the beginning of frame 7 is Dec_Phase = Phase_Start + ((160 mod Delay(4)) × 2) / Delay(4). In this case, the phase offset (136) = 0, and the run length (138) = 2.
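The Enc_Phase and Dec_Phase expressions used in these scenarios follow one pattern, which the sketch below makes explicit. The pitch delays in the `delay` table are made-up values, the function names are hypothetical, and phases are expressed as fractions of a pitch cycle relative to Phase_Start, as in the expressions above.

```python
FRAME_SAMPLES = 160  # samples per frame, including erased frames


def encoder_phase(phase_start, skipped_delays):
    """Enc_Phase: add (160 mod Delay(k)) / Delay(k) for every frame k that the
    encoder encoded after the last frame the decoder actually decoded."""
    phase = phase_start
    for delay in skipped_delays:
        phase += (FRAME_SAMPLES % delay) / delay
    return phase


def decoder_phase(phase_start, delay_last_frame, run_length):
    """Dec_Phase: every erasure reuses the pitch delay of the last decoded frame."""
    return phase_start + ((FRAME_SAMPLES % delay_last_frame) * run_length) / delay_last_frame


delay = {4: 50, 5: 60, 6: 70}  # hypothetical pitch delays in PCM samples

# Two erasures after frame 4, then frame 6 (the Fig. 11 case):
enc = encoder_phase(0.0, [delay[5]])   # 40 / 60
dec = decoder_phase(0.0, delay[4], 2)  # (10 * 2) / 50
print(enc, dec)
```

The other scenarios differ only in the list of skipped frames and in the run length passed to `decoder_phase`.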

Hide two wiping

Knownly two hide 240 and can cause more significantly sound quality degradation than single erase 240.Can use previous described same procedure to correct by two 240 phase places that cause of wiping interrupts.Please consider Figure 13, wherein voiced frame 4 has been decoded and frame 5 has been wiped free of.In Figure 13, that uses that warped frame 7 comes infilled frame 6 wipes 240.That is, to frame 7 decode and time distortion with the space of infilled frame 6, it is shown as Ref. No. 29 in Figure 13.

At this moment, wherein there is frame 7 in frame 6 on the contrary not in de-jitter buffer 209.Therefore, can make the end phase matching of frame 7 and erase frame 5, and then it be expanded hole with infilled frame 6.This can be effectively wipes 240 and converts single erase 240 to two.Can convert the remarkable benefit that single erase 240 is realized the sound quality aspect to by will two wiping 240.

In the above example, the pitch periods 100 of frames 4 and 7 are carried by the frames 20 themselves, and the pitch period 100 of frame 6 is also carried by frame 7. The pitch period 100 of frame 5 is unknown. However, if the pitch periods 100 of frames 4, 6 and 7 are similar, there is a high likelihood that the pitch period 100 of frame 5 is also similar to the other pitch periods 100.

In another embodiment, shown in Figure 14 (which shows how a double erasure is converted into a single erasure), the decoder 206 plays one erasure 240 after decoding frame 4. After the erasure 240, it is ready to decode frame 7 (note that, in addition to frame 5, frame 6 is also missing). The double erasure 240 for missing frames 5 and 6 will therefore be converted into a single erasure 240. It is assumed that the encoder 204 and the decoder 206 are in phase synchrony, both having a phase equal to Phase_Start at the end of frame 4. The states of the encoder 204 and the decoder 206 are shown in Figure 14. In this case, the phase of the encoder 204 at the beginning of packet 7 = Enc_Phase = Phase_Start + ((160 mod Delay(5))/Delay(5) + (160 mod Delay(6))/Delay(6)).

The phase of the decoder 206 at the beginning of packet 7 = Dec_Phase = Phase_Start + (160 mod Delay(4))/Delay(4), where it is assumed that the erasure has a pitch delay equal to that of frame 4 and a length of 160 PCM samples.

In this case, Phase Offset (136) = -1 and Run Length (138) = 1. The Phase Offset 136 equals -1 because one erasure 240 is used to replace two frames, frame 5 and frame 6.
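The phase bookkeeping used in the examples above can be sketched as follows. This is a minimal illustration, assuming 160-sample frames and phase measured in fractions of a pitch cycle; the function names are hypothetical and are not taken from the disclosure.

```python
FRAME_LEN = 160  # PCM samples per frame, as in the text


def phase_advance(delay, frame_len=FRAME_LEN):
    """Fraction of a pitch cycle accumulated over one frame: (160 mod Delay) / Delay."""
    return (frame_len % delay) / delay


def encoder_phase(phase_start, delays):
    """Encoder phase after producing one frame per entry in `delays`."""
    return phase_start + sum(phase_advance(d) for d in delays)


def decoder_phase(phase_start, last_good_delay, num_erasures):
    """Decoder phase after playing `num_erasures` erasures, each assumed to
    reuse the pitch delay of the last correctly decoded frame."""
    return phase_start + num_erasures * phase_advance(last_good_delay)
```

For the Figure 14 case, `encoder_phase(Phase_Start, [Delay(5), Delay(6)])` and `decoder_phase(Phase_Start, Delay(4), 1)` give Enc_Phase and Dec_Phase respectively; for the two-erasure cases, `num_erasures` is 2.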

The amount of phase matching that needs to be done is:
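Claim 5 states this computation in words: the difference between the decoder phase and the encoder phase is formed (in whichever order keeps it non-negative) and multiplied by the pitch delay. The sketch below follows that wording only. The function and parameter names are hypothetical, phases are assumed to be in fractions of a pitch cycle, and the choice of which frame supplies the pitch delay is left to the caller.

```python
def phase_matching_amount(dec_phase, enc_phase, pitch_delay):
    """Phase difference scaled by the pitch delay, per the wording of claim 5."""
    if dec_phase >= enc_phase:
        # First difference: decoder phase minus encoder phase.
        return (dec_phase - enc_phase) * pitch_delay
    # Second difference: encoder phase minus decoder phase.
    return (enc_phase - dec_phase) * pitch_delay
```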

In all of the disclosed embodiments, the phase matching and time-warping instructions may be stored in software 216 or firmware, which may be located in the decoder 206 or in a decoder memory 207 located outside the decoder 206. The memory 207 may be ROM memory, although any of a number of different types of memory may be used, such as RAM, CD, DVD, magnetic core, etc.

Section II: Time Warping

Features of Using Time Warping in a Vocoder

Human voices consist of two components. One component comprises fundamental waves that are pitch-sensitive, and the other is fixed harmonics that are not pitch-sensitive. The perceived pitch of a sound is the ear's response to frequency; that is, for most practical purposes, the pitch is the frequency. The harmonic components give a person's voice its distinctive characteristics. They change with the vocal cords and with the physical shape of the vocal tract, and are called formants.

Human voice can be represented by a digital signal s(n) 10. Assume s(n) 10 is a digital speech signal obtained during a typical conversation, including different vocal sounds and periods of silence. The speech signal s(n) 10 is preferably partitioned into frames 20. In one embodiment, s(n) 10 is digitally sampled at 8 kHz.

Current coding schemes compress a digitized speech signal 10 into a low bit rate signal by removing all of the natural redundancies (i.e., correlated elements) inherent in speech. Speech typically exhibits short-term redundancies resulting from the mechanical action of the lips and tongue, and long-term redundancies resulting from the vibration of the vocal cords. Linear predictive coding (LPC) filters the speech signal 10 by removing these redundancies, producing a residual speech signal 30. It then models the resulting residual signal 30 as white Gaussian noise. A sampled value of a speech waveform can be predicted as a weighted sum of a number of past samples 40, each of which is multiplied by a linear predictive coefficient 50. Linear predictive coders therefore achieve a reduced bit rate by transmitting filter coefficients 50 and quantized noise rather than the full-bandwidth speech signal 10. The residual signal 30 is encoded by extracting a prototype period 100 from the current frame 20 of the residual signal 30.
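The prediction step described above can be illustrated with a short sketch. It assumes the predictor coefficients are already known and treats samples before the start of the signal as zero; the function name is hypothetical.

```python
def lpc_residual(signal, coeffs):
    """Residual e[n] = s[n] - sum_k a_k * s[n-k], with k running from 1 to len(coeffs).

    Samples before the start of `signal` are treated as zero.
    """
    residual = []
    for n, sample in enumerate(signal):
        prediction = sum(
            a * signal[n - k]
            for k, a in enumerate(coeffs, start=1)
            if n - k >= 0
        )
        residual.append(sample - prediction)
    return residual
```

When the signal is perfectly predictable from its past, as with a decaying exponential and a single matching coefficient, the residual is zero everywhere except at the first sample.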

A block diagram of an LPC vocoder 70 can be seen in Figure 15. The function of LPC is to minimize the sum of the squared differences between the original speech signal and the estimated speech signal over a finite duration. This produces a unique set of predictor coefficients 50, which are normally estimated for every frame 20. A frame 20 is typically 20 ms long. The transfer function of the time-varying digital filter 75 is given by:

H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}},

where the predictor coefficients 50 are represented by a_k and the gain by G.

The summation is computed from k=1 to k=p. If the LPC-10 method is used, then p=10. This means that only the first 10 coefficients are transmitted to the LPC synthesizer 80. The two most commonly used methods of computing the coefficients are (but are not limited to) the covariance method and the autocorrelation method.
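As an illustration of the autocorrelation method mentioned above, the sketch below solves for the predictor coefficients with the Levinson-Durbin recursion. The choice of recursion is an assumption (the text names only the method family), and the function names are hypothetical. It also assumes the frame is not all zeros, since the recursion divides by the prediction error.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Return ([a_1..a_p], prediction_error) for the predictor s[n] ~ sum a_k s[n-k]."""
    a = [0.0] * (order + 1)  # a[0] is unused so that a[k] matches a_k
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error
```

For LPC-10, `levinson_durbin(autocorrelation(frame, 10), 10)` would yield the ten coefficients a_1 through a_10 for that frame.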

Different speakers commonly speak at different speeds. Time compression is one method of reducing the effect of speed variation between individual speakers. Timing differences between two speech patterns can be reduced by warping the time axis of one so that maximum coincidence is attained with the other. This time compression technique is known as time warping. Furthermore, time warping can compress or expand a voice signal without changing its pitch.

Typical vocoders produce frames 20 of 20 ms duration, containing 160 samples 90 at the preferred 8 kHz rate. A time-warped compressed version of this frame 20 has a duration of less than 20 ms, while a time-warped expanded version has a duration of more than 20 ms. Time warping of voice data has significant advantages when voice data is sent over packet-switched networks, which introduce delay jitter into the transmission of voice packets. In such networks, time warping can be used to mitigate the effects of delay jitter and produce a voice stream that appears "synchronous".

Embodiments of the invention relate to an apparatus and method for time-warping frames 20 inside the vocoder 70 by manipulating the speech residual 30. In one embodiment, the method and apparatus are used in 4GV. The disclosed embodiments comprise several methods and apparatuses for expanding or compressing different types of 4GV speech segments 110 encoded using prototype pitch period (PPP), code-excited linear prediction (CELP) or noise-excited linear prediction (NELP) coding.

The term "vocoder" 70 typically refers to a device that compresses voiced speech by extracting parameters based on a model of human speech generation. The vocoder 70 includes an encoder 204 and a decoder 206. The encoder 204 analyzes the incoming speech and extracts the relevant parameters. In one embodiment, the encoder comprises a filter 75. The decoder 206 synthesizes the speech using the parameters that it receives from the encoder 204 via a transmission channel 208. In one embodiment, the decoder comprises a synthesizer 80. The speech signal 10 is often divided into frames of data 20 and block-processed by the vocoder 70.

Those skilled in the art will recognize that human voices can be classified in many different ways. Three conventional classifications of speech are voiced, unvoiced and transient speech. Figure 16A is a voiced speech signal s(n) 402. Figure 16A shows a measurable, common property of voiced speech known as the pitch period 100.

Figure 16B is an unvoiced speech signal s(n) 404. An unvoiced speech signal 404 resembles colored noise.

Figure 16C depicts a transient speech signal s(n) 406 (i.e., speech that is neither voiced nor unvoiced). The example of transient speech 406 shown in Figure 16C might represent s(n) transitioning between unvoiced speech and voiced speech. These three classifications are not all-inclusive. There are many different classifications of speech that may be employed according to the methods described herein to achieve comparable results.

The 4GV Vocoder Uses Four Different Frame Types

The fourth generation vocoder (4GV) 70 used in one embodiment of the invention provides attractive features for use over wireless networks. Some of these features include the ability to trade off quality against bit rate, more resilient vocoding in the face of increased packet error rate (PER), better concealment of erasures, etc. The 4GV vocoder 70 can use any of four different encoders 204 and decoders 206. The different encoders 204 and decoders 206 operate according to different coding schemes. Some encoders 204 are more effective at coding portions of the speech signal s(n) 10 that exhibit certain properties. Therefore, in one embodiment, the encoder 204 and decoder 206 may be selected based on the classification of the current frame 20.

The 4GV encoder 204 encodes each frame 20 of voice data into one of four different frame 20 types: prototype pitch period waveform interpolation (PPPWI), code-excited linear prediction (CELP), noise-excited linear prediction (NELP), or silence 1/8th rate frame. CELP is used to encode speech with poor periodicity or speech that involves changing from one periodic segment 110 to another. Thus, the CELP mode is typically chosen to code frames classified as transient speech. Since such segments 110 cannot be accurately reconstructed from only one prototype pitch period, CELP encodes the characteristics of the complete speech segment 110. The CELP mode excites a linear predictive vocal tract model with a quantized version of the linear prediction residual signal 30. Of all the encoders 204 and decoders 206 described herein, CELP generally produces the most accurate speech reproduction, but requires a higher bit rate.

A prototype pitch period (PPP) mode can be chosen to code frames 20 classified as voiced speech. Voiced speech contains slowly time-varying periodic components that are exploited by the PPP mode. The PPP mode codes a subset of the pitch periods 100 within each frame 20. The remaining periods 100 of the speech signal 10 are reconstructed by interpolating between these prototype periods 100. By exploiting the periodicity of voiced speech, PPP is able to achieve a lower bit rate than CELP and still reproduce the speech signal 10 in a perceptually accurate manner.

PPPWI is used to encode speech data that is periodic in nature. Such speech is characterized by different pitch periods 100 being similar to a "prototype" pitch period (PPP). This PPP is the only voice information that the encoder 204 needs to encode. The decoder can use this PPP to reconstruct the other pitch periods 100 in the speech segment 110.

A "noise-excited linear prediction" (NELP) encoder 204 can be chosen to code frames 20 classified as unvoiced speech. NELP coding operates effectively, in terms of signal reproduction, where the speech signal 10 has little or no pitch structure. More specifically, NELP is used to encode speech that is noise-like in character, such as unvoiced speech or background noise. NELP uses a filtered pseudo-random noise signal to model unvoiced speech. The noise-like character of such speech segments 110 can be reconstructed by generating random signals at the decoder 206 and applying appropriate gains to them. NELP uses the simplest model for the coded speech, and therefore achieves a lower bit rate.

1/8th rate frames are used to encode silence, i.e., periods in which the user is not talking.
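The correspondence between frame classification and coding mode described above can be summarized in a small lookup. This is only a sketch of the typical choice named in the text; the class labels and the function name are hypothetical, and an actual encoder may select modes on additional criteria.

```python
# Typical coding mode for each frame classification, as described in the text.
CODING_MODE = {
    "voiced": "PPP",       # periodic speech, coded from a prototype pitch period
    "transient": "CELP",   # poorly periodic or changing speech
    "unvoiced": "NELP",    # noise-like speech or background noise
    "silence": "1/8 rate", # periods in which the user is not talking
}


def select_coding_mode(frame_class):
    """Return the coding mode usually chosen for a given frame classification."""
    return CODING_MODE[frame_class]
```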

All four of the vocoding schemes described above share the initial LPC filtering procedure shown in Figure 3. After the speech has been characterized into one of the four categories, the speech signal 10 is sent through a linear predictive coding (LPC) filter 80, which filters out short-term correlations in the speech using linear prediction. The outputs of this block are the LPC coefficients 50 and the "residual" signal 30, which is essentially the original speech signal 10 with the short-term correlations removed. The residual signal 30 is then encoded using the specific methods of the vocoding scheme selected for the frame 20.

Figure 18 shows an example of the original speech signal 10 and the residual signal 30 after the LPC block 80. It can be seen that the residual signal 30 shows the pitch periods 100 more distinctly than the original speech 10. It stands to reason, therefore, that the residual signal 30 can be used to determine the pitch period 100 of the speech signal more accurately than the original speech signal 10 (which also contains short-term correlations).

Residual Time Warping

As stated above, time warping can be used to expand or compress the speech signal 10. While a number of methods may be used to achieve this, most of them are based on adding or deleting pitch periods 100 from the signal 10. The addition or deletion of pitch periods 100 can be done in the decoder 206 after the residual signal 30 has been received but before the signal 30 is synthesized. For speech data encoded using CELP or PPP (but not NELP), the signal includes a number of pitch periods 100. The smallest unit that can be added to or deleted from the speech signal 10 is therefore a pitch period 100, since any smaller unit would lead to a phase discontinuity and thus introduce a noticeable speech artifact. One step in the time-warping methods applied to CELP or PPP speech is therefore estimation of the pitch period 100. The pitch period 100 is already known to the decoder 206 for CELP/PPP speech frames 20. In the case of both PPP and CELP, pitch information is calculated by the encoder 204 using autocorrelation methods and is transmitted to the decoder 206. The decoder 206 therefore has accurate knowledge of the pitch period 100. This makes it simpler to apply the time-warping method of the present invention in the decoder 206.

Furthermore, as stated above, it is simpler to time-warp the signal 10 before the signal 10 is synthesized. If such time-warping methods were applied after the signal 10 had been decoded, the pitch period 100 of the signal 10 would need to be estimated. This not only requires additional computation, but the estimate of the pitch period 100 may also not be very accurate, since the residual signal 30 also contains LPC information 170.

On the other hand, if the additional pitch period 100 estimation is not too complex, then performing time warping after decoding requires no changes to the decoder 206, and the time warping can therefore be implemented just once for all vocoders 70.

Another reason for performing time warping in the decoder 206 before the signal is synthesized using LPC synthesis is that the compression/expansion can be applied to the residual signal 30. This allows the linear predictive coding (LPC) synthesis to be applied to the time-warped residual signal 30. The LPC coefficients 50 play an important role in how speech sounds, and applying synthesis 34 after warping 32 ensures that the correct LPC information 170 is maintained in the signal 10.

If, on the other hand, time warping is done after the residual signal 30 has been decoded, the LPC synthesis has already been performed before the time warping. The warping procedure may therefore change the LPC information 170 of the signal 10, especially if the post-decoding prediction of the pitch period 100 is not very accurate.

The encoder 204 (such as the one in 4GV) may categorize speech frames 20 as PPP (periodic), CELP (slightly periodic) or NELP (noisy), depending on whether the frames 20 represent voiced, unvoiced or transient speech. Using information about the speech frame 20 type, the decoder 206 can time-warp different frame 20 types using different methods. For instance, a NELP speech frame 20 has no notion of pitch periods, and its residual signal 30 is generated at the decoder 206 using "random" information. The pitch period 100 estimation of CELP/PPP therefore does not apply to NELP, and in general NELP frames 20 may be warped (expanded/compressed) by less than a pitch period 100. Such information is not available if time warping is done after the residual signal 30 has been decoded in the decoder 206. In general, time warping of NELP-like frames 20 after decoding leads to speech artifacts. Warping of NELP frames 20 in the decoder 206, on the other hand, produces much better quality.

Thus, there are two advantages to performing time warping in the decoder 206 (i.e., before the synthesis of the residual signal 30) as opposed to post-decoder (i.e., after the residual signal 30 has been synthesized): (i) reduced computational overhead (e.g., a search for the pitch period 100 is avoided); and (ii) improved warping quality, because a) the frame 20 type is known; b) LPC synthesis is performed on the warped signal; and c) the pitch period is estimated or known more accurately.

Residual Time Warping Methods

The following describes embodiments in which the present method and apparatus time-warp the speech residual 30 inside PPP, CELP and NELP decoders. The following two steps are performed in each decoder 206: (i) time-warping the residual signal 30 into an expanded or compressed version; and (ii) sending the time-warped residual 30 through an LPC filter 80. Step (i) is performed differently for PPP, CELP and NELP speech segments 110. The embodiments are described below.
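The two steps above can be sketched as a warp function followed by an all-pole synthesis filter. This is a minimal illustration, assuming a zero initial filter state and a caller-supplied `warp` function; the names are hypothetical.

```python
def lpc_synthesis(residual, coeffs, gain=1.0):
    """All-pole filter H(z) = G / (1 - sum_k a_k z^-k), with zero initial state."""
    out = []
    for n, e in enumerate(residual):
        feedback = sum(
            a * out[n - k]
            for k, a in enumerate(coeffs, start=1)
            if n - k >= 0
        )
        out.append(gain * e + feedback)
    return out


def decode_with_warp(residual, coeffs, warp):
    """Step (i): time-warp the residual. Step (ii): send it through LPC synthesis."""
    return lpc_synthesis(warp(residual), coeffs)
```

Because the synthesis runs on the already warped residual, the output keeps the spectral shaping given by the coefficients regardless of how many samples the warp added or removed.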

Time Warping of the Residual Signal When the Speech Segment 110 Is PPP

As stated above, when the speech segment 110 is PPP, the smallest unit that can be added to or deleted from the signal is a pitch period 100. Before the signal 10 can be decoded (and the residual 30 reconstructed) from the prototype pitch period 100, the decoder 206 interpolates the signal 10 from the previous prototype pitch period 100 (which is stored) to the prototype pitch period 100 in the current frame 20, adding the missing pitch periods 100 in the process. This process is depicted in Figure 5. Such interpolation lends itself readily to time warping by producing more or fewer interpolated pitch periods 100. This leads to compressed or expanded residual signals 30, which are then sent through the LPC synthesis.
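The idea of warping by producing more or fewer interpolated pitch periods can be sketched as follows. This is a deliberately simplified illustration: it assumes both prototypes have the same length and interpolates them linearly sample by sample, which is an assumption and not necessarily how the disclosed decoder interpolates. The function name is hypothetical.

```python
def interpolate_prototypes(prev_proto, cur_proto, num_periods):
    """Return `num_periods` pitch periods that morph linearly from `prev_proto`
    towards `cur_proto`; the last period equals `cur_proto`."""
    if len(prev_proto) != len(cur_proto):
        raise ValueError("this sketch assumes equal-length prototypes")
    periods = []
    for i in range(1, num_periods + 1):
        w = i / num_periods  # weight of the current prototype
        periods.append(
            [(1.0 - w) * p + w * c for p, c in zip(prev_proto, cur_proto)]
        )
    return periods
```

Requesting a smaller `num_periods` than the frame would normally hold compresses the residual, and requesting a larger one expands it.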

Time Warping of the Residual Signal When the Speech Segment 110 Is CELP

As stated earlier, when the speech segment 110 is PPP, the smallest unit that can be added to or deleted from the signal is a pitch period 100. In the case of CELP, on the other hand, warping is not as straightforward as for PPP. To warp the residual 30, the decoder 206 uses the pitch delay 180 information contained in the encoded frame 20. This pitch delay 180 is actually the pitch delay 180 at the end of the frame 20. It should be noted here that, even in a periodic frame 20, the pitch delay 180 may change slightly. The pitch delay 180 at any point in the frame can be estimated by interpolating between the pitch delay 180 at the end of the last frame 20 and that at the end of the current frame 20. This is shown in Figure 20. Once the pitch delays 180 at all points in the frame 20 are known, the frame 20 can be divided into pitch periods 100. The boundaries of the pitch periods 100 are determined using the pitch delays 180 at various points in the frame 20.

Figure 20A shows an example of how the frame 20 is divided into its pitch periods 100. For instance, sample number 70 has a pitch delay 180 of approximately 70, and sample number 142 has a pitch delay 180 of approximately 72. The pitch periods 100 are therefore sample numbers [1-70] and sample numbers [71-142]. See Figure 20B.
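The division into pitch periods can be sketched as below. The sketch assumes linear interpolation of the pitch delay across a 160-sample frame and closes a period at the first sample where the span covered reaches the local pitch delay. Both the boundary rule and the function names are assumptions made for illustration.

```python
def delay_at(n, prev_end_delay, cur_end_delay, frame_len=160):
    """Linearly interpolated pitch delay at sample offset n (0..frame_len)."""
    return prev_end_delay + (cur_end_delay - prev_end_delay) * n / frame_len


def split_into_pitch_periods(prev_end_delay, cur_end_delay, frame_len=160):
    """Return (start, end) pairs, end exclusive, for the whole pitch periods in a frame."""
    periods = []
    start = 0
    while True:
        end = start + 1
        # Advance until the span covered reaches the local pitch delay.
        while end <= frame_len and end - start < delay_at(
            end, prev_end_delay, cur_end_delay, frame_len
        ):
            end += 1
        if end > frame_len:
            break  # the remaining samples do not make up a full period
        periods.append((start, end))
        start = end
    return periods
```

With a hypothetical delay of 68 at the end of the previous frame and 72.4 at the end of the current one, the sketch yields the spans (0, 70) and (70, 142), which correspond to samples 1-70 and 71-142 in the example above.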

Once the frame 20 has been divided into pitch periods 100, these pitch periods 100 can be overlap-added to increase or decrease the size of the residual 30. See Figures 21B to 21F. In overlap-add synthesis, the modified signal is obtained by excising segments 110 from the input signal 10, repositioning them along the time axis, and performing a weighted overlap-add to construct the synthesized signal 150. In one embodiment, the segment 110 can equal one pitch period 100. The overlap-add method replaces two different speech segments 110 with one speech segment 110 by "merging" the segments 110. The merging is done in a manner that preserves as much speech quality as possible. Speech quality is preserved, and the introduction of artifacts into the speech is minimized, by carefully selecting the segments 110 to merge. (Artifacts are unwanted items such as clicks, pops, etc.) The selection of speech segments 110 is based on segment "similarity". The more similar the speech segments 110, the better the resulting speech quality and the lower the probability of introducing a speech artifact when two speech segments 110 are overlapped to reduce or increase the size of the speech residual 30. A useful rule for deciding whether pitch periods should be overlap-added is whether the pitch delays of the two are similar (for example, whether the pitch delays differ by less than 15 samples, which corresponds to about 1.8 ms).
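The similarity rule and the sample-to-time conversion above can be checked with a short sketch, assuming the 8 kHz sampling rate stated earlier. The function names are hypothetical, and the 15-sample threshold is the example value from the text.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate used in the text


def delays_are_similar(delay_a, delay_b, max_diff_samples=15):
    """True if two pitch delays differ by less than `max_diff_samples` samples."""
    return abs(delay_a - delay_b) < max_diff_samples


def samples_to_ms(num_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration in milliseconds of `num_samples` samples."""
    return 1000.0 * num_samples / sample_rate_hz
```

At 8 kHz, 15 samples last 1.875 ms, and a 160-sample frame lasts 20 ms.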

Figure 21C shows how overlap-add is used to compress the residual 30. As explained above, the first step of the overlap-add method is to segment the input sample sequence s[n] 10 into its pitch periods. Figure 21A shows the original speech signal 10, which includes four pitch periods 100 (PPs). The next step, as shown in Figure 7, is to remove pitch periods 100 from the signal 10 and replace them with a merged pitch period 100. For example, in Figure 21C, pitch periods PP2 and PP3 are removed and then replaced with one pitch period 100 in which PP2 and PP3 are overlap-added. More specifically, in Figure 21C, pitch periods 100 PP2 and PP3 are overlap-added such that the contribution of PP2 keeps decreasing while that of PP3 keeps increasing. The overlap-add method thus produces one speech segment 110 from two different speech segments 110. In one embodiment, the overlap-add is performed using weighted samples. This is illustrated by equations a) and b) shown in Figure 22. The weighting provides a smooth transition between the first PCM (pulse code modulation) sample of Segment1 (110) and the last PCM sample of Segment2 (110).

Figure 21D is another graphic illustration of PP2 and PP3 being overlap-added. The cross-fade improves the perceived quality of a signal 10 time-compressed by this method, compared with simply removing one segment 110 and abutting the remaining adjacent segments 110 (as shown in Figure 7E).

In cases where the pitch period 100 is changing, the overlap-add method may merge two pitch periods 100 of unequal length. In that case, better merging can be achieved by aligning the peaks of the two pitch periods 100 before overlap-adding them. Finally, the expanded/compressed residual is sent through the LPC synthesis.
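The compression step can be sketched as a weighted cross-fade of two pitch periods. The equations of Figure 22 are not reproduced in this text, so the linear weights below are an assumption, as are the function names and the restriction to equal-length segments.

```python
def overlap_add(seg1, seg2):
    """Cross-fade two equal-length segments into one. The result starts at
    seg1[0] and ends at seg2[-1], with linearly changing weights in between."""
    n = len(seg1)
    if n != len(seg2):
        raise ValueError("this sketch assumes equal-length segments")
    if n == 1:
        return [0.5 * (seg1[0] + seg2[0])]
    return [
        ((n - 1 - i) * a + i * b) / (n - 1)
        for i, (a, b) in enumerate(zip(seg1, seg2))
    ]


def compress(periods, index):
    """Replace periods[index] and periods[index + 1] with their merged period."""
    merged = overlap_add(periods[index], periods[index + 1])
    return periods[:index] + [merged] + periods[index + 2:]
```

Calling `compress([PP1, PP2, PP3, PP4], 1)` corresponds to the Figure 21C case, leaving three pitch periods in place of four.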

Speech Expansion

A simple approach to expanding speech is to repeat the same PCM samples multiple times. However, repeating the same PCM samples more than once can create regions of pitch flatness, an artifact that humans detect easily (e.g., the speech may sound somewhat "robotic"). To preserve speech quality, the overlap-add method may be used.

Figure 21B shows how the speech signal 10 can be expanded using the overlap-add method of the present invention. In Figure 21B, an additional pitch period 100 created from pitch periods 100 PP1 and PP2 is added. In the additional pitch period 100, pitch periods 100 PP2 and PP1 are overlap-added such that the contribution of the second pitch period 100 (PP2) keeps decreasing while that of PP1 keeps increasing. Figure 21F is another graphic illustration of PP2 and PP3 being overlap-added.
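Expansion can be sketched with the same kind of cross-fade, here fading from PP2 to PP1 to build the additional period. The linear weights, the equal-length restriction, the placement of the extra period between the two originals, and the function names are all assumptions made for illustration.

```python
def crossfade(fade_out, fade_in):
    """Blend two equal-length segments: starts at fade_out[0], ends at fade_in[-1]."""
    n = len(fade_out)
    if n != len(fade_in):
        raise ValueError("this sketch assumes equal-length segments")
    if n == 1:
        return [0.5 * (fade_out[0] + fade_in[0])]
    return [
        ((n - 1 - i) * a + i * b) / (n - 1)
        for i, (a, b) in enumerate(zip(fade_out, fade_in))
    ]


def expand(periods, index):
    """Insert one extra period between periods[index] and periods[index + 1].

    The extra period fades from the later period (e.g. PP2) to the earlier
    one (e.g. PP1), so the later period's contribution keeps decreasing.
    """
    extra = crossfade(periods[index + 1], periods[index])
    return periods[:index + 1] + [extra] + periods[index + 1:]
```

Calling `expand([PP1, PP2, PP3, PP4], 0)` yields five pitch periods, with the additional one built from PP1 and PP2.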

Time Warping of the Residual Signal When the Speech Segment Is NELP

For NELP speech segments, the encoder encodes the LPC information as well as the gains for different parts of the speech segment 110. It is not necessary to encode any other information, since the speech is very noise-like in nature. In one embodiment, the gains are encoded for sets of 16 PCM samples. Thus, for example, a frame of 160 samples may be represented by 10 encoded gain values, one for each 16 samples of speech. The decoder 206 generates the residual signal 30 by generating random values and then applying the respective gains to them. In this case, there may be no concept of a pitch period 100, and so the expansion/compression does not have to be at the granularity of a pitch period 100.

To expand or compress a NELP segment, the decoder 206 generates more or fewer than 160 samples, depending on whether the segment 110 is being expanded or compressed. The 10 decoded gains are then applied to the samples to generate an expanded or compressed residual 30. Since these 10 decoded gains correspond to the original 160 samples, they cannot be applied directly to the expanded/compressed samples. Various methods may be used to apply these gains. Some of these methods are described below.

If the number of samples to be generated is less than 160, then not all 10 gains need to be applied. For instance, if the number of samples is 144, the first 9 gains may be applied. In this instance, the first gain is applied to the first 16 samples (samples 1-16), the second gain is applied to the next 16 samples (samples 17-32), and so on. Similarly, if there are more than 160 samples, the 10th gain can be applied more than once. For instance, if the number of samples is 192, the 10th gain can be applied to samples 145-160, 161-176 and 177-192.
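The first method above can be sketched as follows: each gain covers a fixed group of 16 samples, and the last gain is reused for any samples beyond the 160th. The function name is hypothetical, and the random excitation is assumed to have been generated already.

```python
def apply_gains_fixed_groups(samples, gains, group_size=16):
    """Apply gains[k] to the k-th group of `group_size` samples, reusing the
    last gain for any samples beyond the final group."""
    last = len(gains) - 1
    return [
        s * gains[min(i // group_size, last)]
        for i, s in enumerate(samples)
    ]
```

With 144 samples only the first 9 gains are used; with 192 samples the 10th gain is applied to samples 145-192.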

Alternatively, the samples can be divided into 10 sets, each containing an equal number of samples, and the 10 gains can be applied to the 10 sets. For instance, if the number of samples is 140, the 10 gains can be applied to sets of 14 samples each. In this instance, the first gain is applied to the first 14 samples (samples 1-14), the second gain is applied to the next 14 samples (samples 15-28), and so on.

If the number of samples is not evenly divisible by 10, the 10th gain can be applied to the samples remaining after the division by 10. For instance, if the number of samples is 145, the 10 gains can be applied to sets of 14 samples each. In addition, the 10th gain is applied to samples 141-145.

After time warping using any of the methods described above, the expanded/compressed residual 30 is sent through the LPC synthesis.

The present method and apparatus can also be illustrated using the means-plus-function blocks shown in Figure 23, which discloses a means for phase matching 213 and a means for time warping 214.
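The second method can be sketched in the same way: the samples are split into as many equal sets as there are gains, and any remainder receives the last gain. The function name is hypothetical.

```python
def apply_gains_equal_groups(samples, gains):
    """Split `samples` into len(gains) equal sets, apply one gain per set,
    and apply the last gain to any remaining samples."""
    group_size = len(samples) // len(gains)
    if group_size == 0:
        raise ValueError("need at least as many samples as gains")
    last = len(gains) - 1
    return [
        s * gains[min(i // group_size, last)]
        for i, s in enumerate(samples)
    ]
```

For 145 samples and 10 gains, each set holds 14 samples and samples 141-145 receive the 10th gain, matching the example above.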

Those skilled in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Those skilled in the art will further appreciate that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits and steps have been described above in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The various illustrative logical blocks, modules and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read only memory (ROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An illustrative storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

The above description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of minimizing artifacts in speech, comprising:

phase matching a frame.

2. The method of minimizing artifacts in speech according to claim 1, wherein said step of phase matching comprises changing the number of samples in said frame.

3. The method of minimizing artifacts in speech according to claim 1, wherein said step of phase matching comprises:

finding the number of samples in the current frame after which the phase is similar to the phase at the end of the previous frame; and

shifting a fixed codebook index by said number of samples, so that the adaptive codebook and said fixed codebook are matched.

4. The method of minimizing artifacts in speech according to claim 1, further comprising:

time-warping said frame.

5. The method of minimizing artifacts in speech according to claim 1, wherein said step of phase matching comprises:

subtracting an encoder phase from a decoder phase, thereby forming a first difference, and multiplying said first difference by a pitch delay if said decoder phase is greater than or equal to said encoder phase; and

subtracting the decoder phase from the encoder phase, thereby forming a second difference, and multiplying said second difference by the pitch delay if said decoder phase is less than said encoder phase.

6. The method of minimizing artifacts in speech according to claim 2, wherein said step of changing the number of samples in said frame comprises: decoding the frame that follows an erasure at an offset from the beginning of said frame, wherein the first sample of said frame has the same phase offset as the phase offset at the end of the frame preceding said erasure.

7. The method of minimizing artifacts in speech according to claim 2, wherein said step of changing the number of samples in said frame comprises:

discarding samples of the current frame, wherein the phase at the end of the current frame matches said phase at the end of the previous erasure-reconstructed frame.

8. The method of minimizing artifacts in speech according to claim 2, further comprising the step of time-warping said frame.

9. The method of minimizing artifacts in speech according to claim 3, further comprising time-warping said frame.

10. The method of minimizing artifacts in speech according to claim 5, further comprising time-warping said frame.

11. The method of minimizing artifacts in speech according to claim 6, further comprising time-warping said frame.

12. The method of minimizing artifacts in speech according to claim 7, further comprising time-warping said frame.

13. The method of minimizing artifacts in speech according to claim 9, wherein said step of time-warping comprises:

estimating a pitch period; and

adding at least one of said pitch periods after receiving said residual signal.

14. The method of minimizing artifacts in speech according to claim 9, wherein said step of time-warping comprises:

estimating a pitch delay;

dividing a speech frame into pitch periods, wherein the boundaries of said pitch periods are determined using said pitch delay at various points in said speech frame; and

adding said pitch periods if said residual speech signal is increased.

15. The method of minimizing artifacts in speech according to claim 10, wherein said step of time-warping comprises:

estimating a pitch period; and

16. The method of minimizing artifacts in speech according to claim 10, wherein said step of time-warping comprises:

estimating a pitch delay;

adding said pitch periods if said residual speech signal is increased.

17. The method of minimizing artifacts in speech according to claim 10, wherein said step of time-warping comprises the steps of:

estimating at least one pitch period;

interpolating said at least one pitch period; and

adding said at least one pitch period when expanding said residual speech signal.

18. The method of minimizing artifacts in speech according to claim 12, wherein said step of time-warping comprises the steps of:

estimating at least one pitch period;

interpolating said at least one pitch period; and

19. The method of minimizing artifacts in speech according to claim 14, wherein said step of estimating a pitch delay comprises interpolating between the pitch delay at the end of the last frame and the pitch delay at the end of the current frame.

20. The method of minimizing artifacts in speech according to claim 14, wherein said step of adding said pitch periods comprises merging speech segments.

21. The method of minimizing artifacts in speech according to claim 14, wherein said step of adding said pitch periods if said residual speech signal is increased comprises: adding an additional pitch period formed from a first pitch segment and a second pitch period segment.

22. The method of minimizing artifacts in speech according to claim 21, wherein said step of adding an additional pitch period formed from a first pitch segment and a second pitch period segment comprises: adding said first and said second pitch segments such that the contribution of said first pitch period segment increases and the contribution of said second pitch period segment decreases.

23. A vocoder having at least one input and at least one output, said vocoder comprising:

an encoder comprising a filter, said filter having at least one input operably connected to the input of said vocoder and at least one output; and

a decoder comprising a synthesizer, said synthesizer having at least one input operably connected to said at least one output of said encoder and at least one output operably connected to said at least one output of said vocoder, wherein said encoder further comprises a memory, and wherein said decoder is adapted to execute instructions stored in said memory, said instructions comprising phase matching a frame.

24. The vocoder according to claim 23, wherein said phase matching instruction comprises changing the number of samples in said frame.

25. The vocoder according to claim 23, wherein said phase matching instruction comprises:

26. The vocoder according to claim 23, further comprising a time-warping instruction.

27. The vocoder according to claim 23, wherein said phase matching instruction comprises:

28. The vocoder according to claim 24, wherein said instruction for changing the number of samples in said frame comprises: decoding the frame that follows an erasure at an offset from the beginning of said frame, wherein the first sample of said frame has the same phase offset as the phase offset at the end of the frame preceding said erasure.

29. The vocoder according to claim 24, wherein said instruction for changing the number of samples in said frame comprises:

30. The vocoder according to claim 24, further comprising a time-warping instruction.

31. The vocoder according to claim 25, further comprising a time-warping instruction.

32. The vocoder according to claim 27, further comprising a time-warping instruction.

33. The vocoder according to claim 28, further comprising a time-warping instruction.

34. The vocoder according to claim 29, further comprising a time-warping instruction.

35. The vocoder according to claim 31, wherein said time-warping instruction comprises:

estimating a pitch period; and

36. The vocoder according to claim 31, wherein said time-warping instruction comprises:

estimating a pitch delay;

adding said pitch periods if said residual speech signal is increased.

37. The vocoder according to claim 32, wherein said time-warping instruction comprises:

estimating a pitch period; and

38. The vocoder according to claim 32, wherein said time-warping instruction comprises:

estimating a pitch delay;

adding said pitch periods if said residual speech signal is increased.

39. The vocoder according to claim 32, wherein said time-warping instruction comprises:

estimating at least one pitch period;

interpolating said at least one pitch period; and

40. The vocoder according to claim 34, wherein said time-warping instruction comprises:

estimating at least one pitch period;

interpolating said at least one pitch period; and

41. The vocoder according to claim 36, wherein said instruction for estimating a pitch delay comprises: interpolating between the pitch delay at the end of the last frame and the pitch delay at the end of the current frame.

42. The vocoder according to claim 36, wherein said instruction for adding said pitch periods comprises merging speech segments.

43. The vocoder according to claim 36, wherein said instruction for adding said pitch periods if said residual speech signal is increased comprises: adding an additional pitch period formed from a first pitch segment and a second pitch period segment.

44. The vocoder according to claim 43, wherein said instruction for adding an additional pitch period formed from a first pitch segment and a second pitch period segment comprises: adding said first and said second pitch segments such that the contribution of said first pitch period segment increases and the contribution of said second pitch period segment decreases.

45. An apparatus for minimizing artifacts in speech, comprising:

means for phase matching a frame.

46. The apparatus for minimizing artifacts in speech according to claim 45, wherein said means for phase matching comprises means for changing the number of samples in said frame.

47. The apparatus for minimizing artifacts in speech according to claim 45, wherein said means for phase matching comprises:

means for finding the number of samples in the current frame after which the phase is similar to the phase at the end of the previous frame; and

means for shifting a fixed codebook index by said number of samples, so that the adaptive codebook and said fixed codebook are matched.

48. The apparatus for minimizing artifacts in speech according to claim 45, further comprising:

means for time-warping said frame.

49. The apparatus for minimizing artifacts in speech according to claim 45, wherein said means for phase matching comprises:

means for subtracting an encoder phase from a decoder phase, thereby forming a first difference, and for multiplying said first difference by a pitch delay if said decoder phase is greater than or equal to said encoder phase; and

means for subtracting the decoder phase from the encoder phase, thereby forming a second difference, and for multiplying said second difference by the pitch delay if said decoder phase is less than said encoder phase.

50. The apparatus for minimizing artifacts in speech according to claim 46, wherein said means for changing the number of samples in said frame comprises: means for decoding the frame that follows an erasure at an offset from the beginning of said frame, wherein the first sample of said frame has the same phase offset as the phase offset at the end of the frame preceding said erasure.

51. The apparatus for minimizing artifacts in speech according to claim 46, wherein said means for changing the number of samples in said frame comprises:

means for discarding samples of the current frame, wherein the phase at the end of the current frame matches said phase at the end of the previous erasure-reconstructed frame.

52. The apparatus for minimizing artifacts in speech according to claim 46, further comprising means for time-warping said frame.

53. The apparatus for minimizing artifacts in speech according to claim 47, further comprising means for time-warping said frame.

54. The apparatus for minimizing artifacts in speech according to claim 49, further comprising means for time-warping said frame.

55. The apparatus for minimizing artifacts in speech according to claim 50, further comprising means for time-warping said frame.

56. The apparatus for minimizing artifacts in speech according to claim 51, further comprising means for time-warping said frame.

57. The apparatus for minimizing artifacts in speech according to claim 53, wherein said means for time-warping comprises:

means for estimating a pitch period; and

means for adding at least one of said pitch periods after receiving said residual signal.

58. The apparatus for minimizing artifacts in speech according to claim 53, wherein said means for time-warping comprises:

means for estimating a pitch delay;

means for dividing a speech frame into pitch periods, wherein the boundaries of said pitch periods are determined using said pitch delay at various points in said speech frame; and

means for adding said pitch periods if said residual speech signal is increased.

59. The apparatus for minimizing artifacts in speech according to claim 54, wherein said means for time-warping comprises:

means for estimating a pitch period; and

60. The apparatus for minimizing artifacts in speech according to claim 54, wherein said means for time-warping comprises:

means for estimating a pitch delay;

61. The apparatus for minimizing artifacts in speech according to claim 54, wherein said means for time-warping comprises:

means for estimating at least one pitch period;

means for interpolating said at least one pitch period; and

means for adding said at least one pitch period when expanding said residual speech signal.

62. The apparatus for minimizing artifacts in speech according to claim 56, wherein said means for time-warping comprises:

means for estimating at least one pitch period;

63. The apparatus for minimizing artifacts in speech according to claim 58, wherein said means for estimating a pitch delay comprises means for interpolating between the pitch delay at the end of the last frame and the pitch delay at the end of the current frame.

64. The apparatus for minimizing artifacts in speech according to claim 58, wherein said means for adding said pitch periods comprises means for merging speech segments.

65. The apparatus for minimizing artifacts in speech according to claim 58, wherein said means for adding said pitch periods if said residual speech signal is increased comprises means for adding an additional pitch period formed from a first pitch segment and a second pitch period segment.

66. The apparatus for minimizing artifacts in speech according to claim 65, wherein said means for adding an additional pitch period formed from a first pitch segment and a second pitch period segment comprises means for adding said first and said second pitch segments such that the contribution of said first pitch period segment increases and the contribution of said second pitch period segment decreases.
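
By way of illustration only, and without limiting the claims, the phase-difference computation recited in claims 5 and 49 and the weighted addition of two pitch segments recited in claims 22 and 66 might be sketched in Python as follows. The function names, the representation of each phase as a fraction of one pitch cycle in [0, 1), the rounding to whole samples, and the linear weighting ramp are all assumptions made for this sketch; the claims themselves do not specify them.

```python
from typing import List


def samples_from_phase_difference(enc_phase: float, dec_phase: float,
                                  pitch_delay: int) -> int:
    """Multiply the phase difference by the pitch delay (claims 5 and 49).

    Assumption: phases are fractions of one pitch cycle in [0, 1), and the
    product is rounded to a whole number of samples.
    """
    if dec_phase >= enc_phase:
        # First difference: decoder phase minus encoder phase.
        first_difference = dec_phase - enc_phase
        return round(first_difference * pitch_delay)
    # Second difference: encoder phase minus decoder phase.
    second_difference = enc_phase - dec_phase
    return round(second_difference * pitch_delay)


def additional_pitch_period(first_segment: List[float],
                            second_segment: List[float]) -> List[float]:
    """Add two equal-length pitch segments into one additional pitch period.

    The first segment's contribution increases across the period while the
    second segment's contribution decreases (claims 22 and 66).
    Assumption: a linear ramp w = (i + 1) / (n + 1) is used as the weight.
    """
    if len(first_segment) != len(second_segment):
        raise ValueError("segments must have the same length")
    n = len(first_segment)
    period = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # rises from near 0 toward 1
        period.append(w * first_segment[i] + (1.0 - w) * second_segment[i])
    return period
```

With a pitch delay of 40 samples, an encoder phase of 0.25 and a decoder phase of 0.75, the first function yields 20 samples; other weighting ramps and rounding rules would serve equally well under the claim language.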