CN1437747A

CN1437747A - Closed-loop multimode mixed-domain linear prediction (MDLP) speech coder

Info

Publication number: CN1437747A
Application number: CN00819221A
Authority: CN
Inventors: A·达斯
Original assignee: Qualcomm Inc
Current assignee: Qualcomm Inc
Priority date: 2000-02-29
Filing date: 2000-02-29
Publication date: 2003-08-20
Anticipated expiration: 2020-02-29
Also published as: WO2001065544A1; JP2003525473A; HK1055833A1; DE60031002D1; EP1259957B1; JP4907826B2; ES2269112T3; DE60031002T2; KR20020081374A; EP1259957A1; ATE341074T1; KR100711047B1; CN1266674C; AU2000233851A1

Abstract

A closed-loop, multimode, mixed-domain linear prediction (MDLP) speech coder includes a high-rate, time-domain coding mode, a low-rate, frequency-domain coding mode, and a closed-loop mode-selection mechanism for selecting a coding mode for the coder based upon the speech content of frames input to the coder transition speech frames are encoded with the high-rate, time-domain coding mode, which may be a CELP coding mode. Voiced speech frames are encoded with the low-rate, frequency-domain coding mode, which may be a harmonic coding mode. Phase parameters are not encoded by the frequency-domain coding mode, and are instead modeled in accordance with, e.g., a quadratic phase model. For each speech frame encoded with the frequency-domain coding mode, the initial phase value is taken to be the initial phase value of the immediately preceding speech frame encoded with the frequency-domain coding mode.

Description

Closed-loop multimode mixed-domain linear prediction (MDLP) audio coder ﹠ decoder (codec)

Background technology

Technical field

The present invention is generally about the speech processes field, particularly about the method and apparatus of a kind of closed loop, multimode, hybrid domain encoding and decoding speech.

Background

Utilize the speech transmissions of digital technology to become very extensive, especially in the application of long-distance and digital cordless phones.This situation has caused the interest that can send the minimum number of information under the prerequisite of reproducing the speech perception quality keeping on information to determining conversely.If just transmit language after sampling and the digitizing simply, be the speech quality that the data rate of per second 64 kilobits (kbps) obtains the conventional simulation phone with regard to needing the order of magnitude., by the use of speech analysis, and suitable encoding and decoding, transmission and synthetic again the take over party, can obtain significantly reduced data rate.

By the parameter relevant with human pronunciation model of extracting, operation technique comes the equipment of compressed voice to be called audio coder ﹠ decoder (codec).Audio coder ﹠ decoder (codec) can be decomposed into time block or analysis frame with the voice signal of input.Audio coder ﹠ decoder (codec) generally comprises a scrambler and a demoder.The speech frame of scrambler analysis input with some correlation parameter of extracting, becomes binary representation with these parameter quantifications then, promptly is quantized into the grouping of bit group or binary data.Packet is arrived receiver and demoder by traffic channel.These packets of decoder processes remove to quantize it producing parameter, and use parameter that these removals have quantized synthetic speech frame again.

The function of audio coder ﹠ decoder (codec) is by removing natural redundancies intrinsic in all voice, digitized voice signal being compressed into the signal of certain low bit rate.By quantizing to represent that with one group of bit these parameters obtain this digital compression with one group of parametric representation input speech frame and utilization.If the input speech frame has amount of bits N _i, and the packet that is produced by audio coder ﹠ decoder (codec) has amount of bits N _o, then the compressibility factor of this audio coder ﹠ decoder (codec) is C _r=N _i/ N _oProblem is to keep the high speech quality of decoding back voice when reaching the targeted compression factor.The performance of audio coder ﹠ decoder (codec) depends on: how (1) speech model or above-mentioned analysis and the combined running of building-up process get, and (2) are at every frame N _oHow the work of parameter quantification process gets under the target bit rate of bit.Therefore the target of speech model is the essence of catching voice signal with a small set of parameters of every frame, or target speech quality.

Audio coder ﹠ decoder (codec) can be implemented as the time domain codec, and it is each time by using the high time resolution processing procedure to catch the time domain speech waveform as segment voice (being generally 5 milliseconds of (ms) subframes) coding.At each subframe, can utilize multiple searching algorithm known in the art to find certain high-precision expression from a code book space.On the other hand, can realize that also audio coder ﹠ decoder (codec) is the frequency domain codec, it catches the short-term voice spectrum (analysis) of input speech frame with one group of parameter, and uses corresponding building-up process to come to rebulid speech waveform from these frequency spectrum parameters.According to the known quantification technique described in " vector quantization and signal compression " (1992) of A.Gersho and R.M.Gray, the parameter quantification device is preserved them by represent these parameters with the expression of code vector of storage.

A kind of well-known time domain audio coder ﹠ decoder (codec) is at (CELP) codec (all being enclosed in for your guidance) of L.B.Rabiner and R.W.Schafer " digital processing of voice signal " (1978) 396-453 page or leaf described " code excites linear prediction ".In the CELP codec, the short-term in the voice signal is correlated with or redundant the analysis by a linear prediction (LP) removed, and this analyzes the coefficient of seeking a short-term resonance peak wave filter.This short-term forecasting filter applies in the input speech frame, is just produced a LP residual signal, and this signal will be further originally simulated and is quantized with long-term forecasting filter parameter and random code of following.Like this, the CELP encoding and decoding just will be resolved into different LP short-term filter coefficient coding tasks and LP residue coding task to the task of time domain speech waveform coding.The time domain encoding and decoding can (be that every frame uses same bit number N with fixed rate _o) carry out, also can carry out with variable bit rate (dissimilar content frames is used different bit rates).Variable-rate codec attempts only to use the necessary amount of bits of proper level for the acquisition target quality to come codec parameters is encoded.Described a kind of exemplary variable bit rate CELP codec in United States Patent (USP) numbering 5,414,796, this patent has been transferred the possession of and has been given assignee of the present invention, and all is enclosed in for your guidance.

The time domain codec such as the CELP codec needs a large amount of bit N of every frame _oNumber is to keep the correctness of time domain speech waveform.If every frame bit number N _oConsiderably big (as 8kbps or more than), then such codec generally can provide outstanding speech quality.But at low bit rate (4kbps or following), because available bits quantity is limited, the time domain codec is difficult to keep high-quality and strengthens the property.At the low bit rate place, limited code book space constraint two-forty is commercial use in Waveform Matching ability of the conventional time domain codec that uses of success like this.

One research interest tide and strong business demand is arranged now, develop a kind of work in middle low bit rate (that is, 2.4 in the 4kbps scope or following) the high-quality voice codec.Its application comprises wireless telephone, satellite communication, Internet Protocol telephone, multiple multimedia and speech stream application, voice mail and other voice storage system.Its driving force is the demand of high power capacity and the requirement to strengthening the property under the packet loss environment.The standardized effort of many encoding and decoding speechs in modern age also is the another kind of directly driving force that advances the research and development of low rate voice coding/decoding algorithms.The low rate audio coder ﹠ decoder (codec) can produce more channel or user for the application bandwidth of each permission, can adapt to the overall bit budget of codec specification and combine the extra suitably low rate audio coder ﹠ decoder (codec) of channel coding/decoding layer, and under the channel error condition, provide and strengthen the property.

For with the low bit rate encoding and decoding, many methods of frequency spectrum or frequency domain, encoding and decoding speech have been developed, wherein, as certain time dependent evolution of spectrum beam analysis voice signal (referring to " the sinusoidal wave encoding and decoding " of R.J.McAulay and T.F.Quatieri in " encoding and decoding speech is with synthetic " (W.B.Kleijn and K.K.Paliwal edit, nineteen ninety-five) the 4th chapter).In the frequency spectrum codec, purpose is with one group of frequency spectrum parameter the short-term voice spectrum of each phonetic entry frame to be simulated or predicted, rather than accurately simulates time dependent speech waveform.Then these frequency spectrum parameters are encoded, and with voice output frame of decoded parameter generating.Its result's synthetic speech is not complementary with original input speech waveform, but similar perceived quality can be provided.Comprise that at well-known in the art exemplary frequency domain codec multiband excites codec (MBE), sine transform codec (STC) and harmonic wave codec (HC).These frequency domain codecs can provide high-quality parameter simulation, and have the parameter of one group of compactness that can accurately quantize with a small amount of bit at low bit rate.

Yet, the low bit rate encoding and decoding have brought the serious restriction in limited encoding and decoding resolution or limited code book space, the effectiveness that this has limited single encoding and decoding mechanism causes codec can't represent dissimilar sound bites with same degree of accuracy under the different background condition.For example, the Chang Gui low bit rate frequency domain codec phase information of transmission voice frames not.As an alternative phase information be to use certain at random, the artificial prima facies place value that produces and linear interpolation technology reproduce.Referring to " being used for the synthetic quadratic phase interpolation method of speech sound in the MBE model " of " 29 envelope electronics mail " (in May, 1993) 856-857 page or leaf H.Yang etc.).Because phase information is artificial the generation, even quantifications-removals quantizing process ideally keeps sinusoidal wave amplitude, by the output voice of frequency domain codec generation can be not in full accord with original input voice yet (that is, mainly pulse can be not synchronously).Therefore, in the frequency domain codec, be difficult to adopt any closed-loop characteristic measured value, for example signal to noise ratio (snr) or perception SNR.

Used the multimode encoding and decoding technique to carry out low rate encoding and decoding speech together with the open loop mode decision process.In " multimode and the variable rate voice encoding and decoding " of Amitava Das in " encoding and decoding speech with synthetic " (W.B.Kleijn and K.K.Paliwal edit, nineteen ninety-five) the 7th chapter etc., a kind of like this multimode encoding and decoding technique has been described.Conventional multimode codec is used different pattern or coding-decoding algorithm at dissimilar input speech frames.Every kind of pattern of fortune system or coding-decode procedure are represented the sound bite of certain particular type, for example speech sound, unvoiced speech or ground unrest (non-voice) in most effective mode.Outside open loop mode decision mechanism inspection input speech frame, and make the judgement which kind of pattern is applied to this frame.Generally by some parameter of from incoming frame, extracting, estimate these parameters with some with feature frequency spectrum temporarily, on the basis of estimating, make mode decision again and carry out open loop mode and judge.Therefore, in advance and do not know to export voice precise conditions (promptly with regard to speech quality or other performance measurement output voice and input voice will be how near and make mode decision.

Based on above-mentioned situation, be sought after providing a kind of low bit rate frequency domain codec that can estimate phase information more accurately.If a kind of multimode, hybrid domain codec can be provided, according to the content of speech frame some speech frame is carried out time domain coding, other speech frame is carried out Frequency Domain Coding, will be more favourable then.Be desirable to provide a kind of hybrid domain codec again, it can carry out time domain coding to some speech frame according to certain closed loop encoding/decoding mode decision mechanism, and other speech frame is carried out Frequency Domain Coding.Therefore, need the audio coder ﹠ decoder (codec) of a kind of closed loop, multimode, hybrid domain, guarantee the output voice that produce by codec and be input to time synchronized between the raw tone of codec.

Brief summary of the invention

The invention relates to the audio coder ﹠ decoder (codec) of a kind of closed loop of guaranteeing the output voice that produce by codec and being input to time synchronized between the raw tone of codec, multimode, hybrid domain.Correspondingly, in one aspect of the invention, the speech processor of multimode, hybrid domain advantageously comprises a codec with at least a time domain encoding/decoding mode and at least a frequency domain encoding/decoding mode, combine with codec with one, and the content that is configured to according to speech processor institute processed frame is the closed loop mode selecting arrangement that codec is selected encoding/decoding mode.

In another aspect of this invention, a kind of method of processed frame advantageously comprises following steps: each continuous incoming frame is applied an open loop encoding/decoding mode selection course, select time domain pattern or frequency domain encoding/decoding mode with the voice content according to incoming frame; If the voice content of incoming frame is represented the speech sound of steady state (SS), then incoming frame is carried out the frequency domain encoding and decoding; If the voice content of incoming frame represents not to be other any content of the speech sound of steady state (SS), then incoming frame is carried out the time domain encoding and decoding; The frame and the incoming frame that compare the frequency domain encoding and decoding are to obtain certain performance measurement; If this performance measurement drops to certain below the predefined threshold value, then incoming frame is carried out the time domain encoding and decoding.

In another aspect of this invention, the speech processor of multimode, hybrid domain.Advantageously comprise: an open loop encoding/decoding mode selection course is put on incoming frame, to select the device of time domain or frequency domain encoding/decoding mode according to the voice content of incoming frame; If the voice content of incoming frame is represented the speech sound of steady state (SS), receive the device that incoming frame is carried out the frequency domain encoding and decoding; If the voice content of incoming frame represents not to be other any content of the sound language of steady state (SS), the device that then incoming frame is carried out the time domain encoding and decoding; The frame and the incoming frame that compare the frequency domain encoding and decoding are to obtain the device of performance measurement; And if this performance measurement drops to certain below the predefined threshold value, then incoming frame is carried out the device of time domain encoding and decoding.

The accompanying drawing summary

Fig. 1 is the block scheme of every end by the communication channel of audio coder ﹠ decoder (codec) termination;

Fig. 2 is the block scheme that can be used for the scrambler of multimode mixed-domain linear prediction (MDLP) audio coder ﹠ decoder (codec);

Fig. 3 is the block scheme that can be used for the demoder of multimode MDLP audio coder ﹠ decoder (codec);

Fig. 4 is a process flow diagram, and the performed MDLP coding step of MDLP scrambler that can be used for Fig. 2 scrambler is shown;

Fig. 5 is a process flow diagram, and the encoding and decoding speech decision process is shown;

Fig. 6 is the block scheme of closed-loop multimode MDLP audio coder ﹠ decoder (codec);

Fig. 7 is the block scheme that can be used for the frequency spectrum codec of Fig. 6 codec or Fig. 2 scrambler;

Fig. 8 is the amplitude-frequency curve map, is illustrated in the sinusoidal wave amplitude in the harmonic wave codec;

Fig. 9 is a process flow diagram, is illustrated in the mode decision process in the multimode MDLP codec;

Figure 10 A is voice signal amplitude-time plot; And Figure 10 B is linear prediction (LP) residual amplitude-time plot;

Figure 11 A is a curve map, is the curve map of speed/pattern-frame index under the closed loop coding is judged; Figure 11 B is the curve map of perception signal to noise ratio (S/N ratio) (PSNR)-frame index under closed loop is judged; And Figure 11 C is the curve map of speed/pattern and PSNR-frame index when not existing closed loop coding to judge.

The detailed description of preferred embodiment

In Fig. 1, first scrambler 10 receives digitaling speech sampling s (n), and these sampling s (n) are encoded, to transmit toward first demoder 14 by a transmission medium 12 or communication channel 12.Demoder 14 is decoded to these encoded voice samplings, and synthesizes an output voice signal S SYNTH(n).For reciprocal transmission, to encode by 16 couples of digitaling speech sampling s of second scrambler (n), these samplings are by 18 transmission of a communication channel.Second demoder 20 these phonetic sampling of reception also decoded, and produce a synthetic output voice signal S SYNTH(n).

Phonetic sampling s (n) expression is any according in many in the art the whole bag of tricks of knowing, leads as the μ rule or the A of pulse code modulation (pcm), companding, carried out the voice signal of digitizing and quantification.As known in the art, these phonetic samplings s (n) is organized into some frames of importing data, wherein, every frame has all comprised the digitaling speech sampling s (n) of predetermined quantity.In certain exemplary embodiment, used the sample frequency of 8kHz, promptly the frame of each 20ms comprises 160 samplings.Among the embodiment that is described below, can to the basis of frame, advantageously change message transmission rate at frame, from 8kbps (full rate) to 4kbps (half rate) to 2kbps (1/4th speed) to 1kbps (1/8th speed).On the other hand, also can select other data rates for use.As used herein, term " full rate " or " two-forty " refer generally to the data rate more than or equal to 8kbps, and term " half rate " or " low rate " refer generally to be less than or equal to the data rate of 4kbps.The incremental data transfer rate is useful, because for the frame that comprises less relatively voice messaging, can selectively use lower bit rate.Those skilled in the art that are appreciated that also can use other sample frequency, frame sign and message transmission rate.

First scrambler 10 and second demoder 20 are formed first audio coder ﹠ decoder (codec) together, or the speech encoding demoder.Equally, second scrambler 16 and first demoder 14 are formed second audio coder ﹠ decoder (codec) together.Those skilled in the art that crowd knows that audio coder ﹠ decoder (codec) can be realized by digital signal processor (DSP), special IC (ASIC), discrete gate logical circuit, firmware or any conventional programmable software modules and microprocessor.Software module can reside in RAM storer, flash memory, register or any other form known in the art and can write in the medium.On the other hand, any conventional processors, controller or state machine can select to substitute microprocessor.In United States Patent (USP) numbering 5,727,123 (have transferred assignee of the present invention, and all be enclosed in for your guidance) and proposed on February 16th, 1994, the U.S. Patent application sequence numbering 08/197 of " vocoder special IC (ASIC) " by name, in 417 (transferred assignee of the present invention, and all be enclosed in for your guidance), the exemplary ASIC that is in particular encoding and decoding speech and designs has been described.

According to an embodiment who describes among Fig. 2, multimode mixed-domain linear prediction (MDLP) scrambler 100 that can be used for audio coder ﹠ decoder (codec) comprises a mode decision module 102, spacing estimation module 104, linear prediction (LP) analysis module 106, LP analysis filter 108, a LP quantization modules 110 and a MDLP residue scrambler 112.Input speech frame s (n) is offered mode decision module 102, spacing estimation module 104, linear prediction (LP) analysis module 106 and LP analysis filter 108.Mode decision module 102 produces a mode index I according to cycle of each input speech frame s (n) and other parameter of extracting such as energy, spectral tilt, zero crossing speed etc. _MWith a pattern M.Propose on March 11st, 1997, the U.S. Patent application sequence numbering 08/815 of " carrying out the method and apparatus of the variable bit rate voice encoding/decoding of changing down " by name, described in 354 (transferred assignee of the present invention, and all be enclosed in for your guidance) according to the several different methods of cycle the speech frame classification.Such method also is bonded among industry tentative standard TIA/EIA IS-127 of telecommunications industry association and the TIA/EIA IS-733.

Spacing estimation module 104 produces a spacing index I according to each input speech frame s (n) _PWith a time lag value P ₀LP analysis module 106 is carried out linear prediction analysis to each input speech frame s (n), to produce a LP parameter alpha.This LP parameter alpha is offered LP quantization modules 110.LP quantization modules 110 is gone back receiving mode M, thereby carries out quantizing process in the mode that depends on pattern.LP quantization modules 110 produces a LP index I _LPWith a quantification LP parameter .LP analysis filter 108 also receives input speech frame s (n) except receiving this quantification LP parameter .LP analysis filter 108 produce an expression input speech frame s (n) and the voice that reproduce according to this quantized linear prediction parameter between the LP residual signal R[n of error].LP residual signal R[n], pattern M and quantize LP parameter and offer MDLP residue scrambler 112 together.According to these numerical value, MDLP residue scrambler 112 is according to produce a residue index I below with reference to the described step of Fig. 4 process flow diagram _RWith a quantification residual signal [n].

In Fig. 3, demoder 200 that can be used for audio coder ﹠ decoder (codec) comprises a LP parameter decoder module 202, residue decoder module 204, a mode decoding module 206 and a LP composite filter 208.Mode decoding module 206 receives and the mode index I to producing from pattern M _MAnd decode.LP parameter decoder module 202 receiving mode M and LP index I _LPLP parameter decoder module 202 is decoded to these numerical value that receive and is quantized LP parameter to produce one.Residue decoder module 204 receives residue index I _R, spacing index I _PWith mode index I _MResidue decoder module 204 is also decoded to the numerical value that these receive and to be quantized residual signal to produce one

[n].This quantizes residual signal [n] and quantification LP parameter offer LP composite filter 208 together, a therefrom synthetic decoding output voice signal  [n].

Except that MDLP residue scrambler 112, the running of a plurality of modules and to be implemented in be known in the art in the demoder 200 among scrambler 100 and Fig. 3 among Fig. 2, and in aforesaid U.S. Patent numbering 5,414,796 and " digital processing of voice signal " (1978) 396-453 page or leaf of L.B.Rabiner and R.W.Schafer in describe to some extent.

According to an embodiment, the step in MDLP scrambler (not shown) execution graph 4 process flow diagram.This MDLP scrambler can be the MDLP residue scrambler 112 among Fig. 2.This MDLP scrambler checking mode M is full rate (FR) in step 300.Or 1/4th speed (QR) or 1/8th speed (ER).If pattern M is FR, QR or ER, then this MDLP scrambler just forwards step 302 to.In step 302, this MDLP scrambler puts on residue index I to corresponding speed (FR, QR or ER---depend on the value of M) _RBeing high precision, two-forty encoding and decoding to the FR pattern, and the time domain encoding and decoding of advantageously CELP encoding and decoding put on a LP residue frame, perhaps on the other hand, put on a speech frame.Send this frame (after the further signal Processing that comprises digital-to-analog conversion and modulation) then.In one embodiment, this frame is the LP residue frame of an expression predicated error.In another embodiment, this frame is the speech frame of an expression phonetic sampling.

On the other hand, if pattern M is not FR, QR or ER (that is, if pattern M is half rate (HR)) in step 300, then this MDLP scrambler just forwards step 304 to.In step 304, with half rate the frequency spectrum encoding and decoding of advantageously harmonic wave encoding and decoding are put on the LP residue, or on the other hand, put on voice signal.This MDLP scrambler forwards step 306 to then.In step 306,, can obtain a distortion measurement D by encoded voice being decoded and it being compared with original incoming frame.Then, this MDLP scrambler forwards step 308 to.In step 308, this distortion measurement D and a predetermined threshold T are compared.If distortion measurement D greater than predetermined threshold T, then modulates and sends the corresponding quantization parameter with the frame of half rate spectrum coding.On the other hand, if distortion measurement D is not more than threshold value T, then this MDLP scrambler forwards step 310 to.In step 310, at full speed rate in time domain to this frame recompile of having decoded.Can advantageously use any conventional two-forty, high precision code decode algorithm, for example CELP encoding and decoding.Then, modulate and send the FR pattern quantization parameter that is associated with this frame.

As shown in Fig. 5 process flow diagram, follow one group of step when the phonetic sampling that processing is used to transmit according to a kind of closed-loop multimode MDLP audio coder ﹠ decoder (codec) of an embodiment.In step 400, audio coder ﹠ decoder (codec) receives the digital sampling of the voice signal in the successive frame.After accepting particular frame, this audio coder ﹠ decoder (codec) execution in step 402.In step 402, this audio coder ﹠ decoder (codec) detects the energy of this frame.Energy is a kind of measured value to the voice activity of frame.Be summed square with the digitaling speech sampling amplitude, and with result's energy and certain threshold and carry out speech detection.In one embodiment, this threshold value is according to the variation level of ground unrest and adaptation.In aforesaid U.S. Patent numbering 5,414,796, a kind of exemplary variable thresholding speech activity detector has been described.Some unvoiced speech may be very low-energy sampling, may be used as ground unrest mistakenly and encode.For preventing this from occurring, the United States Patent (USP) numbering 5,414 as described above, described in 796, can use the spectral tilt of low-yield sampling to distinguish unvoiced speech and ground unrest.

Behind the energy that detects frame, this audio coder ﹠ decoder (codec) forwards step 404 to.In step 404, this audio coder ﹠ decoder (codec) determines whether the energy of detected frame is enough to this frame classification for comprising voice messaging.If the energy of detected frame is lower than intended threshold level, then this audio coder ﹠ decoder (codec) forwards step 406 to.In step 406, this audio coder ﹠ decoder (codec) is used as this frame as ground unrest (be non-voice or mourn in silence) and is encoded.In one embodiment, with 1/8 speed or 1kbps background noise frames is carried out time domain coding.If in step 404, the energy of detected frame meets or surpasses intended threshold level, be this frame classification voice then, and this audio coder ﹠ decoder (codec) forwards step 408 to.

In step 408, this audio coder ﹠ decoder (codec) determines whether this frame is periodic.For example, multiple known periodicity is determined that method comprises and is utilized zero crossing and utilize normalized autocorrelation functions (NACF).Especially, having described in the U.S. Patent application sequence numbering 08/815354 (transferred the possession of and given assignee of the present invention, and all be enclosed in for your guidance) that is called " method and apparatus of carrying out the variable bit rate voice encoding/decoding of changing down " that proposed on March 11st, 1997 utilizes zero crossing and NACF to come sense cycle.In addition, above-mentioned method in order to differentiation speech sound and unvoiced speech also is bonded among industry tentative standard TIA/EIA IS-127 of telecommunications industry association and the TIA/EIA IS-733.If in step 408, determine that this frame is not periodic, then this audio coder ﹠ decoder (codec) forwards step 410 to.In step 410, this audio coder ﹠ decoder (codec) is used as this frame as unvoiced speech and is encoded.In one embodiment, with 1/4 speed or 2kbps the unvoiced speech frame is carried out time domain coding.If it is periodic determining this frame in step 408, then this audio coder ﹠ decoder (codec) forwards step 412 to.

In step 412, this audio coder ﹠ decoder (codec) uses as the aforementioned, and for example, the known periods detection method of the prior art described in the above-mentioned U.S. Patent application sequence numbering 08/815,354 determines whether this frame is fully periodically.If it is not fully periodic determining this frame, then this audio coder ﹠ decoder (codec) forwards step 414 to.In step 414, this frame is used as transition voice (promptly from unvoiced speech to the speech sound transition) carries out time domain coding.In one embodiment, rate or 8kbps carry out time domain coding to the transition speech frame at full speed.

If in step 412, audio coder ﹠ decoder (codec) determines that this frame is fully periodic, and then this audio coder ﹠ decoder (codec) forwards step 416 to.In step 416, this audio coder ﹠ decoder (codec) is used as this frame as speech sound and is encoded.In one embodiment, with half rate or 4kbps the speech sound frame is carried out spectrum coding.Advantageously, use below with reference to the described a kind of harmonic wave codec of Fig. 7 this speech sound frame is carried out spectrum coding.On the other hand, can use in the art, other known frequency spectrum codec excites codec also can select for use as sine transform codec or multiband.Then, this audio coder ﹠ decoder (codec) forwards step 418 to.In step 418, this audio coder ﹠ decoder (codec) is decoded to the speech sound frame of coding.Then, this audio coder ﹠ decoder (codec) forwards step 420 to.In step 420, the speech sound frame of this decoding is compared with the corresponding input phonetic sampling of this frame, obtaining the measured value of a synthetic back voice distortion, and determine whether this kind half rate speech sound frequency spectrum coding is operated in the acceptable boundary.Then, this audio coder ﹠ decoder (codec) forwards step 422 to.

In step 422, whether the error between the speech sound frame of the definite decoding of this audio coder ﹠ decoder (codec) and the corresponding input phonetic sampling of this frame is lower than predetermined threshold value.According to an embodiment, use is finished this definite below with reference to the described mode of Fig. 6.If this coding distortion is lower than this predetermined threshold value, then this audio coder ﹠ decoder (codec) forwards step 424 to.In step 424, this audio coder ﹠ decoder (codec) is used as this frame as speech sound, uses the parameter in the step 416 to send.If this encoding error meets or surpass this predetermined threshold value in step 422, then this audio coder ﹠ decoder (codec) forwards step 414 to, this frame of the digitaling speech sampling that will in step 400, receive be used as the transition voice at full speed rate carry out time domain coding.

It may be noted that step 400-410 has constituted an open loop coding determinating mode, and on the other hand, step 412-426 has constituted a closed loop coding determinating mode.

As shown in Figure 6, in one embodiment, closed-loop multimode MDLP audio coder ﹠ decoder (codec) has comprised the analogue-to-digital converters (A/D) 500 that are connected with a frame buffer zone 502, and frame buffer zone 502 is connected with a processor controls 504 again.Energy calculator 506, speech sound detecting device 508, ground unrest scrambler 510, a two-forty time domain coding device 512 also all are connected with this processor controls 504 with a low rate spectrum coding device 514.A frequency spectrum demoder 516 is connected with spectrum coding device 514, and an Error Calculator 518 is connected with processor controls 504 with frequency spectrum demoder 516.A threshold value comparer 520 is connected with processor controls 504 with Error Calculator 518, and an impact damper 522 is connected with spectrum coding device 514, frequency spectrum demoder 516 and threshold value comparer 520.

In the embodiment shown in Fig. 6, in audio coder ﹠ decoder (codec), advantageously realize the parts of these language codecs, and audio coder ﹠ decoder (codec) itself resides in preferably among a DSP or the ASIC as firmware or other software-driven module.Those skilled in the art that can understand that these audio coder ﹠ decoder (codec) parts also can be realized with many other known methods with being equal to.Advantageously, processor controls 504 can be a microprocessor, but also can realize with controller, state machine or a discrete logic circuitry in addition.

In the multimode codec shown in Fig. 6, voice signal is provided for A/D 500.A/D 500 becomes analog signal conversion some frames of digitaling speech sampling S (n).These digitaling speech samplings are offered frame buffer 502.Processor controls 504 is extracted these digitaling speech samplings from frame buffer 502, and provides it to energy calculator 506.506 ENERGY E of calculating each phonetic sampling according to following formula of energy calculator:

E = Σ_{n = 0}^{159} S^{2} (n)

Wherein the length of frame is 20ms, and sample frequency is 8kHz.The ENERGY E that calculates is sent back to processor controls 504 again.

Processor controls 504 is compared the school with this speech energy that calculates with the voice activity threshold value.If this energy that calculates is lower than this voice activity threshold value, then processor controls 504 is directed to ground unrest scrambler 510 from frame buffer 502 with these digitaling speech samplings.Ground unrest scrambler 510 uses to keeping the necessary least number of bits of the estimation of ground unrest is encoded to these frames.

If this energy that calculates is more than or equal to this voice activity threshold value, then processor controls 504 is directed to speech sound detecting device 508 with these digitaling speech samplings from frame buffer 502.Speech sound detecting device 508 determines whether the periodicity of these speech frames can allow to use certain low bit rate spectrum coding to carry out effective encoding and decoding.The method of determining periodicity level in the speech frame is known in the art, for example, comprises and utilizes normalized autocorrelation functions (NACF) and zero crossing.These methods and other method have been described in aforesaid U.S. Patent application sequence numbering 08/815,354.

Speech sound detecting device 508 offers processor controls 504 with signal, points out that whether the voice that this speech frame comprises have enough cycles, carry out efficient coding for spectrum coding device 514.If speech sound detecting device 508 determines that this speech frame lacks enough cycles, then processor controls 504 is directed to high-rate coded device 512 with these digitaling speech samplings, and high-rate coded device 512 carries out time domain coding with predetermined maximum data rate to these voice.In one embodiment, this predetermined maximum data rate is 8kbps, and this high-rate coded device 512 is CELP codecs.

If speech sound detecting device 508 initial definite these voice signals have enough cycles of carrying out efficient coding for spectrum coding device 514, then processor controls 504 is directed to spectrum coding device 514 from frame buffer zone 502 with these digitaling speech samplings.To describe a kind of exemplary spectrum coding device in detail with reference to Fig. 7 below.

Spectrum coding device 514 is extracted and is estimated spacing frequency F ₀, the spacing frequency the amplitude A of harmonic wave ₁With speech information V _cSpectrum coding device 415 offers impact damper 522 and frequency spectrum demoder 516 with these parameters.Advantageously frequency spectrum demoder 516 can be similar with the demoder in the traditional C ELP scrambler.Frequency spectrum demoder 516 produces the synthetic speech sampling according to certain frequency spectrum codec format (being described below with reference to Fig. 7) (n), and with this synthetic speech sampling offer Error Calculator 518.Processor controls 504 sends to Error Calculator 518 with phonetic sampling S (n).

Error Calculator 518 is calculated each phonetic sampling S (n) and each corresponding synthetic speech sampling according to following formula

(n) square error between (MSE):

MSE = Σ_{n = 0}^{159} {(S (n) - \hat{S} (n))}^{2}

The MSE that calculates is offered threshold value comparer 520, but threshold value comparer 520 is determined level of distortion whether in acceptance limit, promptly whether level of distortion is lower than predetermined threshold value.

But if the MSE that calculates in acceptance limit, then threshold value comparer 520 offers impact damper 502 with signal, and the data that output spectrum has been encoded from audio coder ﹠ decoder (codec).On the other hand, if but MSE not in acceptance limit, then threshold value comparer 520 offers processor controls 504 with signal, processor controls 504 is directed to two-forty time domain coding device 512 from impact damper 522 with digital sampling successively.Time domain coding device 512 is encoded to these frames with predetermined maximum rate, and abandons the content in the buffer zone 522.

In the embodiment shown in fig. 6, employed frequency spectrum code/decode type is the following harmonic wave encoding and decoding of being described with reference to Fig. 7, but also can select the frequency spectrum encoding and decoding of any kind, excites encoding and decoding as sine transform encoding and decoding or multiband.As numbering in 5,195,166 the usage that multiband excites encoding and decoding is described at United States Patent (USP), and as in United States Patent (USP) numbering 4,865,068, describing the usage of sine transform encoding and decoding.

Be equal to or less than the sound frame of cyclic parameter at transition frames and phase distortion threshold value, rate or 8kbps advantageously use the CELP encoding and decoding to the multimode codec among Fig. 6 at full speed by two-forty time domain coding device 512.On the other hand, at such frame, can use any other known two-forty time domain encoding and decoding form.So, to transition frames (with enough periodic sound frame not) encoding and decoding, so that its input and output waveform just can mate well with high precision, and phase information also can keep well.In one embodiment, after having handled threshold value and surpassing the continuous sound frame of the predetermined quantity of its periodic measurement value, what need not threshold ratio school device 520 determines that the multimode codec just switches to full rate CELP encoding and decoding to a certain frame from the encoding and decoding of half rate frequency spectrum.

It may be noted that energy computing machine 506, speech sound detecting device 508 with processor controls 504, constituted the open loop coding and judged.Otherwise spectrum coding device 514, frequency spectrum demoder 516, Error Calculator 518, threshold value comparer 520, impact damper 522 have constituted a closed loop coding and have judged with processor controls 504.

In with reference to figure 7 described embodiment, use frequency spectrum encoding and decoding (advantageously harmonic wave encoding and decoding) abundant periodic sound frame to be encoded with low bit rate.The frequency spectrum codec is generally defined by some algorithms, and these algorithms are attempted to keep the time-evolution of voice spectrum feature in the mode of certain perception meaning by every frame voice being simulated and encoding in frequency domain.The essential part of such algorithm has: (1) spectrum analysis or parameter estimation; (2) parameter quantification; And (3) synthesize the output speech waveform with the parameter of decoding.Therefore, its target is exactly the key character that keeps the short-term voice spectrum with one group of frequency spectrum parameter, these parameters is encoded, then with the synthetic output of these frequency spectrum parameters of having decoded voice.Generally, synthetic output voice are as a kind of weighted sum of sine wave.This sine wave amplitude, frequency and phase place are exactly the frequency spectrum parameter of estimating during analyzing.

Although " pass through synthesis analysis " in the CELP encoding and decoding has been a kind of well-known technology, does not use this technology in the frequency spectrum encoding and decoding.The main cause that is not applied to the frequency spectrum codec by synthesis analysis is losing of initial phase information.Although from the viewpoint of certain perception, speech model running is appropriate, synthetic speech all can (MSE) still may be very high.Therefore, another advantage that accurately produces initial phase is exactly voice and the direct institute's generation ability relatively of phonetic sampling that acquisition will be reproduced, to allow determining whether this kind speech model can encode exactly to speech frame.

In the frequency spectrum encoding and decoding, synthetic output speech frame as:

S[n]＝S _v[n]+S _uv[n]，n＝1，2，…，N

Wherein N is the quantity of sampling quantity of every frame, S _vAnd S _UvBe respectively sound and noiseless part.It is as follows that " sine wave adds up " building-up process is set up sound part:

S [n] = Σ_{k = 1}^{L} A (k, n) \cdot \cos ({2 πnf}_{k} + θ (k, n))

Wherein L is sinusoidal wave total quantity, f _kBe the interested frequency in the short-term spectrum, A (k n) is sinusoidal wave amplitude, and θ (k n) is sinusoidal wave phase place.These amplitudes, frequency and phase parameter all are to estimate from the short-term spectrum of incoming frame by the spectrum analysis process.Noiseless part can be set up with sound part in certain single " sine wave adds up " is synthetic, perhaps can calculate separately by certain special " no phonosynthesis " process, is added back to S then _vIn.

In the embodiment of Fig. 7, use a kind of specific type frequency spectrum codec that is called as the harmonic wave codec abundant periodic sound frame to be carried out spectrum coding with low bit rate.The harmonic wave codec is analyzed the small fragment of a frame, with this frame be characterized as some sine waves and.Sinusoidal wave and in each sine wave all have a frequency, this frequency is the spacing F of this frame ₀Integral multiple.In an alternative embodiment, specific type frequency spectrum codec no longer is the harmonic wave codec, and the sine wave freuqency of each frame all extracts to the real number 2 π from one group 0.In the embodiment shown in fig. 7, advantageously select each sinusoidal wave amplitude and phase place in the summation, can on one-period, mate best to cause this summation, shown in the curve map of Fig. 8 with signal.The harmonic wave codec generally uses certain external sort, is labeled as each input speech frame sound or noiseless.For certain sound frame, its sinusoidal wave frequency all is restricted to estimated spacing (F ₀) harmonic wave, i.e. f _k=kF ₀For silent frame, use the crest of its short-term spectrum to be determined sinusoidal wave.Interpolation amplitude and frequency, simulating their differentiation on frame, as:

A(k，n)＝C ₁(k)*n+C ₂(k)

θ (k, n)=B ₁(k) * n ²+ B ₂(k) * n+B ₃(k) wherein, upwards opened characteristic frequency position f outside the short-term Fourier transform (STFT) of input speech frame of window _k(=kf ₀) the instantaneous value estimation coefficient [C of amplitude, frequency and the phase place located _i(k), B _i(k)].The parameter that each sine wave will send is amplitude and frequency.Do not send phase place, but according to any simulation the in several known technologies (as the quadratic phase simulation).

As shown in FIG. 7, certain harmonic wave codec comprises a spacing extractor 600, and it is connected with discrete Fourier transform (DFT) (DFT) and frequency analysis logical circuit 604 with window logical circuit 602.Receive the spacing extractor 600 of phonetic sampling S (n), also be connected with DFT and frequency analysis logical circuit 604 as input.DFT and frequency analysis logical circuit 604 are connected with a residue scrambler 606.Each all is connected spacing extractor 600, DFT and frequency analysis logical circuit 604 and residue scrambler 606 with a parameter quantification device 608.Parameter quantification device 608 is connected with a channel encoder 610 again, and channel encoder 610 then is connected with a transmitter 612.Transmitter 612 is connected with a receiver 614 by certain standard radio frequency (RF) interface (for example, stifled as certain CDMA (CDMA) air interface).Receiver 614 is connected with a channel decoder 616, and channel decoder 616 is connected with a removal quantizer 618.Removing quantizer 618 is connected with " sine wave an adds up " voice operation demonstrator 620.What be connected with " sine wave adds up " voice operation demonstrator 620 also has one to receive former frame information as the phase estimating device of importing 622.Configuration " sine wave adds up " voice operation demonstrator 620 is to produce synthetic speech output S _SYNTH(n).

Spacing extractor 600, window logical circuit 602, DFT and frequency analysis logical circuit 604, residue scrambler 606, parameter quantification device 608, channel encoder 610, channel decoder 616, remove the different ways that quantizer 618, " sine wave adds up " voice operation demonstrator 620 and phase estimating device 622 can be known with those skilled in the art that crowd, as comprise firmware or software module, realized.Transmitter 612 and receiver 614 can be realized with the standard RF parts of any equivalence.

In harmonic wave codec shown in Figure 7, spacing extractor 600 receives input sampling S (n), and the spacing frequency information F that extracts ₀Window logical circuit 602 multiply by suitable window function by sampling then, with permission the small fragment of speech frame is analyzed.The pitch information of utilizing spacing extractor 608 to provide, DFT and frequency analysis logical circuit 604 calculate the DFT of sampling, to produce the complex spectrum point, can be from complex spectrum point from the harmonic amplitude A that extracts ₁, shown in the curve map of Fig. 8, in Fig. 8, L represents the total quantity of harmonic wave.DFT is offered residue scrambler 606, the residue scrambler 606 acoustic information V that extracts _c

Be noted that as shown in Figure 8 V parameter _cA point on the expression frequency axis, frequency spectrum above it is the feature of unvoiced sound signal, and no longer is harmonic wave.On the contrary, at V _cThe frequency spectrum of some below is the humorous feature that involves the speech sound signal.

A ₁, F ₀And V _cComposition offers the parameter quantification device 608 that these information is quantized.These quantitative informations are offered channel encoder 610 with the form of grouping, and channel encoder 610 is with for example, such as the low bit rate of half rate or 4kbps it quantized.These groupings are provided for transmitter 612 then, and transmitter 612 is modulated it and from consequential signal being sent to receiver 614 in the air.Receiver 614 receives and this signal of demodulation, gives channel decoder 616 with the packet delivery of coding.Channel decoder 616 is decoded to these groupings, and the grouping that will decode offers removal quantizer 618.Remove quantizer 618 these information are removed quantification.Then these information are provided for " sine wave adds up " voice operation demonstrator 620.

Configuration " sine wave adds up " voice operation demonstrator 620 with according to above-mentioned about S[n] formula the sine wave of a plurality of simulation short-term voice spectrums is synthesized.The frequency f that these are sinusoidal wave _kBe basic frequency F ₀Multiple or harmonic wave, and basic frequency F ₀It is the frequency in the spacing cycle of quasi periodic (being transition) speech sound fragment.

" sine wave adds up " voice operation demonstrator 620 is also from phase estimating device 622 receiving phase information.Phase estimating device 622 receives the information of previous frame, the A of the former frame that promptly is right after ₁, F ₀And V _cParameter.Phase estimating device 622 also receives N the sampling that previous frame is reproduced, and wherein N is the length (being that N is the quantity of sampling quantity of every frame) of frame.Phase estimating device 622 is determined the initial phase of frame according to the information of previous frame.The initial phase of determining is offered " sine wave adds up " voice operation demonstrator 620.Calculate according to the information of present frame and the initial phase that is carried out according to past frame information by phase estimating device 622, " sine wave adds up " voice operation demonstrator 620 produces synthetic speech frame, as mentioned above.

As mentioned above, the harmonic wave codec changes to a frame linearity ground from a frame by information and the predicted phase of utilizing previous frame.And it is synthetic or reproduce to speech frame.In the synthetic model of above-mentioned common being called " quadratic phase model ", coefficient B ₃(k) initial phase of the synthetic current sound frame of expression.When determining phase place, conventional harmonic wave codec or initial phase are set to zero, or produce an initial phase randomly or by some pseudorandom production methods.For predicted phase more exactly, phase estimating device 622 is confirmed as speech sound frame (promptly fully periodically frame) according to the previous frame that is right after or the transition speech frame uses one of two kinds of possible methods to determine initial phase.If the frame of front is a speech sound frame, then use the prima facies place value of the final estimation phase value of a frame as present frame.On the other hand, if be the frame classification of front a transition frames, then from obtain the prima facies place value of present frame by the frequency spectrum of the decoding output of previous frame being carried out the previous frame that DFT obtains.So, phase estimating device 622 has just utilized the accurate phase information that existed (because being that rate is handled as the frame of the front of transition frames at full speed).

In certain embodiment, a kind of closed-loop multimode MDLP audio coder ﹠ decoder (codec) is followed the speech processes step described in Fig. 9 process flow diagram.This audio coder ﹠ decoder (codec) comes the LP residue of each input speech frame is encoded by selecting the most appropriate coding mode.Some pattern is encoded to LP residue or speech residual in time domain, and other patterns are represented LP residue or speech residual in frequency domain.The set of these patterns is: press full rate, time domain (T pattern) at transition frames; Press half rate, frequency domain (V model) at sound frame; Press 1/4th speed, time domain (U pattern) at silent frame; Press 1/8th speed, time domain (N pattern) at noise frame.

Those skilled in the art that will appreciate that voice signal or corresponding LP residue can be followed the step shown in Fig. 9 and encode.Can all regard the waveform character of noise, noiseless, transition and speech sound as in the curve map of Figure 10 A the function of time.Can regard noise, noiseless, transition and sound LP residue as in the curve map of Figure 10 B the function of time.

In step 700, make about the open loop mode that any in four kinds of patterns (T, V, U or N) is applied to import speech residual S (n) and judging.If use the T pattern, then in step 702, remain by T pattern (be full rate, in time domain) processed voice.If use the U pattern, then in step 704, press U pattern (i.e. 1/4th speed, in time domain) processed voice and remain.If use the N pattern, then in step 706, press N pattern (i.e. 1/8th speed, in time domain) processed voice and remain.If the application V model, then in step 708 by V model (be half rate, in frequency domain) processed voice residue.

In step 710, the voice of coding are decoded in the corresponding step 708, and it is compared with input speech residual S (n), and calculated performance measured value D.In step 712, this performance measurement D is compared with predetermined threshold value T.If performance measurement D more than or equal to threshold value T, then permits being sent in the speech residual of step 708 intermediate frequency spectrum coding in step 714.On the other hand, if performance measurement D, then presses T mode treatment input speech residual S (n) less than threshold value T in step 716.In certain alternative embodiment, do not calculate performance measurement, do not define threshold value yet.But behind the speech residual frame of having handled predetermined quantity by V model, just press T mode treatment next frame.

Advantageously, determination step shown in Fig. 9 allows to have only the T pattern of just using high bit rate where necessary, and the V model of low bit rate is used for the periodicity of speech sound fragment, in the time can't carrying out V model suitably, switch to full rate simultaneously, to prevent any decline of quality.Correspondingly, just can produce a kind of very high speech quality with certain mean speed that significantly is lower than full rate near the full rate speech quality.And, can also come controlled target speech quality by selected performance measurement and selected threshold value.

By making analogue phase follow the tracks of the Phase Tracking that keeps near the input voice, " renewal " of T pattern also improved the performance of V model subsequent applications.When the performance of V model is inappropriate, step 710 and 712 closed-loop characteristic inspection just switch to the T pattern, thereby by " refreshing " prima facies place value, allow analogue phase follow the tracks of the Phase Tracking of close original input voice once more, improved the performance that follow-up V model is handled.By the example as shown in the curve map of Figure 11 A-C,, in V model, carry out inappropriately from the 5th frame of beginning as seen by used PSNR distortion measurement.Thereby if there is not closed loop to judge and renewal, analogue phase is followed the tracks of will significantly depart from the Phase Tracking of original input voice, cause the violent reduction of the PSNR shown in Figure 11 C.And the performance of the subsequent frame of handling under V model has all reduced.But, under closed loop is judged, the 5th frame is switched to the T mode treatment like that shown in Figure 11 A.Can find out that the performance of the 5th frame has obtained significant raising by renewal from the raising of PSNR shown in Figure 11 B.In addition, also improved the performance of the subsequent frame of in V model, handling.

By initial phase estimated value very accurately is provided, guarantee that the synthetic signal of speech residual as a result of V model accurately aligns in time with original input speech residual S (n), determination step shown in Figure 9 has improved the expression quality of V model.Obtain the initial phase of the speech residual fragment of first V model processing in the following manner from a last decoded frame that is right after.For each harmonic wave, if previous frame is handled under V model, then its initial phase is set to equal the final estimation phase place of previous frame.For each harmonic wave, if previous frame is handled under the T pattern, then its initial phase is set to equal the actual harmonic phase of previous frame.By using the whole previous frame remaining DFT that decodes in the past to obtain the actual harmonic phase of previous frame.On the other hand, also can remain the actual harmonic phase of carrying out DFT and obtaining previous frame with the spacing method of synchronization to the past decoding by handling a plurality of spacing cycles of previous frame.

Closed-loop multimode mixed-domain linear prediction (MDLP) audio coder ﹠ decoder (codec) of a kind of novelty so, has been described.Those skilled in the art that crowd knows, can realize or carry out multiple declaration logical block and the algorithm steps relevant with the embodiment that reveals at this with digital signal processor (DSP), special IC (ASIC), discrete gate or transilog, the discrete hardware components as register and FIFO and so on, the processor of carrying out one group of firmware instructions or any conventional programmable software modules and processor.Advantageously, processor can be a microprocessor, but on the other hand, this processor also can be any conventional processors, controller, microcontroller or state machine.Software module can reside in RAM storer, flash memory, register or any other form known in the art can write medium.Those skilled in the art that can further understand, more than data, instruction, order, information, signal, bit, symbol and the chip quoted in the whole description, can advantageously be represented with voltage, electric current, electromagnetic wave, magnetic field or magnetic particle, light field or light particles or their any combination.

Preferred embodiment of the present invention is disclosed thus and is described., be familiar with in the art that the personnel of basic fundamental can understand, can do many changes to the embodiment that discloses herein and without departing from the spirit and scope of the present invention.So, unless according to following as claim, the present invention is unrestricted.

Claims

1. a multimode mixed-domain speech processor is characterized in that, comprising:

Codec, it has at least a time domain encoding/decoding mode and at least a frequency domain encoding/decoding mode; With

The closed loop mode selecting arrangement, it links to each other with codec, and the content that is configured to according to the handled frame of speech processor is that codec is selected encoding/decoding mode.

2. speech processor as claimed in claim 1 is characterized in that codec is encoded to speech frame.

3. speech processor as claimed in claim 1 is characterized in that, codec is encoded to the linear prediction residue of speech frame.

4. speech processor as claimed in claim 1, it is characterized in that, at least a time domain encoding/decoding mode comprises a kind of encoding/decoding mode that frame is carried out encoding and decoding by first kind of code rate, and at least a frequency domain encoding/decoding mode comprises a kind of encoding/decoding mode that frame is carried out encoding and decoding by second kind of code rate, and second kind of code rate is less than first kind of code rate.

5. speech processor as claimed in claim 1 is characterized in that, at least a frequency domain encoding/decoding mode comprises a kind of harmonic wave encoding/decoding mode.

6. speech processor as claimed in claim 1, it is characterized in that, further comprise the comparator circuit that links to each other with codec, be used for comparing to coded frame not with by the frame of at least a frequency domain encoding/decoding mode coding, and according to comparative result generation performance measurement, wherein, when having only this performance measurement to be lower than predetermined threshold, codec is just used at least a time domain encoding/decoding mode, otherwise codec is used this at least a frequency domain encoding/decoding mode.

7. speech processor as claimed in claim 1 is characterized in that, after handling continuously frame with the encoding and decoding of at least a frequency domain encoding/decoding mode and reaching a certain predetermined quantity, codec is used at least a time domain encoding/decoding mode to a back to back frame.

8. speech processor as claimed in claim 1, it is characterized in that, at least a frequency domain encoding/decoding mode comprises frequency for one group with a plurality of each tool, the sine wave of the parameter of phase place and amplitude is represented the short-term spectrum of each frame, wherein phase place is simulated by a multi-term expression and a prima facies place value, wherein prima facies place value or (1) if former frame with the encoding and decoding of at least a frequency domain encoding/decoding mode, then get the final estimation phase value of former frame, or (2) are if former frame with these at least a time domain encoding/decoding mode encoding and decoding, is then got certain phase value that obtains from the short-term spectrum of former frame.

9. speech processor as claimed in claim 8 is characterized in that, the sine wave freuqency of each frame is the integral multiple of the spacing frequency of this frame.

10. speech processor as claimed in claim 8 is characterized in that the sine wave freuqency of each frame extracts to the real number 2 π from one group 0.

11. the described method of processed frame is characterized in that, comprises the following steps:

Each continuous incoming frame is used open loop encoding/decoding mode selection course, to select a kind of time domain encoding/decoding mode or a kind of frequency domain encoding/decoding mode according to the voice content of incoming frame;

If the voice content of incoming frame is expressed as the speech sound of steady state (SS), then this incoming frame is carried out the frequency domain encoding and decoding;

If the voice content of incoming frame is expressed as any other content except that the steady state (SS) speech sound, then this incoming frame is carried out the time domain encoding and decoding;

Compare frame and incoming frame, to obtain a performance measurement with the frequency domain encoding and decoding; With

If this performance measurement is lower than predetermined threshold value, then this incoming frame is carried out the time domain encoding and decoding.

12. method as claimed in claim 11 is characterized in that, these frames all are the linear prediction residue frames.

13. method as claimed in claim 11 is characterized in that, these frames all are speech frames.

14. method as claimed in claim 11, it is characterized in that, time domain encoding and decoding step comprises by first kind of encoding and decoding speed carries out encoding and decoding to frame, and frequency domain encoding and decoding step comprises by second kind of encoding and decoding speed carries out encoding and decoding to frame, and second kind of encoding and decoding speed is less than first kind of encoding and decoding speed.

15. method as claimed in claim 11 is characterized in that, frequency domain encoding and decoding step comprises the harmonic wave encoding and decoding.

16. method as claimed in claim 11, it is characterized in that, frequency domain encoding and decoding step comprises with one group of a plurality of each tool and comprises that the sine wave of the parameter of frequency, phase place and amplitude represents the short-term spectrum of each frame, wherein phase place is simulated by a polynomial expression formula and a prima facies place value, wherein prima facies place value or (1) if former frame with the encoding and decoding of frequency domain encoding/decoding mode, then get the final estimation phase value of former frame, or (2) are if former frame with the encoding and decoding of time domain encoding/decoding mode, is then got certain phase value that obtains from the short-term spectrum of former frame.

17. method as claimed in claim 16 is characterized in that, the sine wave freuqency of each frame is the integral multiple of this frame pitch frequency.

18. method as claimed in claim 16 is characterized in that, the sine wave freuqency of each frame extracts to the real number 2 π from one group 0.

19. a multimode mixed-domain speech processor is characterized in that comprising:

Incoming frame is used open loop encoding/decoding mode selection course, to select the device of a kind of time domain encoding/decoding mode or a kind of frequency domain encoding/decoding mode according to the voice content of incoming frame;

If the voice content of incoming frame is expressed as the speech sound of steady state (SS), then this incoming frame is carried out the device of frequency domain encoding and decoding;

If the voice content of incoming frame is expressed as any other content except that the steady state (SS) speech sound, the device that then this incoming frame is carried out the time domain encoding and decoding;

Compare frame and incoming frame, to obtain the device of a performance measurement with the frequency domain encoding and decoding; With

If this performance measurement is lower than certain predetermined threshold value, then this incoming frame is carried out the device of time domain encoding and decoding.

20. speech processor as claimed in claim 19 is characterized in that, this incoming frame is a linear prediction residue frame.

21. speech processor as claimed in claim 19 is characterized in that, this incoming frame is a speech frame.

22. speech processor as claimed in claim 19, it is characterized in that, the device that is used for the time domain encoding and decoding comprises the device that frame is carried out encoding and decoding by first kind of encoding and decoding speed, the device that is used for the frequency domain encoding and decoding comprises the device that frame is carried out encoding and decoding by second kind of encoding and decoding speed, and second kind of encoding and decoding speed is less than first kind of encoding and decoding speed.

23. speech processor as claimed in claim 19 is characterized in that, the device that is used for the frequency domain encoding and decoding comprises the harmonic wave codec.

24. speech processor as claimed in claim 19, it is characterized in that, the device that is used for the frequency domain encoding and decoding comprises with a plurality of each tool and comprises a frequency for, the sine wave of the parameter of phase place and amplitude is represented the device of the short-term spectrum of each frame, wherein, phase place is simulated by a polynomial expression formula and a prima facies place value, wherein, prima facies place value or (1) if former frame with the encoding and decoding of frequency domain encoding/decoding mode, then get the final estimation phase value of former frame, or (2) are if former frame with the encoding and decoding of time domain encoding/decoding mode, is then got certain phase value that obtains from the short-term spectrum of former frame.

25. speech processor as claimed in claim 24 is characterized in that, the sine wave freuqency of each frame is the integral multiple of this frame pitch frequency.

26. speech processor as claimed in claim 24 is characterized in that the sine wave freuqency of each frame extracts to the real number 2 π from one group 0.