CN101572091A - Self-adapting multi-rate broadband coding method and coder - Google Patents

Self-adapting multi-rate broadband coding method and coder

Publication number
CN101572091A
Authority
CN
China
Prior art keywords
frame
input signal
speech
signal frame
amr
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100368357A
Other languages
Chinese (zh)
Inventor
向为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual

Abstract

The invention provides a novel adaptive multi-rate encoder and a coding method for it. Voice activity detection is performed on the synthetic digital speech corresponding to a coded frame, and the coding rate is determined from the voice activity detection result and the transmission type of discontinuous transmission, thereby compressing the speech transmission rate. The encoder state is updated accordingly: the encoder determines its excitation signal according to the type of the transmitted frame, so that the synthesized speech signal accurately reflects the auditory effect of the original speech. The invention can be applied directly to the speech coding technology of third-generation mobile communication systems, namely the Universal Mobile Telecommunications System.

Description

An AMR-WB coding method and encoder
Technical field
The present invention relates to an AMR-WB encoder and its coding method, and in particular to voice activity detection in the AMR-WB encoder and to techniques for AMR-WB coding of consecutive speech signal frames.
Background art
Code-excited linear prediction (CELP) coders have been widely used since they were proposed in 1985. CELP technology is used in the vocoders of both code division multiple access (CDMA) systems and the Universal Mobile Telecommunications System (UMTS).
Code-excited linear prediction comprises linear prediction and quantization, adaptive codebook search, and fixed codebook search. Because speech inherently contains silent periods, the transmission rate of the speech data can be compressed effectively by lowering the data rate during those periods. Qualcomm's patent application No. 92104618.9 for a variable-rate vocoder is one scheme of this kind.
UMTS uses adaptive multi-rate (AMR) speech coding, the speech compression coding defined by 3GPP (3rd Generation Partnership Project) for third-generation mobile communications. AMR speech coding is divided into adaptive multi-rate narrowband (AMR-NB), adaptive multi-rate wideband (AMR-WB), and extended adaptive multi-rate wideband (AMR-WB+) speech coding, all of which are based on code-excited linear prediction. The CELP coder used in the AMR codecs divides a speech signal frame into several subframes and performs linear prediction and quantization, adaptive codebook search and quantization, and fixed codebook search and quantization.
AMR-NB speech coding supports eight speech-mode coding rates, 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15 and 4.75 kb/s (kilobits per second), plus a low-rate background noise mode (1.80 kb/s); Table 1 in section 5 of 3GPP TS 26.071-500 gives the encoder modes corresponding to these coding rates.
AMR-WB speech coding supports nine speech-mode coding rates, 23.85, 23.05, 19.85, 18.25, 15.85, 14.25, 12.65, 8.85 and 6.6 kb/s, plus a low-rate background noise coding rate (1.75 kb/s); Table 1 in section 5 of 3GPP TS 26.171-500 gives the encoder modes corresponding to these AMR-WB coding rates: AMR-WB_23.85, AMR-WB_23.05, AMR-WB_19.85, AMR-WB_18.25, AMR-WB_15.85, AMR-WB_14.25, AMR-WB_12.65, AMR-WB_8.85, AMR-WB_6.60 and AMR-WB_SID.
Linear prediction and quantization comprise the following steps: the sampled speech signal frame, or the preprocessed speech signal frame, is arranged into a sequence, and the samples in this sequence are multiplied by a window function to give a windowed speech data frame; a set of autocorrelation coefficients is computed from the windowed frame; a set of linear prediction coefficients is computed from the autocorrelation coefficients with the Levinson-Durbin algorithm; the set of linear prediction coefficients is transformed into another spectral domain; and the transformed coefficients are quantized according to the rate given in the coding command, for example as a set of 10th-order line spectral pair (LSP) values or a set of 16th-order immittance spectral pair (ISP) values. LSPs are explained in the article "Line Spectrum Pair (LSP) and Speech Data Compression" published at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) '84; ISPs are explained in section 5.2.3, LP to ISP conversion, of 3GPP TS 26.190.
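The windowing, autocorrelation and Levinson-Durbin steps listed above can be sketched as follows. This is a minimal illustration, not the 3GPP reference code: the function names are hypothetical, and a plain Hamming window is assumed for simplicity, whereas the window actually used by AMR-WB is specified in TS 26.190.

```python
import math

def hamming_window(frame):
    """Multiply the frame by a Hamming window (an assumed window shape)."""
    n = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]

def autocorrelation(x, order):
    """Return r[0..order], with r[k] = sum over n of x[n] * x[n - k]."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for a[0..order] (a[0] = 1) so that A(z) = 1 + sum a[i] z^-i.

    Returns the coefficients and the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For AMR-WB the order would be 16; the resulting coefficients would then be converted to ISPs and quantized, which this sketch does not attempt.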
In the Qualcomm code-excited linear prediction (QCELP) process, the best codebook vectors obtained by the adaptive codebook search and the fixed codebook search are each multiplied by their optimal gain and then added; the sum is the excitation signal. The excitation signal is indispensable to the coding process: QCELP searches for the excitation-based synthetic speech that has the smallest error with respect to the original speech.
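As a small illustration of the sum described above, the excitation of one subframe can be written as a gain-weighted sum of the two codebook vectors. The function and parameter names are hypothetical.

```python
def build_excitation(adaptive_vec, fixed_vec, gain_pitch, gain_code):
    """Return u(n) = gain_pitch * v(n) + gain_code * c(n) for one subframe."""
    if len(adaptive_vec) != len(fixed_vec):
        raise ValueError("codebook vectors must have the same length")
    return [gain_pitch * v + gain_code * c
            for v, c in zip(adaptive_vec, fixed_vec)]
```

In a codec both gains would be the quantized values, so that encoder and decoder compute the same excitation.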
3GPP TS 26.190 describes the adaptive codebook search of AMR-WB, for example in section 5.7 of version TS 26.190-310. The adaptive codebook search comprises a closed-loop pitch search based on the past excitation, followed by computation of the adaptive codebook excitation by interpolating the past excitation at the selected integer and fractional pitch delay. The adaptive codebook parameters obtained by the search are the adaptive codebook excitation signal, the integer and fractional pitch delay, and the adaptive codebook gain and its quantization.
The closed-loop pitch search is performed by minimizing the mean-squared weighted error between the original speech and the reconstructed speech. The minimization finds the smallest mean-squared weighted error among those corresponding to each delay value in the search range; the error for each delay value is determined by the adaptive codebook search target signal and the response of the weighted synthesis filter to the past excitation. For AMR-WB, section 5.7 of 3GPP TS 26.190-510 explains this: the best integer delay is the integer delay value k that maximizes the term T_k given by formula (1),
$$T_k = \frac{\sum_{n=0}^{63} x(n)\, y_k(n)}{\sqrt{\sum_{n=0}^{63} y_k(n)\, y_k(n)}}, \tag{1}$$
The fractional delay around the best integer delay is likewise obtained by interpolating the normalized term T_k; the fractional delay that maximizes the interpolated value is the best fractional delay. The excitation values are stored in the excitation buffer (u(n), n = -(231+17), ..., 63); during the search stage the values (u(n), n = 0, 1, ..., 63) are the LP residual. The excitation values before the search stage (u(n), n < 0) in the excitation buffer are the excitation values of previous subframes. The excitation of each subframe is the sum of the adaptive codebook signal of the current subframe scaled by the quantized adaptive codebook gain and the fixed codebook signal scaled by the quantized fixed codebook gain; see also section 5.10 of TS 26.190-510, where formula (56) is the mathematical expression of the excitation.
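The integer part of the search around formula (1) can be sketched as below. This is a simplified illustration under stated assumptions: x(n) is taken to be the target signal and y_k(n) the past excitation at delay k filtered by an impulse response h; only delays k no shorter than the subframe length are handled, and fractional delays are not searched. All function names are hypothetical.

```python
import math

def convolve_truncated(signal, h):
    """Filter `signal` with impulse response `h`, keeping len(signal) outputs."""
    return [sum(signal[i] * h[n - i]
                for i in range(n + 1) if n - i < len(h))
            for n in range(len(signal))]

def pitch_criterion(x, y_k):
    """T_k of formula (1): correlation normalized by the energy of y_k."""
    num = sum(a * b for a, b in zip(x, y_k))
    den = math.sqrt(sum(b * b for b in y_k))
    return num / den if den > 0.0 else 0.0

def best_integer_delay(x, past_exc, h, k_min, k_max):
    """Return (k, T_k) for the integer delay in [k_min, k_max] maximizing T_k."""
    length = len(x)
    if k_min < length or k_max > len(past_exc):
        raise ValueError("sketch only handles len(x) <= k <= len(past_exc)")
    best_k, best_t = None, -math.inf
    for k in range(k_min, k_max + 1):
        start = len(past_exc) - k
        v_k = past_exc[start:start + length]   # excitation delayed by k samples
        t = pitch_criterion(x, convolve_truncated(v_k, h))
        if t > best_t:
            best_k, best_t = k, t
    return best_k, best_t
```

For example, with a past excitation that repeats every 5 samples and a trivial filter h = [1.0], the search over delays 4 to 9 selects k = 5.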
AMR-WB speech coding includes fixed codebook gain quantization, which comprises a predicted gain obtained from the quantified energy prediction errors of previous subframes, and quantization of the correction factor between the fixed codebook gain and that predicted gain. The quantified energy prediction error of a subframe is the logarithm of the correction factor scaled by a fixed factor.
TS 26.190 describes the fixed codebook gain quantization of AMR-WB, for example in section 5.9 of version TS 26.190-510. Formulas (50) and (52) there, reproduced below as formulas (2) and (3), show how the quantified energy prediction error affects the predicted gain:
$$\tilde{E}(n) = \sum_{i=1}^{4} b_i \hat{R}(n-i) \tag{2}$$
$$g'_c = 10^{0.05\left(\tilde{E}(n) + \bar{E} - E_i\right)}. \tag{3}$$
Formula (2) defines the predicted energy $\tilde{E}(n)$ of subframe n, where [b1 b2 b3 b4] = [0.5 0.4 0.3 0.2] are the moving-average (MA) prediction coefficients and $\hat{R}(k)$ is the quantified energy prediction error of subframe k. Formula (3) defines the predicted gain $g'_c$, where $\bar{E}$ is the mean of the innovation energy, with a value of 30 decibels (dB), and $E_i$ is the mean innovation energy. The correction factor between the fixed codebook gain and the predicted gain is the ratio of the former to the latter. Formula (53) in section 5.9 of TS 26.190-510 shows that the energy prediction error R(n) is 20 times the logarithm of this correction factor; the quantified energy prediction error is accordingly 20 times the logarithm of the quantized correction factor.
After the sampled digital speech frame has been preprocessed and has passed through linear prediction and quantization, adaptive codebook search and fixed codebook search, the formants of the resulting synthetic digital speech frame are determined mainly by the linear prediction coefficients (LPC) used in linear prediction. More precisely, for AMR-WB, once the ISPs have been converted to prediction (LP) coefficients, a 16th-order linear prediction synthesis filter is determined by formula (4), where $\hat{a}_i$ (i = 1, ..., m; m = 16) are the quantized prediction (LP) coefficients.
$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{m} \hat{a}_i z^{-i}}, \tag{4}$$
For AMR-WB, the output of the linear prediction synthesis filter driven by the excitation signal is the synthetic digital speech frame. The poles of the linear prediction synthesis filter therefore correspond to the frequencies and bandwidths of the formants of the synthetic digital speech frame; these formants are reflected in the amplitude of the time-domain waveform and strongly affect how the speech sounds.
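Filtering the excitation through the all-pole filter of formula (4) amounts to the recursion s(n) = u(n) - sum over i of a_i s(n-i). The sketch below shows this recursion; the function name and the `memory` argument (previous output samples, oldest first) are illustrative assumptions.

```python
def lp_synthesis(excitation, a_hat, memory=None):
    """Filter `excitation` through 1 / A(z), with a_hat[0] assumed to be 1."""
    m = len(a_hat) - 1
    hist = list(memory) if memory is not None else [0.0] * m  # hist[-1] = s(n-1)
    out = []
    for u in excitation:
        s = u - sum(a_hat[i] * hist[-i] for i in range(1, m + 1))
        out.append(s)
        hist.append(s)
    return out
```

For a first-order filter with a_hat = [1, -0.5], a unit impulse produces the decaying response 1, 0.5, 0.25, 0.125, which shows how a pole inside the unit circle shapes the synthetic waveform.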
In AMR-WB speech decoding, the LP (linear prediction) filter parameters are decoded for every frame, yielding the LP filter coefficients used to reconstruct the speech signal of each subframe. The excitation of each subframe is constructed by adding the adaptive codebook signal scaled by the adaptive codebook gain to the fixed codebook signal scaled by the fixed codebook gain; the adaptive codebook gain and the fixed codebook signal here are quantized values looked up in the quantization tables using the decoded adaptive codebook gain index and fixed codebook index. The AMR-WB adaptive codebook signal is a signal composed from the excitation of the previous subframe: the decoded adaptive codebook index gives the integer and fractional pitch delay; interpolating the excitation of the previous subframe at that delay gives the pitch code vector v'(n) (the same notation as in section 5.7 of 3GPP TS 26.190-510); and the pitch code vector is then linearly interpolated according to the signal path parameter in the coded frame (one of two paths) to obtain the adaptive codebook signal. The signal path is computed by the encoder and written into the AMR-WB coded frame, except in the 6.60 kb/s mode, where it is fixed to the second path. The construction of the adaptive codebook signal is described in detail in section 5.7 of 3GPP TS 26.190-510.
According to "Linear Prediction: A Tutorial Review", published in Proc. IEEE (Proceedings of the Institute of Electrical and Electronics Engineers), 1975, 63(4): 561-580, the peaks of a spectral envelope obtained by linear prediction usually lie near harmonic peaks and deviate from the true formant positions. In other words, the spectral envelope of the synthetic digital speech frame obtained from the linear prediction synthesis filter does not coincide with the spectral envelope of the original digital speech signal frame.
Section 5.3.4, Levinson recursion and its associated properties, of "Discrete-Time Speech Signal Processing: Principles and Practice" by Quatieri (USA), published by the Publishing House of Electronics Industry in 2004, points out the following: the all-pole model and autocorrelation method used in linear prediction place all poles of formula (7) inside the unit circle, giving a minimum-phase system; the phase function of the Fourier transform of the autocorrelation-method solution for a sequence is distorted; the autocorrelation method of linear prediction turns the maximum-phase poles of the glottal pulse into minimum-phase poles; and when the synthetic speech waveform is built, the phase distortion caused by the autocorrelation method may affect speech perception, that is, the waveform of the synthetic digital speech signal deviates from the waveform of the original digital speech signal. Section 5.6 of the same book, Synthesis based on all-pole modeling, points out that a signal synthesized with the autocorrelation method of linear prediction looks like speech but, because of its minimum-phase property, has lost the absolute phase structure. As the example in Figure 5.18 of the book shows, the peaks of the reconstructed speech signal are more pronounced than those of the original signal, and the idealized glottal pulse, being assumed minimum phase, is time-reversed and has a steeper rising edge than the actual glottal pulse.
In the current voice activity detection (VAD) method of adaptive multi-rate vocoders, the difference between the level of the preprocessed input signal and the background noise estimate is computed first, followed by the VAD decision threshold. The initial VAD decision compares this difference with the threshold: the frame is initially judged to contain speech when the difference exceeds the threshold, and to contain no speech when the difference is less than or equal to the threshold. The final VAD decision combines the initial decision with the results of other detectors, such as tone detection on the preprocessed digital speech signal.
The VAD of AMR-WB is also combined with discontinuous transmission (DTX). Using the VAD results of several input signal frames, DTX detects the end of a speech burst and only then begins the discontinuous transmission of silence descriptor (SID) frames. 3GPP TS 26.193 describes one DTX implementation.
DTX requires several (for example 8) consecutive frames at the end of a speech burst in order to produce a SID frame: after several (for example 7) consecutive input signal frames whose VAD result is no speech have been coded at a speech-mode coding rate, the following frame (for example the 8th) is coded as SID_FIRST to indicate the end of the speech burst. Once the SID_FIRST frame has been sent, SID_UPDATE frames are sent periodically (for example every 8 frames) for as long as there is no speech; the first SID_UPDATE frame must be sent at a specific time after the SID_FIRST frame (for example at the 3rd frame). One exception: if the VAD result of an input signal frame that follows a speech frame is no speech, and less than a certain time (for example 24 frames) has elapsed since the end of the previous speech burst, that frame is coded as a SID_FIRST frame.
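The frame sequence described above can be sketched as a small state machine using the example values from the text (7 hangover frames, first SID_UPDATE at the 3rd frame after SID_FIRST, then one every 8 frames). This is a simplified illustration, not the TS 26.193 procedure: the function name is hypothetical, the short-gap exception (for example 24 frames) is deliberately omitted, and the NO_DATA label for frames in which nothing is sent is an assumption not stated in the text.

```python
def tx_types(vad_flags, hangover=7, first_update_offset=3, update_period=8):
    """Map per-frame VAD flags (True = speech) to transmission types."""
    out = []
    silent_run = 0       # consecutive no-speech frames seen so far
    since_first = None   # frames elapsed since SID_FIRST; None during speech
    for speech in vad_flags:
        if speech:
            silent_run, since_first = 0, None
            out.append("SPEECH_GOOD")
            continue
        silent_run += 1
        if silent_run <= hangover:
            out.append("SPEECH_GOOD")      # hangover: still coded as speech
        elif since_first is None:
            since_first = 0
            out.append("SID_FIRST")        # marks the end of the speech burst
        else:
            since_first += 1
            due = (since_first >= first_update_offset and
                   (since_first - first_update_offset) % update_period == 0)
            out.append("SID_UPDATE" if due else "NO_DATA")
    return out
```

Feeding one speech frame followed by 20 no-speech frames yields SPEECH_GOOD for frames 0 to 7, SID_FIRST at frame 8, and SID_UPDATE at frames 11 and 19.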
Summary of the invention
Technical problem to be solved
The phonetic features of the synthetic digital speech frame generated from a coded frame produced by AMR coding with code-excited linear prediction are not consistent with those of the original digital speech signal frame. The background section has already pointed this out: the formant peak positions of a spectral envelope estimated by linear prediction analysis often deviate from the true formants; and the all-pole model and autocorrelation method used in linear prediction place all poles of the model inside the unit circle, which distorts the phase function of the Fourier transform of the synthetic digital speech signal and makes its waveform shape deviate from that of the original digital speech signal.
In the prior art, VAD operates on the digital speech signal frame formed by sampling the input speech, or on the preprocessed digital speech signal frame obtained from that sampled frame. The peak positions on the waveform of the synthetic digital speech signal, produced by decoding consecutive coded frames generated by linear prediction analysis and codebook excitation, often deviate from the peak positions on the waveform of the corresponding digital speech signal (or preprocessed digital speech signal) used for VAD. This document gives a concrete example by coding a specific speech sample with the 3GPP AMR-NB and AMR-WB vocoders.
The example concerns the largest peak between 1.157 s and 1.160 s in the waveform of the speech signal in file T22.inp (inp is the file extension), located in the T_inp directory of the T.zip file inside the 3GPP TS26.174-540.zip file (zip is the file extension). When T22.inp is taken as input and encoded and decoded at a coding rate of 23.05 kb/s, the corresponding peak on the waveform of the synthetic digital speech signal does not fall in the same frame, as explained below:
As shown in Figure 7, the waveform of frame 58 of the digital speech signal given by T22.inp has its largest peak in its final portion (just before 1.16 s in the figure). In the synthetic digital audio signal after decoding, as shown in Figure 8, the corresponding peak appears in frame 59 (after 1.16 s) of the synthetic digital speech signal produced by decoding the frames coded at 23.05 kb/s. Synthetic frame 59 is one frame later than the corresponding frame 58, so frame 59 of the synthetic digital speech signal contains a waveform peak that frame 59 of the original signal does not. This is because the frame-59 digital speech signal used for VAD and the frame-59 digital speech signal used for AMR-WB coding are not identical: part of the frame-59 signal used for VAD is used for coding AMR-WB frame 60.
A digital speech frame and its corresponding synthetic digital speech signal frame therefore do not necessarily have consistent time-domain and frequency-domain sound characteristics. The VAD result of the sampled (or preprocessed) digital speech frame used for VAD does not imply that the corresponding synthetic digital speech signal frame has the same VAD result, particularly when a formant detected in the digital speech input frame used for VAD is mapped by the coding operation onto the synthetic digital speech frame corresponding to the next adjacent input frame.
The present invention addresses the adverse effect on VAD of the inconsistency in speech characteristics between the input signal frame before coding and the synthetic digital signal frame decoded from the coded frame, as well as the adverse effect of the inconsistency in waveform features between the two. For example, if the VAD result for frame 392 of the preprocessed digital speech signal given by the above-mentioned DTX4.INP is speech while that for frame 393 is no speech, frame 392 is coded at a speech-mode coding rate and frame 393 at the background noise coding rate; the waveform peak of frame 392 then cannot be reflected in the synthetic digital signal frames of the variable-rate coding.
If voice activity detection is to be performed on the synthetic digital speech frame, then whether and how the parameters that produced this frame (the excitation signal obtained by linear prediction and codebook search, the filter memories, the filter errors and so on) can be used for coding the next frame is also a problem to be solved by the present invention.
Technical solution
Whether the digital speech frame obtained by decoding an AMR-WB coded frame contains speech can also be decided by performing voice activity detection on that digital speech frame. The present invention therefore performs VAD directly on the synthetic digital speech signal frame of the AMR coded frame.
In the method of the invention, a synthetic digital speech frame is generated at a speech-mode coding rate and used as the object of VAD. On the one hand, generating the synthetic digital speech frame involves the linear prediction, codebook search and other operations that are carried out without interruption in AMR-WB speech-mode coding. On the other hand, a VAD result of no speech can cause the encoder to output a low-rate AMR-WB frame or even a background noise coded frame. Uninterrupted (for example constant-rate) speech-mode coding sounds better than variable-rate coding that mixes speech mode and background noise mode. Therefore, when high-rate speech-mode coding resumes after low-rate or background-noise-mode coding, using the parameters produced by the high-rate speech-mode coding (that is, by generating the synthetic digital speech frame) helps improve speech quality.
The invention therefore proposes a further method. Where the same speech input frame has undergone two codebook searches, one at a speech-mode (non-background-noise) coding rate and one at another, lower speech-mode coding rate (or the background noise coding rate), and only the coded frame of one coding rate is selected as the AMR-WB frame sent to the decoder, the parameters produced by speech-mode coding are used selectively for coding the next frame. The invention provides this selection scheme.
The selection scheme of the invention ensures that, after the encoder has finished coding the AMR-WB frame of the current input signal frame and the decoder has finished decoding that AMR-WB frame, both sides hold the same excitation signal. The benefit is that, given identical excitation on both sides, as long as the line spectral frequency (LSF) parameters in the speech-mode AMR-WB frame used to construct the linear prediction synthesis filter are transmitted without error, the synthetic digital speech frame output by the linear prediction synthesis filter in response to the excitation signal is identical for encoder and decoder.
In the technical solution in which encoder and decoder keep identical excitation signals, the encoder must determine the excitation signal according to the AMR-WB frame it outputs. When the output frame is a background-noise-mode AMR-WB frame, the encoder resets the excitation signal to a fixed value agreed between encoder and decoder. When the output frame is a speech-mode AMR-WB coded frame, the encoder interpolates the excitation of the previous subframe and earlier, using the integer and fractional pitch delay and the long-term-prediction filtering flag (LTP-filtering-flag), to obtain the adaptive codebook signal; this signal scaled by the quantized adaptive codebook gain is added to the fixed codebook signal scaled by the quantized fixed codebook gain, and the result is taken as the excitation signal.
A speech-mode AMR-WB coded frame contains the integer and fractional pitch delay, the quantized adaptive codebook gain and the fixed codebook signal. It does not contain the fixed codebook gain directly, but rather the quantized coding parameter of the correction factor between the fixed codebook gain and the predicted gain $g'_c$. Because the AMR-WB encoder and decoder hold the same predicted gain $g'_c$, both sides can arrive at the same excitation signal.
The AMR-WB encoder agrees on the same fixed codebook predicted gain $g'_c$ with its AMR-WB decoder by agreeing on the same quantified energy prediction errors. The expression for $g'_c$ in formula (3) above shows that the predicted energy $\tilde{E}(n)$ of the subframe is determined solely by the quantified energy prediction errors, that the mean of the innovation energy $\bar{E}$ is a constant, and that the mean innovation energy $E_i$ depends only on the fixed codebook signal, as formula (51) in section 5.9 of 3GPP TS 26.190-510 explains. Hence, by obtaining the coding rate and fixed codebook parameters of the AMR-WB coded frame, the AMR-WB decoder obtains exactly the same $\bar{E}$ and $E_i$ as the AMR-WB encoder. If both sides also use the same quantified energy prediction errors of the four previous subframes to compute the predicted energy $\tilde{E}(n)$ of the subframe, the predicted gain $g'_c$ of encoder and decoder is also identical.
The existing 3GPP standard provides one way of agreeing on the quantified energy prediction error between the AMR-WB encoder and decoder. When the frame sent by the AMR-WB encoder is a speech-mode AMR-WB coded frame, formula (53) in section 5.9 of TS 26.190-510 gives the energy prediction error R(n) as 20 times the logarithm of the correction factor, and the quantified energy prediction error as 20 times the logarithm of the quantized correction factor. When the coded frame is a background noise coding rate frame, the quantified energy prediction errors of the subframes on both the encoder and decoder sides remain unchanged.
This scheme for agreeing on the quantified energy prediction error between the AMR-WB encoder and decoder is not the only one. For example, in the 3GPP AMR-NB scheme, the value is set according to the quantized averaged logarithmic frame energy carried in the AMR-NB coded frame of the background noise coding rate; section 5.2, Frame energy calculation, of 3GPP TS 26.092-500 explains how the averaged logarithmic frame energy is computed from the frame energies of previous frames. In fact, because the method of the invention generates a synthetic digital speech frame for every input signal frame, a correction factor can be generated for every input signal frame, and when the transmitted frame is a silence descriptor frame the correction factors of the four subframes of that frame can be sent to the decoder as well. The encoder and decoder then keep their quantified energy prediction error parameters consistent, at the cost of a small number of additional transmitted bits compared with sending the silence descriptor frame alone.
For AMR-WB coding, not all 256 samples of the previous frame's excitation signal are needed. 3GPP TS 26.190 sets the upper limit of the adaptive codebook search range to 231, so if the search range stays within the range specified by TS 26.190, coding the next frame needs at most the excitation on the most recent 248 samples of the current frame.
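The figure of 248 samples follows from the excitation buffer bounds quoted earlier, n = -(231+17): 231 + 17 = 248. The sketch below spells out this arithmetic and the trimming of a 256-sample frame; the function names are hypothetical, and reading the extra 17 samples as an interpolation margin is an assumption.

```python
def excitation_history_needed(max_pitch_delay=231, extra_margin=17):
    """Number of past excitation samples the next frame's search can reach."""
    return max_pitch_delay + extra_margin

def trim_history(frame_excitation, needed):
    """Keep only the most recent `needed` samples of the frame's excitation."""
    return frame_excitation[-needed:]
```

Applied to a 256-sample frame, this keeps samples 8 to 255 and discards the 8 oldest.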
The technical solution that performs voice activity detection on the synthesized digital speech signal is as follows:
An adaptive multi-rate wideband (AMR-WB) coding method that performs adaptive codebook search, fixed codebook search and AMR-WB coding on one input signal frame of an input signal frame sequence, and codes the next input signal frame adjacent to that input signal frame at a non-background-noise speech mode coding rate, characterized in that:
Linear prediction is performed on said input signal frame, and a linear prediction synthesis filter is determined from the resulting linear prediction parameters; adaptive codebook search and fixed codebook search are performed on said input signal frame at a speech mode coding rate, an excitation signal is generated from the resulting adaptive codebook parameters and fixed codebook parameters, and this excitation signal is filtered by the linear prediction synthesis filter to generate a synthesized digital speech signal frame;
Voice activity detection is performed on said synthesized digital speech signal frame, and the transmission type signal for discontinuous transmission is determined from the voice activity detection result;
If said voice activity detection result is speech, said input signal frame is coded into an AMR-WB coded frame at said speech mode coding rate, and the excitation signal of said input signal frame is generated from the pitch delay, quantized adaptive codebook gain, fixed codebook signal, fixed codebook gain and signal path parameter used in that coded frame; if said voice activity detection result is no speech and said transmission type signal is normal speech (SPEECH_GOOD), said input signal frame is coded into an AMR-WB coded frame at another, lower speech mode coding rate, and the excitation signal of said input signal frame is generated from the pitch delay, quantized adaptive codebook gain, fixed codebook signal, fixed codebook gain and signal path parameter in that frame; if the transmission type signal is silence descriptor update (SID_UPDATE), an AMR-WB silence descriptor (AMR-WB_SID) frame of said input signal frame is generated by coding at the background noise coding rate; if the transmission type signal is silence descriptor first (SID_FIRST), an AMR-WB_SID frame of said input signal frame that carries no information is generated; if said transmission type signal is not SPEECH_GOOD, the excitation signal of said input signal frame is reset;
The next adjacent input signal frame is coded at a non-background-noise speech mode coding rate using the excitation signal of said input signal frame.
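The step in which the excitation signal is passed through the linear prediction synthesis filter can be sketched as an all-pole recursion. This is a minimal sketch under stated assumptions: the function name is hypothetical, the coefficient sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is assumed, and quantization, interpolation and per-subframe processing are left out.

```python
from collections import deque

def synthesize(excitation, lpc, memory=None):
    """Filter `excitation` through 1/A(z), with A(z) = 1 + sum_i lpc[i-1] * z^-i.

    `lpc` holds a_1..a_p; `memory` holds the last p output samples of the
    previous frame (oldest first), or None to start from silence.
    All names here are illustrative assumptions.
    """
    order = len(lpc)
    past = deque(memory if memory is not None else [0.0] * order, maxlen=order)
    synthesized = []
    for u in excitation:
        # s(n) = u(n) - sum_{i=1..p} a_i * s(n - i)
        s = u - sum(lpc[i] * past[-1 - i] for i in range(order))
        synthesized.append(s)
        past.append(s)  # the deque drops the oldest sample automatically
    return synthesized
```

The synthesized frame returned here is the signal on which voice activity detection would then be run, instead of the preprocessed input signal.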
In the above method, the discontinuous transmission (DTX) control and operation module still produces a transmission type signal TX_TYPE for each frame in the input signal frame sequence, but this transmission type signal is determined from the result of voice activity detection performed on the synthesized digital speech signal frame. This differs from the prior art, which does not take the synthesized digital speech signal frame of the coded frame into account.
For the above method, provided that the AMR-WB encoder and decoder keep a consistent quantized energy prediction error, both sides have a consistent excitation signal. There are several ways to keep the quantized energy prediction error consistent, listed one by one below:
First, the encoder updates the quantized energy prediction error from the correction factor in the coded frame only when it sends a speech mode AMR-WB frame, and otherwise leaves it unchanged. That is, if said transmission type signal is SPEECH_GOOD, the quantized energy prediction error of said input signal frame is generated from the correction factor used in the AMR-WB coded frame of said input signal frame at said non-background-noise coding rate: when said voice activity detection result is speech, the device generates the quantized energy prediction errors of the four subframes of said input signal frame from the correction factors given in the AMR-WB coded frame of said input signal frame at said speech mode coding rate; when said voice activity detection result is no speech and said transmission type signal is normal speech (SPEECH_GOOD), the device generates the quantized energy prediction errors of the four subframes of said input signal frame from the correction factors given in the AMR-WB coded frame of said input signal frame at said other, lower speech mode coding rate. If said transmission type signal is not SPEECH_GOOD, the quantized energy prediction errors of the subframes of said input signal frame are set to those of the subframes of the adjacent preceding input signal frame.
Second, the encoder updates the quantized energy prediction error from the correction factor in the coded frame when it sends an AMR-WB frame, and otherwise leaves it unchanged; in addition, when it sends a SID frame or no data (NO_DATA), it also sends the coded correction factor to the decoder, with the correction factor and the AMR-WB frame transmitted independently on separate channels. The decoder updates the quantized energy prediction error from the correction factor in the coded frame when it receives a speech mode AMR-WB frame; when it receives a SID frame or no data, it takes the correction factor from the out-of-band information and updates the quantized energy prediction error from that correction factor.
Third, when said transmission type signal is SPEECH_GOOD, the coding side generates the quantized energy prediction errors of the subframes of said input signal frame from the correction factor used in the AMR-WB coded frame of said input signal frame at said non-background-noise coding rate: when said voice activity detection result is speech, the device generates the quantized energy prediction errors of the four subframes of said input signal frame from the correction factors given in the AMR-WB coded frame of said input signal frame at said speech mode coding rate; when said voice activity detection result is no speech and said transmission type signal is normal speech (SPEECH_GOOD), the device generates them from the correction factors given in the AMR-WB coded frame of said input signal frame at said other, lower speech mode coding rate. When said transmission type signal is silence descriptor first (SID_FIRST) or silence descriptor update (SID_UPDATE), the coding side and the decoding side agree on the same quantized energy prediction error value; this value can be defined in various ways, for example as a fixed value. When said transmission type signal is no data (NO_DATA), the quantized energy prediction errors of the subframes of the preceding input signal frame adjacent to said input signal frame are used as the quantized energy prediction errors of the subframes of said input signal frame.
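The first and third update rules above can be sketched as one small function. This is only an illustration under assumptions: the function and parameter names are hypothetical, transmission types are represented as strings, and the derivation of the four subframe values from the correction factors is not shown.

```python
def next_prediction_errors(tx_type, coded_errors, previous_errors, agreed_sid_errors=None):
    """Choose the quantized energy prediction errors kept for the next frame.

    coded_errors      -- four subframe values derived from the correction factors
                         of the speech-mode frame actually sent (assumed given)
    previous_errors   -- values kept from the preceding input signal frame
    agreed_sid_errors -- None selects the first rule (carry over on any
                         non-speech frame); a value selects the third rule
                         (use the value agreed with the decoder on SID frames)
    """
    if tx_type == "SPEECH_GOOD":
        return list(coded_errors)
    if tx_type in ("SID_FIRST", "SID_UPDATE") and agreed_sid_errors is not None:
        return list(agreed_sid_errors)
    # NO_DATA under either rule, and SID frames under the first rule
    return list(previous_errors)
```

The second rule is not covered here because it additionally depends on correction factors sent out of band.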
Because coding an AMR-WB frame in background noise mode needs neither the excitation signal nor the quantized energy prediction error of the previous frame, in the above encoder the excitation signal and quantized energy prediction error of said input signal frame are used only for coding the next adjacent input signal frame at a non-background-noise speech mode coding rate.
The following describes the technical solution of an AMR-WB encoder that performs VAD directly on the synthesized digital speech signal, namely:
An adaptive multi-rate wideband (AMR-WB) encoder capable of discontinuous transmission, in which linear prediction is performed on an input signal frame, the transmission type TX_TYPE is determined from the voice activity detection result, the coding rate of the AMR-WB coded frame is determined from said voice activity detection result and said TX_TYPE, said input signal frame is coded into an AMR-WB coded frame at this coding rate, an AMR-WB transmit frame of type TX_TYPE is output, and the excitation signal of said input signal frame, used for coding the next input signal frame, is generated, characterized in that:
A linear prediction synthesis filter is determined from the linear prediction parameters obtained by performing linear prediction on the input signal frame;
An excitation signal is generated from the adaptive codebook parameters and fixed codebook parameters obtained by adaptive codebook search and fixed codebook search of the input signal frame at a speech mode coding rate, and this excitation signal is filtered by said linear prediction synthesis filter to generate a synthesized digital speech signal frame;
Said voice activity detection result is obtained by voice activity detection performed on said synthesized digital speech signal frame;
If said voice activity detection result is speech, said input signal frame is coded into an AMR-WB transmit frame from the adaptive codebook parameters and fixed codebook parameters obtained by adaptive codebook search and fixed codebook search of the input signal frame at said speech mode coding rate, and the excitation signal of said input signal frame is generated from the pitch delay, quantized adaptive codebook gain, fixed codebook signal, fixed codebook gain and signal path parameter used in that coded frame;
If said voice activity detection result is no speech and said TX_TYPE is normal speech (SPEECH_GOOD), said input signal frame is coded into an AMR-WB transmit frame at another, lower speech mode coding rate, and the excitation signal of said input signal frame is generated from the pitch delay, quantized adaptive codebook gain, fixed codebook signal, fixed codebook gain and signal path parameter in that frame;
If said TX_TYPE is silence descriptor first (SID_FIRST) or silence descriptor update (SID_UPDATE), the input signal frame is coded into an AMR-WB transmit frame at the background noise coding rate, and the excitation signal of said input signal frame is reset;
If said TX_TYPE is no data (NO_DATA), the excitation signal of said input signal frame is reset.
Because the above AMR-WB encoder first performs voice activity detection (VAD) and then determines TX_TYPE, and because the AMR-WB encoder and decoder make their quantized energy prediction errors consistent through the AMR-WB coded frames exchanged between them, the scheme by which the encoder determines the quantized energy prediction error is relatively simple. For example, when TX_TYPE is SPEECH_GOOD, the quantized energy prediction error is set from the correction factor; when TX_TYPE is SID, it is set from the frame energy of the input signal frame or left unchanged.
Setting the value from the frame energy of the input signal frame (the AMR-WB method) makes the encoder of the present invention compatible with decoders conforming to the 3GPP AMR-WB standard. This encoder comprises a device that determines the quantized energy prediction errors of the four subframes of an input signal frame, which are needed to code the speech mode AMR-WB frame of the next input signal frame adjacent to said input signal frame, characterized in that the device determines the quantized energy prediction errors of the four subframes of said input signal frame according to said voice activity detection result and the transmission type signal TX_TYPE, namely:
When said voice activity detection result is speech, the device generates the quantized energy prediction errors of the four subframes of said input signal frame from the correction factors given in the AMR-WB coded frame of said input signal frame at said speech mode coding rate;
When said voice activity detection result is no speech and said transmission type signal is normal speech (SPEECH_GOOD), the device generates the quantized energy prediction errors of the four subframes of said input signal frame from the correction factors given in the AMR-WB coded frame of said input signal frame at said other, lower speech mode coding rate;
When said TX_TYPE is silence descriptor first (SID_FIRST) or silence descriptor update (SID_UPDATE), the device sets the quantized energy prediction errors of the four subframes of said input signal frame to the averaged logarithmic energy of the quantized frame energy of said input signal frame;
If said transmission type signal is no data (NO_DATA) and said TX_TYPE is normal speech (SPEECH_GOOD), the device uses the quantized energy prediction errors of the subframes of the preceding input signal frame adjacent to said input signal frame as the quantized energy prediction errors of the subframes of said input signal frame.
The most obvious difference between the encoder and coding method of the present invention and the prior art is that the object of VAD is extended to the synthesized digital speech signal, so that formant features in the waveform of the synthesized digital speech signal can be used to detect speech.
Because the synthesized digital speech signal has higher energy at the formants corresponding to the poles of the prediction synthesis filter, the amplitude of its peaks can be examined when voice activity detection is performed on a synthesized digital speech signal frame: if the amplitudes of both the rising edge and the falling edge of a peak, or of either one of them, exceed a threshold, the frame is judged to contain speech. In this way, once the amplitude of the waveform peak produced by the resonance peak corresponding to said pole exceeds the threshold, the synthesized digital speech signal frame will not be missed by VAD. When the phenomenon noted in the Background occurs, in which the peaks of the synthesized digital speech signal are more prominent than those of the original signal, those prominent peaks are more easily detected by comparison with a threshold; likewise, when the rising edges of the peaks of the synthesized digital speech signal are steeper, as noted in the Background, those peaks are more easily detected by comparing the rising edge with a threshold. The threshold used for the rising-edge comparison can be set in more than one way: it can be a fixed value, or it can depend on the synthesized digital speech signal frame that contains the peak. For example, it can refer to the average amplitude of the synthesized digital speech signal frame (the sum of the absolute values of the signal at the samples in the frame), or to the level of a particular subband of the frame; Section 3.3.1, "Filter bank and computation of sub-band levels", of 3GPP TS 26.194-500 gives one way of computing subband levels. For the encoder and coding method of the present invention described above, which obtain parameters from the speech mode coded frame and regenerate the excitation signal, the following VAD waveform detection methods are available:
A threshold is determined from the synthesized digital speech signal frame being examined; if the amplitude of the rising edge of a peak in the waveform of said synthesized digital speech signal frame exceeds this threshold, the voice activity detection result is set to speech.
An amplitude threshold and a range are determined from the synthesized digital speech signal frame being examined; if the number of peaks in the waveform of said synthesized digital speech signal frame whose rising-edge amplitude exceeds the amplitude threshold lies within said range, the voice activity detection result is set to speech.
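The two waveform tests above can be sketched as follows. This is a minimal sketch, not the detector of the embodiment: the way a rising edge is measured (peak height above the preceding local trough), the threshold as a multiple of the mean absolute amplitude, and all function names and default values are assumptions chosen for illustration. A non-empty frame is assumed.

```python
def rising_edge_amplitudes(frame):
    """For each local peak, return its height above the preceding local trough."""
    edges = []
    trough = frame[0]
    for i in range(1, len(frame) - 1):
        if frame[i] <= frame[i - 1] and frame[i] < frame[i + 1]:
            trough = frame[i]                # bottom of a rising edge
        elif frame[i] > frame[i - 1] and frame[i] >= frame[i + 1]:
            edges.append(frame[i] - trough)  # top of a rising edge
    return edges

def frame_threshold(frame, scale):
    """Frame-dependent threshold: `scale` times the mean absolute amplitude (assumed form)."""
    return scale * sum(abs(x) for x in frame) / len(frame)

def vad_by_rising_edge(frame, scale=2.0):
    """Speech if any peak's rising edge exceeds the threshold."""
    threshold = frame_threshold(frame, scale)
    return any(a > threshold for a in rising_edge_amplitudes(frame))

def vad_by_peak_count(frame, scale=2.0, count_range=(1, 4)):
    """Speech if the number of peaks above the threshold lies within count_range."""
    threshold = frame_threshold(frame, scale)
    count = sum(1 for a in rising_edge_amplitudes(frame) if a > threshold)
    return count_range[0] <= count <= count_range[1]
```

A fixed threshold or a subband level could be substituted for `frame_threshold` without changing the rest of the sketch.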
The voice activity detection methods of the prior art remain applicable to the synthesized digital speech signal. When the waveform has many peaks but their rising-edge and falling-edge amplitudes differ little, the prior-art method of comparing signal energy with background noise energy can detect the signal. When the waveform has few peaks, however, the methods given above by the present invention are better at detecting speech.
The technical solution of the present invention also does not exclude performing detection of items such as the signal-to-noise ratio on the digital speech sample signal (or its preprocessed digital signal) and determining TX_TYPE from the detection result, although in the embodiments of the present invention the input to the VAD device is the synthesized digital speech signal rather than the preprocessed speech signal (or the digitally sampled speech signal).
Beneficial effect
Because linear prediction and codebook search are performed first and VAD afterwards, the excitation signal generated from codebook search and linear prediction exists before VAD, and VAD is performed on the output obtained by passing the excitation signal through the linear prediction synthesis filter. Thus, if the synthesized speech signal formed from the original digital speech frame by linear prediction, adaptive codebook search and fixed codebook search has speech features, the VAD result is speech, and the speech features of the digital audio frame that the decoding side produces by decoding the received AMR coded frame of non-background-noise coding rate are similar to those of the synthesized digital audio signal of that coding rate that the coding side used for detection. The coding side can produce an AMR coded frame of SID type only when no active speech can be detected in the synthesized digital audio signal.
The present invention places the object of VAD directly on the synthesized digital speech signal frame corresponding to the AMR coded frame of non-background-noise coding rate. A lower coding rate tends to make the VAD result for the synthesized digital speech signal frame at that rate indicate no active speech. That is, for a speech signal with a given number of frames, with the method of the present invention a lower coding rate increases the number of frames that the VAD decision, made from the difference between the input signal level and the background noise estimate, judges to contain no speech. The present invention can therefore also improve the speech compression ratio of AMR coding, so that the same radio resources can carry more speech signals.
Because linear prediction and codebook search are performed first and VAD afterwards, the excitation signal produced at the non-background-noise coding rate exists before VAD, and the codebook search at the non-background-noise coding rate is executed before VAD. When a no-speech VAD result causes the DTX control and operation module to produce a transmission type indication other than normal speech (SPEECH_GOOD), the parameters of the excitation signal produced while generating the synthesized digital speech signal at the non-background-noise coding rate can no longer be used for coding the next frame at the non-background-noise coding rate. In this case the present invention selectively discards the parameters obtained by linear prediction, adaptive codebook search and fixed codebook search in speech mode. That is, apart from the excitation signal and the quantized energy prediction error parameters, for which the values produced by coding the background noise coding rate frame are used, the parameters obtained by linear prediction, adaptive codebook search and fixed codebook search in speech mode while generating the synthesized digital speech signal can be used for the next input speech frame, and the other parameters produced by linear prediction and codebook search at the non-background-noise coding rate need not be discarded after a SID frame is coded, as they are in the prior art. With this scheme, the synthesized digital speech signal generated for voice activity detection of the next input speech frame retains more features of the input speech signal. In the prior art, once a frame is coded at the background noise rate, the state variables of the AMR-WB encoder, including the excitation signal and the quantized energy prediction error, are all reset, and the encoder then loses the features of the past input speech signal.
When the VAD result is no speech and the transmission type indication produced by the DTX control and operation module is normal speech (SPEECH_GOOD), the speech mode coding rate is lowered only when there is no speech in the synthesized digital speech signal, because voice activity detection operates directly on the synthesized digital speech signal of the input speech frame.
After a speech mode AMR-WB coded frame is received, the receiving decoder and the speech mode coding module in the encoder each refer to the same excitation signal on past samples, including the samples of the subframes of the previous frame, and to the same quantized energy prediction errors of four subframes. One side uses the parameters in the coded frame received over the channel, and the other uses the parameters that it coded into that frame itself; each generates the excitation signal and synthesized speech of every subframe. The excitation signal synthesized by the receiving decoder is therefore identical to that synthesized by said speech coding module, and because the decoder uses an excitation signal consistent with that of the encoder, the listening quality of the decoded synthesized speech is assured.
The VAD method of the present invention that compares the peak amplitude of the synthesized digital speech signal with a threshold can detect the synthesized digital speech signal frame containing a peak when the amplitude of the waveform peak produced by the resonance peak corresponding to a pole of the prediction synthesis filter is above the threshold. When the phenomenon mentioned in the Background, that the peaks of the synthesized digital speech signal are more prominent than those of the original signal, shows up as the rising or falling edge of a peak in the waveform of the synthesized digital speech signal, corresponding to a formant of the original signal, being larger than in the original signal, comparing the peak amplitude of the synthesized digital speech signal with a threshold can detect frames that cannot be detected by examining the peaks of the original signal waveform. Likewise, when the steeper rising edge of the synthesized digital speech signal shows up as the rising edge of such a peak being larger than in the original signal, comparing the rising edge of the peaks of the synthesized digital speech signal with a threshold can detect frames that previously could not be detected. Likewise, when the steeper rising edge shows up as the slope of the rising edge of such a peak being larger than in the original signal, comparing the slope of the rising edge of the peaks of the synthesized digital speech signal with a threshold can detect frames that previously could not be detected.
Description of drawings
Fig. 1 is a block diagram of a variable-rate adaptive multi-rate wideband (AMR-WB) encoder that supports constant-rate speech mode coding.
Fig. 2 is a simplified block diagram of the speech coding module in Fig. 1.
Fig. 3 is a simplified block diagram of the low-rate speech coding module in Fig. 1.
Fig. 4 shows an AMR-WB encoder in which the output of AMR-WB coded frames is controlled by a DTX control and sending device.
Fig. 5 is a simplified block diagram of the speech coding module in Fig. 4.
Fig. 6 is a simplified block diagram of the low-rate speech coding module in Fig. 4.
Fig. 7 shows frame 59 of the preprocessed digital speech signal obtained with T22.inp from 3GPP TS 26.174-540 as the input signal; the 1.16 on the figure denotes the time 1.16 seconds.
Fig. 8 shows frame 59 of the synthesized digital speech signal obtained by coding and decoding at a coding rate of 23.05 kb/s with T22.inp from 3GPP TS 26.174-540 as the input signal; the 1.16 on the figure denotes the time 1.16 seconds.
Embodiment
Embodiment 1: an adaptive multi-rate wideband (AMR-WB) encoder that can switch between constant-rate speech mode and discontinuous transmission (DTX) mode, as shown in Fig. 1. A 14-bit uniform pulse code modulation (PCM) signal frame 1, sampled at 16 kHz, is fed simultaneously to the speech coding module, the low-rate speech coding module and the background noise coding module. The speech coding module outputs the AMR-WB coded frame 11 of signal frame 1, coded at the non-background-noise coding rate, to the coded frame output selection module. The low-rate speech coding module outputs the lower-rate speech mode AMR-WB coded frame 14 of signal frame 1 to the coded frame output selection module. The background noise coding module outputs the AMR-WB silence descriptor coded frame 12 of signal frame 1, coded at the background noise coding rate, to the coded frame output selection module. The speech coding module also outputs the synthesized digital speech signal frame 17, produced while coding signal frame 1, to the voice activity detection module; frame 17 is produced by the method for generating local synthesized speech given in Section 5.10 of 3GPP TS 26.190-500. The voice activity detection module performs voice activity detection on synthesized digital speech signal frame 17 and outputs the result, VAD flag 18, to the discontinuous transmission (DTX) control and operation module and to the post-processing module. The DTX control and operation module outputs transmission type signal 19 to the coded frame output selection module and to the post-processing module.
The coded frame output selection module outputs the received transmission type signal 19 to the 3G (third-generation mobile communication) radio access network (AN). Transmission type signal 19 is one of four types: normal speech (SPEECH_GOOD), silence descriptor first (SID_FIRST), silence descriptor update (SID_UPDATE) and no data (NO_DATA). When transmission type signal 19 is SPEECH_GOOD, the information bits 2 output by the coded frame output selection module are AMR-WB coded frame 11 or AMR-WB coded frame 14, both coded at a non-background-noise coding rate (speech mode): if VAD flag 18 indicates speech, information bits 2 contain AMR-WB frame 11; if VAD flag 18 indicates no speech, information bits 2 contain AMR-WB frame 14. When transmission type signal 19 is SID_UPDATE, information bits 2 are the AMR-WB silence descriptor (AMR-WB_SID) frame 12 coded at the background noise coding rate. When transmission type signal 19 is SID_FIRST, information bits 2 are the SID_FIRST frame formed according to Section 5.1.1 of 3GPP technical specification TS 26.193-500. When transmission type signal 19 is NO_DATA, information bits 2 are invalid for the 3G AN.
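The selection performed by the coded frame output selection module can be sketched as a small function. This is an illustration only: the function name, the string representation of the transmission type, and the use of None for invalid information bits are assumptions; the frame arguments are named after the reference numerals in Fig. 1.

```python
def select_info_bits(tx_type, vad_speech, frame_11, frame_14, frame_12, sid_first_frame):
    """Pick the content of information bits 2 for one input signal frame.

    frame_11        -- speech-mode AMR-WB frame from the speech coding module
    frame_14        -- lower-rate speech-mode AMR-WB frame
    frame_12        -- AMR-WB_SID frame from the background noise coding module
    sid_first_frame -- SID_FIRST frame (its construction is not shown here)
    """
    if tx_type == "SPEECH_GOOD":
        return frame_11 if vad_speech else frame_14
    if tx_type == "SID_UPDATE":
        return frame_12
    if tx_type == "SID_FIRST":
        return sid_first_frame
    if tx_type == "NO_DATA":
        return None  # information bits are invalid for the access network
    raise ValueError(f"unknown transmission type: {tx_type}")
```

Note that all three coded frames exist before the selection is made, which is what allows the choice to depend on a VAD decision taken on the synthesized signal.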
The discontinuous transmission control and operation module also receives coding mode signal 5, which indicates either constant-rate speech mode or discontinuous transmission (DTX) mode. When coding mode signal 5 indicates DTX mode, the transmission type signal 19 sent by the module can be any one of normal speech (SPEECH_GOOD), silence descriptor first (SID_FIRST), silence descriptor update (SID_UPDATE) and no data (NO_DATA); in this case the content of transmission type signal 19 is determined solely by the DTX control and operation module from VAD flag 18. When coding mode signal 5 indicates constant-rate speech mode, the content of transmission type signal 19 is SPEECH_GOOD. That is, VAD flag 18 is still output to the discontinuous transmission control and operation module, but on receiving it (whether it indicates speech or no speech) the module outputs transmission type signal 19 with the content SPEECH_GOOD and resets its state variables to their initial state, and the AMR-WB frame 11 coded by the speech coding module is placed in information bits 2 and sent to the 3G AN.
If the discontinuous transmission (DTX) control and operation module sets the transmission type of transmission type signal 19 to normal speech (SPEECH_GOOD) according to the input VAD flag 18, it also sends the transmission type indication, normal speech (SPEECH_GOOD), to the post-processing module.
In addition to coding mode signal 5, VAD flag 18 and transmission type signal 19, the post-processing module also receives the excitation signal 31 and quantized energy prediction error 32 produced when the speech coding module codes AMR-WB frame 11, and the excitation signal 33 and quantized energy prediction error 34 produced when the low-rate speech coding module codes AMR-WB frame 14. The post-processing module outputs excitation signal 35 and quantized energy prediction error 37 to the speech coding module and the low-rate speech coding module. Excitation signal 35 and quantized energy prediction error 37 are produced as follows:
If coding mode signal 5 indicates constant-rate speech mode, excitation signal 35 and quantized energy prediction error 37 take the values of excitation signal 31 and quantized energy prediction error 32 respectively, and quantized energy prediction error 32 is saved as the previous-frame quantized energy prediction error. If transmission type signal 19 is SPEECH_GOOD and VAD flag 18 indicates speech, excitation signal 35 and quantized energy prediction error 37 take the values of excitation signal 31 and quantized energy prediction error 32 respectively, and quantized energy prediction error 32 is saved as the previous-frame quantized energy prediction error. If transmission type signal 19 is SPEECH_GOOD and VAD flag 18 indicates no speech, excitation signal 35 and quantized energy prediction error 37 take the values of excitation signal 33 and quantized energy prediction error 34 respectively, and quantized energy prediction error 34 is saved as the previous-frame quantized energy prediction error. If transmission type signal 19 is any of silence descriptor first (SID_FIRST), silence descriptor update (SID_UPDATE) and no data (NO_DATA), excitation signal 35 is the reset value with all 248 samples equal to 0, quantized energy prediction error 37 is the saved previous-frame quantized energy prediction error, and the saved previous-frame value is left unchanged.
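The post-processing rules above can be sketched as one function that returns the two outputs together with the value to save for the next frame. This is only a sketch: the function name, the boolean flag for constant-rate mode, the string transmission types and the list representation are assumptions; the argument names follow the reference numerals in Fig. 1.

```python
RESET_LEN = 248  # number of excitation samples in the reset value (from the text)

def post_process(constant_rate_mode, tx_type, vad_speech,
                 exc_31, err_32, exc_33, err_34, saved_err):
    """Return (excitation_35, error_37, new_saved_err) for one input signal frame."""
    if constant_rate_mode or (tx_type == "SPEECH_GOOD" and vad_speech):
        # Use the outputs of the speech coding module.
        return list(exc_31), list(err_32), list(err_32)
    if tx_type == "SPEECH_GOOD":
        # No speech detected but a speech frame is sent: use the low-rate module.
        return list(exc_33), list(err_34), list(err_34)
    # SID_FIRST, SID_UPDATE or NO_DATA: reset excitation, keep the saved error.
    return [0.0] * RESET_LEN, list(saved_err), list(saved_err)
```

Returning the saved value explicitly makes it clear that it changes only when a speech-mode frame is actually sent.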
Fig. 1 is similar to the block diagram of the encoder part on the transmit side (TRANSMIT SIDE) in Fig. 1 of 3GPP TS26.171. The difference lies in the signal that the voice activity detection (Voice Activity Detector) module receives from the speech coding module. In Fig. 1 of 3GPP TS26.171 it is the speech samples after preprocessing by the speech encoder (Speech Encoder) module; in Fig. 1 herein it is the synthetic digital speech signal frame that the speech coding module generates after performing linear prediction and quantization, adaptive codebook search and fixed codebook search on the input speech digital signal frame. In Fig. 1 herein, when the transmission type signal 19 is normal speech (SPEECH_GOOD) or silence descriptor update (SID_UPDATE), the coded frame output selection module selects one of the AMR-WB coded frame generated by the speech coding module, the AMR-WB coded frame generated by the low-rate speech coding module and the AMR-WB silence descriptor (AMR-WB_SID) coded frame generated by the background noise coding module as the information bits (info bits) 2, and sets the VAD flag of an AMR-WB frame at a non-background-noise coding rate in the information bits 2 according to the VAD flag 18. By contrast, in Fig. 1 of 3GPP TS26.171 the speech frame 4 and the silence descriptor frame (SID frame) 5 cannot occur at the same time, so no such selection is performed.
Fig. 2 is a simplified block diagram of the speech coding module of Fig. 1 and shows its signal processing flow. It is essentially the same as Fig. 2 (detailed block diagram of the ACELP encoder) of 3GPP TS26.190-500. In Fig. 2, A(z) is the inverse filter with unquantized coefficients, Â(z) is the inverse filter with quantized coefficients, s(n) is the pre-emphasized signal, T0 is the best open-loop delay, h(n) is the impulse response of the weighted synthesis filter, x(n) is the target signal for the adaptive codebook search, and x2(n) is the target signal for the innovation search. The sections of TS26.190 describe the content of its Fig. 2, and therefore also cover the parts of Fig. 2 herein that are identical to it.
The parameters in the AMR-WB coded frame 11 of Fig. 1 come from the ISF index, pitch index, codebook index, gain vector index and filter index of Fig. 2; the parameters in the AMR-WB coded frame 14 of Fig. 1 come from the ISF index, pitch index, codebook index, gain vector index and filter index of Fig. 3.
Fig. 2 herein differs from Fig. 2 of TS26.190-500 in the following respect: the speech coding module shown in Fig. 2 herein constructs a linear prediction synthesis filter from the parameters obtained by linear prediction and quantization, and filters the excitation signal with this synthesis filter to produce the synthetic digital speech signal frame 17.
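As an illustration of filtering an excitation signal with a linear prediction synthesis filter, the sketch below implements a generic all-pole filter 1/A(z). It assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_m z^-m and floating-point arithmetic. The function name `lp_synthesis` and its `memory` parameter are hypothetical, and the sketch is not the fixed-point routine of any reference codec.

```python
from typing import List, Optional, Sequence


def lp_synthesis(
    excitation: Sequence[float],
    lp_coeffs: Sequence[float],
    memory: Optional[Sequence[float]] = None,
) -> List[float]:
    """Filter `excitation` through 1/A(z), with A(z) = 1 + sum(a_i * z^-i).

    `lp_coeffs` holds a_1..a_m. `memory` holds the last m output samples of
    the previous frame, most recent first (zeros if omitted).
    """
    order = len(lp_coeffs)
    past = list(memory) if memory is not None else [0.0] * order
    if len(past) != order:
        raise ValueError("memory length must equal the filter order")
    output = []
    for u in excitation:
        # s(n) = u(n) - sum_i a_i * s(n - i)
        s = u - sum(a * p for a, p in zip(lp_coeffs, past))
        output.append(s)
        past = ([s] + past)[:order]  # shift the newest sample into the state
    return output
```

For example, a unit impulse through a first-order filter with a_1 = -0.5 yields a decaying response that halves at every sample.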
The excitation signal 35 is supplied to the speech coding module of Fig. 2 herein as the excitation signal of the current frame, and the quantized energy prediction error 37 is used as the quantized energy prediction error of the four subframes of the current frame.
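To show why the errors of four subframes are carried from frame to frame, the sketch below uses a common ACELP gain-prediction convention. It is assumed here rather than quoted from this document: the quantized energy prediction error of a subframe is expressed in decibels as 20·log10(γ), where γ is the gain correction factor carried in the coded frame, and the energy of the next subframe is predicted by a fourth-order moving average over the most recent errors. The function names and the default predictor weights are assumptions for illustration.

```python
import math
from typing import Sequence


def quantized_energy_prediction_error(correction_factor: float) -> float:
    """Return the prediction error in dB for one subframe.

    Assumes the 20*log10 convention; `correction_factor` is the quantized
    ratio between the actual and the predicted fixed codebook gain.
    """
    if correction_factor <= 0.0:
        raise ValueError("correction factor must be positive")
    return 20.0 * math.log10(correction_factor)


def predicted_energy(
    past_errors: Sequence[float],
    weights: Sequence[float] = (0.5, 0.4, 0.3, 0.2),  # assumed MA weights
) -> float:
    """Moving-average prediction from the last four errors, newest first."""
    if len(past_errors) != len(weights):
        raise ValueError("need one past error per predictor weight")
    return sum(b * r for b, r in zip(weights, past_errors))
```

Under this convention, a frame that is not coded in speech mode produces no new correction factors, which is consistent with reusing the saved errors of the previous frame.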
Fig. 3 is a simplified block diagram of the low-rate speech coding module of Fig. 1 and shows its signal processing flow. It is identical to Fig. 2 except that it does not produce a synthetic digital speech signal frame.
Embodiment 2. The AMR-WB encoder shown in Fig. 4 encodes one input speech signal frame. The speech coding module operates at a higher coding rate and the low-rate speech coding module operates at a lower coding rate. The speech signal input frame 42 is a 14-bit uniform PCM frame; 43 is the VAD flag; the speech coding module generates the AMR-WB coded frame 44; the low-rate speech coding module generates the AMR-WB coded frame 41; the background noise coding module generates the AMR-WB silence descriptor (SID) frame 45; 46 is the transmission type indication TX_TYPE; 47 is the information bits passed to the 3G access network; 48 is the synthetic digital speech signal frame that the speech coding module obtains by performing linear prediction and codebook search on the 14-bit uniform PCM frame; and 49 is the preprocessed speech signal frame obtained by preprocessing the 14-bit uniform PCM frame.
The voice activity detection module in Fig. 4 performs detection on the synthetic digital speech signal frame 48. The speech coding module receives the 14-bit uniform PCM input speech signal frame 42 and sends to the voice activity detection module the synthetic digital speech signal frame 48 obtained after linear prediction, adaptive codebook search and fixed codebook search on the preprocessed speech digital signal. That is, the adaptive codebook scaled by the adaptive codebook gain is added to the fixed codebook scaled by the fixed codebook gain to obtain the excitation signal, and the excitation signal is then passed through the linear prediction synthesis filter determined by the linear prediction (LP) parameters Â(z) obtained by linear prediction to obtain the synthetic digital speech signal frame 48 (the linear prediction synthesis filter used to synthesize the digital speech frame may also be determined by the linear prediction parameters A(z)). The voice activity detection module outputs the VAD result obtained by detection on the synthetic digital speech signal frame 48, namely the VAD flag 43, to the DTX control and operation module. By contrast, the method given in the 3GPP technical specifications performs detection on the preprocessed digital speech signal.
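The construction of the excitation signal described above, a gain-scaled adaptive codebook contribution added to a gain-scaled fixed codebook contribution, can be sketched as follows. The function name `build_excitation` and its parameter names are hypothetical, and plain floating-point lists stand in for the codec's internal buffers.

```python
from typing import List, Sequence


def build_excitation(
    adaptive_vec: Sequence[float],  # adaptive codebook contribution v(n)
    fixed_vec: Sequence[float],     # fixed codebook contribution c(n)
    gain_pitch: float,              # adaptive codebook gain
    gain_code: float,               # fixed codebook gain
) -> List[float]:
    """Return u(n) = gain_pitch * v(n) + gain_code * c(n) for one subframe."""
    if len(adaptive_vec) != len(fixed_vec):
        raise ValueError("codebook vectors must have the same length")
    return [gain_pitch * v + gain_code * c
            for v, c in zip(adaptive_vec, fixed_vec)]
```

The resulting excitation would then be passed through the linear prediction synthesis filter to obtain the synthetic frame on which voice activity detection is run.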
The simplified block diagram of the speech coding module of Fig. 4 is shown in Fig. 5, and that of the low-rate speech coding module of Fig. 4 is shown in Fig. 6.
Here, the DTX control and operation module generates the transmission type TX_TYPE 46 according to the VAD flag 43. In this embodiment, the DTX control and operation module also determines, according to the VAD flag 43 and the TX_TYPE signal 46, the content of the information bits 47, as well as the previous-frame excitation signal and quantized energy prediction error used when the speech coding module and the low-rate speech coding module encode an AMR-WB frame (the excitation signal input 51 and the quantized energy prediction error input 50). The specific method is as follows:
When the transmission type 46 of the current frame is not SPEECH_GOOD, the known reset value of the excitation signal is used as the excitation signal input 51, the saved previous-frame quantized energy prediction error is used as the quantized energy prediction error input 50, and the DTX control and operation module leaves the saved previous-frame quantized energy prediction error unchanged. When the VAD flag 43 of the current frame indicates speech (which causes the transmission type 46 to be normal speech, SPEECH_GOOD), the current-frame excitation signal 53 output by the speech coding module is used as the excitation signal input 51, the current-frame quantized energy prediction error 52 from the speech coding module (the quantized energy prediction errors of the four subframes of the current frame) is used as the quantized energy prediction error input 50, and the quantized energy prediction error 52 is saved as the previous-frame quantized energy prediction error. When the transmission type 46 of the current frame is SPEECH_GOOD and the VAD flag 43 indicates no speech, the current-frame excitation signal 55 output by the low-rate speech coding module is used as the excitation signal input 51, the current-frame quantized energy prediction error 54 from the low-rate speech coding module (the quantized energy prediction errors of the four subframes of the current frame) is used as the quantized energy prediction error input 50, and the quantized energy prediction error 54 is saved as the previous-frame quantized energy prediction error. The excitation signal of the current frame comprises at least the signal values at the 248 sample points of the last subframe.
When the VAD flag 43 indicates speech (which causes the transmission type 46 to be normal speech, SPEECH_GOOD), the DTX control and operation module places the AMR-WB coded frame 44 in the information bits 47 and sends it to the 3G access network (AN). When the transmission type 46 is normal speech (SPEECH_GOOD) and the VAD flag 43 indicates no speech, it places the AMR-WB coded frame 41 in the information bits 47 and sends it to the 3G access network (AN). When the transmission type indication 46 is silence descriptor update (SID_UPDATE), it places the adaptive multi-rate silence descriptor (AMR_SID) frame 45 in the information bits 47 and sends it to the 3G access network (AN). When the transmission type indication 46 is silence descriptor first (SID_FIRST), it places a SID_FIRST frame formed according to 3GPP technical specification TS26.093 in the information bits 47 and sends it to the 3G access network (AN). When the transmission type indication 46 is no data (NO_DATA), it instructs the 3G access network not to transmit a speech frame, so the content of the information bits does not matter.
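The frame selection performed by the DTX control and operation module can be summarized in a short sketch. The function name `select_info_bits`, the use of `bytes` for coded frames and the `None` return value for NO_DATA are assumptions made for illustration; the frame numbers and transmission type names follow the text.

```python
from typing import Optional


def select_info_bits(
    tx_type: str,            # transmission type indication 46
    vad_speech: bool,        # VAD flag 43
    frame44: bytes,          # AMR-WB coded frame from the speech coding module
    frame41: bytes,          # AMR-WB coded frame from the low-rate module
    sid_frame45: bytes,      # SID frame from the background noise coding module
    sid_first_frame: bytes,  # SID_FIRST frame formed per the 3GPP specification
) -> Optional[bytes]:
    """Return the content of information bits 47, or None when nothing is sent."""
    if tx_type == "SPEECH_GOOD":
        # Speech present: higher-rate frame; otherwise the low-rate frame.
        return frame44 if vad_speech else frame41
    if tx_type == "SID_UPDATE":
        return sid_frame45
    if tx_type == "SID_FIRST":
        return sid_first_frame
    if tx_type == "NO_DATA":
        # No speech frame is transmitted, so the content is irrelevant.
        return None
    raise ValueError(f"unknown transmission type: {tx_type}")
```

Returning `None` for NO_DATA makes the "nothing to transmit" case explicit instead of passing along arbitrary bits.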
After receiving the preprocessed speech digital signal frame 49 sent by the speech coding module, the background noise coding module generates the AMR-WB silence descriptor (SID) frame 45. The background noise coding module in Fig. 4 is fully consistent with the description in 3GPP TS26.192-500, so the coding of AMR-WB frames at the background-noise coding rate can be implemented simply by following that technical specification.
The speech coding module sends the AMR-WB coded frame 44 to the DTX control and operation module when the VAD flag 43 indicates speech. The low-rate speech coding module sends the AMR-WB coded frame 41 to the DTX control and operation module when the VAD flag 43 indicates no speech but TX_TYPE is SPEECH_GOOD. The background noise coding module sends an AMR-WB frame at the background-noise coding rate to the DTX control and operation module when TX_TYPE is not SPEECH_GOOD.
The LSP index, adaptive codebook index, adaptive codebook gain index, fixed codebook index and fixed codebook gain index of Fig. 5 may be incorporated into the AMR-WB speech-mode coded frame 44; the LSP index, adaptive codebook index, adaptive codebook gain index, fixed codebook index and fixed codebook gain index of Fig. 6 may be incorporated into the AMR-WB speech-mode coded frame 41. The frame format of coded frame 41 or 44 may differ from the frame format given in 3GPP TS26.101, but the AMR-WB coded frame in the information bits 47 conforms to the frame format given in 3GPP TS26.101.

Claims (8)

  1. An adaptive multi-rate wideband (AMR-WB) encoder capable of discontinuous transmission, in which linear prediction is performed on an input signal frame, a transmission type TX_TYPE is determined according to a voice activity detection result, the coding rate of an AMR-WB coded frame is determined according to said voice activity detection result and said TX_TYPE, said input signal frame is encoded into an AMR-WB coded frame at that coding rate, an AMR-WB transmit frame of type TX_TYPE is output, and an excitation signal of said input signal frame is generated for use in encoding the next input signal frame, characterized in that:
    a linear prediction synthesis filter is determined from the linear prediction parameters obtained by performing linear prediction on the input signal frame;
    an excitation signal is generated from the adaptive codebook parameters and fixed codebook parameters obtained by performing adaptive codebook search and fixed codebook search on the input signal frame at one speech-mode coding rate, and this excitation signal is filtered with said linear prediction synthesis filter to generate a synthetic digital speech signal frame;
    said voice activity detection result is obtained by performing voice activity detection on said synthetic digital speech signal frame;
    if said voice activity detection result indicates speech, said input signal frame is encoded into an AMR-WB transmit frame according to the adaptive codebook parameters and fixed codebook parameters obtained by performing adaptive codebook search and fixed codebook search on the input signal frame at said one speech-mode coding rate, and the excitation signal of said input signal frame is generated according to the pitch delay, quantized adaptive codebook gain, fixed codebook signal, fixed codebook gain and signal path parameter used in this coded frame;
    if said voice activity detection result indicates no speech and said TX_TYPE is normal speech SPEECH_GOOD, said input signal frame is encoded into an AMR-WB transmit frame at another, lower speech-mode coding rate, and the excitation signal of said input signal frame is generated according to the pitch delay, quantized adaptive codebook gain, fixed codebook signal, fixed codebook gain and signal path parameter in this frame;
    if said TX_TYPE is silence descriptor first SID_FIRST or silence descriptor update SID_UPDATE, the input signal frame is encoded into an AMR-WB transmit frame at the background-noise coding rate, and the excitation signal of said input signal frame is reset;
    if said TX_TYPE is no data NO_DATA, the excitation signal of said input signal frame is reset.
  2. The encoder according to claim 1, further comprising a device for determining the quantized energy prediction errors of the four subframes of said input signal frame that are needed to encode the speech-mode AMR-WB frame of the next input signal frame adjacent to said input signal frame, characterized in that this device determines the quantized energy prediction errors of the four subframes of said input signal frame according to said voice activity detection result and the transmission type signal TX_TYPE, namely:
    when said voice activity detection result indicates speech, this device generates the quantized energy prediction errors of the four subframes of said input signal frame according to the correction factors given in the AMR-WB coded frame of said input signal frame at said one speech-mode coding rate;
    when said voice activity detection result indicates no speech and said transmission type signal is normal speech SPEECH_GOOD, this device generates the quantized energy prediction errors of the four subframes of said input signal frame according to the correction factors given in the AMR-WB coded frame of said input signal frame at said other, lower speech-mode coding rate;
    when said TX_TYPE is silence descriptor first SID_FIRST or silence descriptor update SID_UPDATE, this device uses the quantized energy prediction errors of the subframes of the previous input signal frame adjacent to said input signal frame as the quantized energy prediction errors of the subframes of said input signal frame.
  3. The encoder according to claim 1 or 2,
    wherein the voice activity detection performed comprises determining whether sound is present according to the signal-to-noise ratio of said synthetic digital speech signal frame.
  4. The encoder according to claim 1 or 2, wherein the voice activity detection performed comprises:
    determining an amplitude threshold and a range according to said synthetic digital speech signal frame, and, if the number of peaks in the waveform of said synthetic digital speech signal frame whose rising-edge amplitude exceeds this amplitude threshold lies within said range, determining the result of said voice activity detection to be that speech is present. Whether sound is present is determined according to the signal-to-noise ratio of said synthetic digital speech signal frame.
  5. A method of performing adaptive codebook search, fixed codebook search and adaptive multi-rate wideband (AMR-WB) coding on one input signal frame in a sequence of input signal frames, and of performing AMR-WB coding at a non-background-noise speech-mode coding rate on the next input signal frame adjacent to this input signal frame, characterized in that:
    linear prediction is performed on said one input signal frame and a linear prediction synthesis filter is determined from the resulting linear prediction parameters; adaptive codebook search and fixed codebook search are performed on said one input signal frame at a speech-mode coding rate, an excitation signal is generated from the resulting adaptive codebook parameters and fixed codebook parameters, and this excitation signal is filtered with this linear prediction synthesis filter to generate a synthetic digital speech signal frame;
    voice activity detection is performed on said synthetic digital speech signal frame, and the transmission type signal of discontinuous transmission is determined according to the voice activity detection result;
    if said voice activity detection result indicates speech, said one input signal frame is encoded into an AMR-WB coded frame at said speech-mode coding rate, and the excitation signal of said one input signal frame is generated according to the pitch delay, quantized adaptive codebook gain, fixed codebook signal, fixed codebook gain and signal path parameter used in this coded frame; if said voice activity detection result indicates no speech and said transmission type signal is normal speech SPEECH_GOOD, an AMR-WB coded frame is generated by encoding said input signal frame at another, lower speech-mode coding rate, and the excitation signal of said one input signal frame is generated according to the pitch delay, quantized adaptive codebook gain, fixed codebook signal, fixed codebook gain and signal path parameter in this frame; if the transmission type signal is silence descriptor update SID_UPDATE, an AMR-WB silence descriptor update SID_UPDATE frame of said input signal frame is generated by coding at the background-noise coding rate; if the transmission type signal is silence descriptor first SID_FIRST, an AMR-WB silence descriptor first SID_FIRST frame of said input signal frame is generated; if said transmission type signal is not SPEECH_GOOD, the excitation signal of said one input signal frame is reset;
    the adjacent next input signal frame is encoded at a non-background-noise speech-mode coding rate according to the excitation signal of said one input signal frame.
  6. The coding method according to claim 5, characterized in that:
    if said voice activity detection result indicates speech, the quantized energy prediction error is generated according to the correction factor in the AMR-WB frame of said one input signal frame at said speech-mode coding rate;
    if said voice activity detection result indicates no speech and said transmission type signal is normal speech SPEECH_GOOD, the quantized energy prediction error is generated according to the correction factor in the AMR-WB frame of said one input signal frame at said other, lower speech-mode coding rate;
    if said transmission type signal is silence descriptor first SID_FIRST, silence descriptor update SID_UPDATE or no data NO_DATA, the quantized energy prediction errors of the subframes of the previous input signal frame adjacent to said one input signal frame are used as the quantized energy prediction errors of the subframes of said one input signal frame.
  7. The method according to claim 5 or 6,
    wherein performing voice activity detection on said synthetic digital speech signal frame comprises determining whether sound is present according to the signal-to-noise ratio of said synthetic digital speech signal frame.
  8. The method according to claim 5 or 6, wherein performing voice activity detection on said synthetic digital speech signal frame comprises:
    determining an amplitude threshold and a range according to said synthetic digital speech signal frame, and, if the number of peaks in the waveform of said synthetic digital speech signal frame whose rising-edge amplitude exceeds this amplitude threshold lies within said range, determining the result of said voice activity detection to be that speech is present. Whether sound is present is determined according to the signal-to-noise ratio of said synthetic digital speech signal frame.
CNA2008100368357A 2008-04-30 2008-04-30 Self-adapting multi-rate broadband coding method and coder Pending CN101572091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100368357A CN101572091A (en) 2008-04-30 2008-04-30 Self-adapting multi-rate broadband coding method and coder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008100368357A CN101572091A (en) 2008-04-30 2008-04-30 Self-adapting multi-rate broadband coding method and coder

Publications (1)

Publication Number Publication Date
CN101572091A true CN101572091A (en) 2009-11-04

Family

ID=41231424

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100368357A Pending CN101572091A (en) 2008-04-30 2008-04-30 Self-adapting multi-rate broadband coding method and coder

Country Status (1)

Country Link
CN (1) CN101572091A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20091104