CN1327408C

CN1327408C - A low bit-rate speech encoder

Info

Publication number: CN1327408C
Application number: CNB2004101032190A
Authority: CN
Inventors: 董恩清
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2004-12-31
Filing date: 2004-12-31
Publication date: 2007-07-18
Anticipated expiration: 2024-12-31
Also published as: CN1632861A

Abstract

The present invention discloses a speech encoder, in particular a low bit-rate speech encoder based on the Local Cosine Transform, suitable for communication systems that require low bit-rate speech coding. A flexibly adjustable shaping function is applied to the bell function used by Donoho, yielding a new bell function that improves the concentration of spectral energy. The Local Cosine Transform coefficients are coded by split vector quantization, with the codebook for each sub-vector designed by the LBG method, and the codebook search during encoding uses a tree-structured search, so that a good low bit-rate speech encoder is realized in the Local Cosine Transform domain. Objective parameter evaluation and informal listening tests indicate that the encoder has better naturalness and intelligibility than an LPC-10e encoder.

Description

A low bit-rate speech encoder

Technical field

The present invention relates to a speech encoder, in particular a low bit-rate speech encoder based on the Local Cosine Transform (LCT), suitable for use in communication systems that require low bit-rate speech coding.

Background art

Low bit-rate speech coding has been a major research topic over the past 20 years, and as a result many speech coding algorithms with bit rates from 16 kb/s down to 2.4 kb/s have been standardized. Current speech coder research focuses on high-quality coding at 4 kb/s and below. Although CELP waveform coders can still produce high-quality speech at bit rates below 6.3 kb/s, when the bit rate drops to 4 kb/s or lower a waveform coding system produces a large amount of quantization noise, because there are not enough bits to encode the waveform detail. Parametric coders (also called vocoders), on the other hand, do not attempt to produce a waveform similar to the original signal; instead they try to find a set of parameters that represents the perceptually important attributes of speech well, but their robustness to noise in particular environments is poor.

For speech coding at 4 kb/s and below, however, recent studies show that coding in the frequency domain has the potential to give better speech quality than existing CELP-based coders. Spectral coders attempt to reconstruct the speech magnitude spectrum rather than to recover the speech waveform exactly. Although the coders above are widely used for low bit-rate speech coding, most of them are limited by the accuracy of the models they assume, and they depend heavily on correct parameter estimation, a requirement that is often hard to guarantee. In particular environments, therefore, these coding methods are not robust, and the quality of the coded speech is limited.

The local cosine bases constructed successively by Coifman and Meyer (1991) and by Auscher et al. (1992) consist of products of smooth, compactly supported bell functions and cosine functions. These localized cosine functions remain orthogonal and have a small Heisenberg product. In recent years the theory of the Local Cosine Transform has been studied widely and in depth. The method is used mostly in image compression coding; it is applied comparatively little in speech signal processing, and even less in speech coding. Malvar H.S., "Lapped transforms for efficient transform/subband coding", IEEE Trans. on Acoust., Speech, Signal Processing, 1990, vol. 38(6), pp. 969-978, nevertheless showed that the coding gain of the LCT method in speech coding is better than that of DCT coding and very close to that of KL transform coding. In particular, compared with DCT coding, the "click" noise between frames is clearly reduced, so the method does not need the overlapped sliding-frame technique that DCT transform coding often adopts to suppress the abnormal clicks that appear between frames. The LCT method therefore requires nearly half the computation of the DCT method. The comparative study of several two-dimensional image coding methods in Wickerhauser M.V., "Comparison of picture compression methods: wavelet, wavelet packet and local cosine", Wavelets: Theory, Algorithms, and Applications, eds. Charles K. Chui, Laura Montefusco and Luigia Puccio, Academic Press, San Diego, California, 1994, pp. 585-621, likewise showed that the LCT method is better than the DCT method in coding gain and also very close to the KL transform method. Studies show that the key to improving the coding gain of transform coding is the choice of orthogonal basis. In the same way, the key in Local Cosine Transform coding is the choice of the local cosine orthogonal basis, and the main factor that affects this choice is the choice of bell function. The few studies above that apply the LCT method to speech coding stop at a simple comparison of coding gain and do not actually design a workable speech encoder.

Summary of the invention

The object of the invention is to exploit the high coding gain of the Local Cosine Transform to provide a practical, good-quality low bit-rate speech encoder operating in the Local Cosine Transform domain.

The technical solution that achieves this object is a low bit-rate speech encoder based on the Local Cosine Transform, in which a high-pass filtering preprocessor processes the original speech signal input to the encoder and the Local Cosine Transform (LCT) is then applied, characterized in that the bell function b_{new}(n) in the LCT satisfies the following condition:

b_{new}(n) = \begin{cases} \sin[πx(n)/2] \cdot ξ_{[n]}(n) & 1 \leq n \leq m \\ 1 & m + 1 \leq n \leq 3m \\ \{1 - [\sin(πx(n - 3m)/2) \cdot ξ_{[n]}(n - 3m)]\}^{1/2} & 3m + 1 \leq n \leq 4m \end{cases},

where m = 80,

and ξ_{[n]}(n) is the shaping function adopted, which satisfies

ξ_{[n+1]}(t) \overset{def}{=} ξ_{[n]}[\sin(πt/2)]

and

ξ_{[0]}(t) \overset{def}{=} ξ(t),

where:

ξ(t) = \begin{cases} 0 & t \leq -1 \\ \sin[π(1 + t)/4] & -1 < t < 1 \\ 1 & t \geq 1 \end{cases}

The subscript n is the number of iterations of the shaping function; the bell function takes values over the width 1 to 4m.

The bell function b_{new}(n) is guaranteed to form a local cosine orthogonal basis when multiplied by the cosine function.

The number of iterations n of the shaping function is 8 to 10.

The LCT coefficients of each frame are first divided, from low frequency to high frequency, into sub-vectors of dimension 40, 40, 40 and 20, and are then split-vector quantized using four different split vector quantization codebooks. The bits allocated to the first through fourth sub-vectors are 12, 12, 8 and 8 respectively, and the gain of each frame is scalar quantized with 8 bits. The bits are output in the order first sub-vector through fourth sub-vector, followed by the gain, for a total of 48 bits, so the bit stream output for each frame is represented with 6 bytes.

The speech encoder also has a speech decoder matched to it.

Because the invention uses a flexibly adjustable shaping function to shape the bell function used by Donoho, it obtains a new bell function that improves the concentration of spectral energy. The Local Cosine Transform coefficients are coded by split vector quantization, and the codebook for each sub-vector is designed by the LBG method. The codebook search during encoding uses a tree-structured search, so that a good low bit-rate speech encoder is realized in the Local Cosine Transform domain. Objective parameter evaluation and informal listening tests show that this encoder has better naturalness and intelligibility than the LPC-10e encoder, and that it is suitable for speech coding in a variety of environments.

Description of drawings

Fig. 1 is a graph showing how the shaping function in the speech encoder of the embodiment changes with the number of recursions;

Fig. 2 is a chart of the percentage increase in low-half-band energy (English + Chinese) obtained with the shaped bell function used in the speech encoder of the embodiment as the number of recursions increases;

Fig. 3 is a structural diagram of the speech encoder of the embodiment;

Fig. 4 is a structural diagram of the speech decoder of the embodiment.

Embodiment

The technical solution of the invention is further described below with reference to the drawings and the embodiment.

Figs. 3 and 4 show the structure of the low bit-rate encoder and of the decoder of this embodiment, respectively.

The key techniques of the embodiment are:

I. Obtaining the optimally shaped bell function

In Fig. 3, the original speech signal input to the encoder is pre-processed by high-pass filtering and then transformed by the LCT. In the LCT, the invention uses the following shaped bell function:

b_{new}(n) = \begin{cases} \sin[πx(n)/2] \cdot ξ_{[n]}(n) & 1 \leq n \leq m \\ 1 & m + 1 \leq n \leq 3m \\ \{1 - [\sin(πx(n - 3m)/2) \cdot ξ_{[n]}(n - 3m)]\}^{1/2} & 3m + 1 \leq n \leq 4m \end{cases},

where m = 80.

The shaped bell function above is obtained by the following steps:

1. The bell function used by Donoho:

In the Local Cosine Transform algorithm that Wickerhauser M.V. set out in the monograph published in 1994, the bell function is fixed once I_j and r are given.

The simple construction of the bell function used by Donoho is as follows. Let I_j = 2m and r = m, so that the bell window width is 4m, and let

t(n) = n - 0.5,  1 ≤ n ≤ m  (1)

x(n) = (1 + t(n)/m)/2  (2)

The bell window function used by Donoho is then:

b(n) = \begin{cases} \sin[πx(n)/2] & 1 \leq n \leq m \\ 1 & m + 1 \leq n \leq 3m \\ [1 - \sin^{2}(πx(n - 3m)/2)]^{1/2} & 3m + 1 \leq n \leq 4m \end{cases} \qquad (3)
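Equations (1) to (3) can be evaluated directly. The following is a minimal Python sketch rather than the patent's own implementation; the function name `donoho_bell` and the 0-based list layout (list index k holds b(k + 1)) are choices made here for illustration.

```python
import math


def donoho_bell(m):
    """Return the bell window of Eq. (3) as a list of 4*m values.

    List index k holds b(n) for n = k + 1, so n runs from 1 to 4m.
    """
    def x(n):
        t = n - 0.5                    # Eq. (1)
        return (1.0 + t / m) / 2.0     # Eq. (2)

    bell = []
    for n in range(1, 4 * m + 1):
        if n <= m:
            # first segment of Eq. (3)
            bell.append(math.sin(math.pi * x(n) / 2.0))
        elif n <= 3 * m:
            # flat middle segment
            bell.append(1.0)
        else:
            # last segment: square root of 1 - sin^2, guarded against
            # a tiny negative value caused by rounding
            s = math.sin(math.pi * x(n - 3 * m) / 2.0)
            bell.append(math.sqrt(max(0.0, 1.0 - s * s)))
    return bell


if __name__ == "__main__":
    window = donoho_bell(80)
    print(len(window), window[0], window[-1])
```

By construction the first and last segments are power-complementary: the square of b(n) plus the square of b(n + 3m) equals 1 for 1 ≤ n ≤ m.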

2. Construction of the shaping function:

Let the real-valued input sequence t(n) be

t(n) = [2(n - 1) - m + 0.5]/(2m),  1 ≤ n ≤ m  (4)

Define a real-valued continuous function

ξ(t) = \begin{cases} 0 & t \leq -1 \\ \sin[π(1 + t)/4] & -1 < t < 1 \\ 1 & t \geq 1 \end{cases} \qquad (5)

Repeatedly replacing t in the formula above with sin(πt/2) yields, for any arbitrarily large fixed integer d, a d-times continuously differentiable function (ξ ∈ C^d). Define the following recursive function:

ξ_{[0]}(t) \overset{def}{=} ξ(t) \qquad (6)

ξ_{[n+1]}(t) \overset{def}{=} ξ_{[n]}[\sin(πt/2)] \qquad (7)

where the subscript of ξ denotes the number of recursions. It can be shown by recursion that the derivatives of ξ_{[n]}(t) up to order 2^n - 1 are 0 at the points t = +1 and t = -1, which means that

ξ_{[n]} \in C^{2^{n} - 1}.

Fig. 1 shows curves of this shaping function for several numbers of recursions, with m = 80.
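The recursion of Eqs. (5) to (7) amounts to applying sin(πt/2) to the argument a fixed number of times and then evaluating ξ. The sketch below illustrates this under one assumption that is not spelled out in the text: arguments outside [-1, 1] are clamped before the recursion, so that the result stays 0 for t ≤ -1 and 1 for t ≥ 1. The names `xi` and `xi_iter` are introduced here for illustration.

```python
import math


def xi(t):
    """Base shaping function of Eq. (5)."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return math.sin(math.pi * (1.0 + t) / 4.0)


def xi_iter(t, depth):
    """Shaping function after `depth` recursions, Eqs. (6) and (7).

    Assumption of this sketch: t is clamped to [-1, 1] first, so the
    function keeps the constant values 0 and 1 outside that interval.
    """
    t = max(-1.0, min(1.0, t))
    for _ in range(depth):
        t = math.sin(math.pi * t / 2.0)   # substitution used in Eq. (7)
    return xi(t)


if __name__ == "__main__":
    for depth in (0, 1, 3, 9):
        print(depth, [round(xi_iter(t / 4.0, depth), 4) for t in range(-4, 5)])
```

Because sin is odd, the recursion preserves the identity ξ(t)² + ξ(-t)² = 1 of the base function, and every ξ_{[n]} passes through sin(π/4) at t = 0; more recursions only make the transition around t = 0 steeper.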

3. Obtaining the shaped bell function:

Different shaping functions are produced by changing the number of recursions. The shaping function ξ_{[n]}(t) obtained after n recursions is used to shape the bell function of Eq. (3), which gives the following new bell function:

b_{new}(n) = \begin{cases} \sin[πx(n)/2] \cdot ξ_{[n]}(n) & 1 \leq n \leq m \\ 1 & m + 1 \leq n \leq 3m \\ \{1 - [\sin(πx(n - 3m)/2) \cdot ξ_{[n]}(n - 3m)]\}^{1/2} & 3m + 1 \leq n \leq 4m \end{cases} \qquad (8)

where m = 80.

The bell function in the formula above is guaranteed to form a local cosine orthogonal basis when multiplied by the cosine function.
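A possible reading of Eq. (8) is sketched below in Python. Two points are assumptions of this sketch and not statements of the patent: the argument passed to the shaping function is the sequence t(n) of Eq. (4), and the last segment is computed as the square root of 1 minus the square of the first-segment value, mirroring the sin² term of Eq. (3). The function names are likewise illustrative.

```python
import math


def xi_iter(t, depth):
    """Shaping function of Eqs. (5)-(7); t is clamped to [-1, 1] (assumption)."""
    t = max(-1.0, min(1.0, t))
    for _ in range(depth):
        t = math.sin(math.pi * t / 2.0)
    return math.sin(math.pi * (1.0 + t) / 4.0)


def shaped_bell(m, depth):
    """Shaped bell window of width 4*m; list index k holds b_new(k + 1)."""
    def rising(n):
        x = (1.0 + (n - 0.5) / m) / 2.0                  # Eqs. (1), (2)
        t = (2.0 * (n - 1) - m + 0.5) / (2.0 * m)        # Eq. (4)
        return math.sin(math.pi * x / 2.0) * xi_iter(t, depth)

    bell = []
    for n in range(1, 4 * m + 1):
        if n <= m:
            bell.append(rising(n))
        elif n <= 3 * m:
            bell.append(1.0)
        else:
            # Assumption: power-complementary last segment, as in Eq. (3).
            r = rising(n - 3 * m)
            bell.append(math.sqrt(max(0.0, 1.0 - r * r)))
    return bell


if __name__ == "__main__":
    window = shaped_bell(80, 9)   # m = 80 and 9 recursions, as in the embodiment
    print(window[0], window[39], window[79])
```

With 9 recursions the shaping function is close to a step at the middle of the first segment, so the window starts near 0 and reaches values near 1 well before n = m.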

In practical problems the best orthogonal basis has to be found over a fixed window width, which requires a bell function that can be adjusted flexibly to suit the problem at hand. The technical approach of this embodiment is to decorrelate the speech signal and remove its redundancy, the aim being to concentrate the spectral energy of a fixed-length speech frame in a few frequency bands so that subband coding becomes easier. To this end, the shaping method of the embodiment provides a shaping function that can flexibly shape the bell function used by Donoho; from these shaping functions, the one suited to frequency-domain speech coding is chosen, and the best bell function is then obtained.

4. Determining the best bell function:

Since the embodiment deals with the practical problem of transform-domain speech coding, what has to be determined is how many recursions of the shaping function give the most suitable bell function when it is used to shape the bell function used by Donoho. In this embodiment, the band of a speech signal with a frame length of 20 ms and a sampling rate of 8 kHz is divided into a high band and a low band. The purpose of shaping the bell function is to concentrate the spectral energy as far as possible in the low half band, which carries more information, so that the subsequent coding stage can allocate bits optimally between the spectral coefficients of the high and low half bands.

Referring to Fig. 2, the embodiment was tested with English and Chinese speech to obtain, as a function of the number of recursions, the increase in the percentage of total spectral energy that falls in the low half band after the Local Cosine Transform when the shaped bell function is used instead of the bell function used by Donoho. As Fig. 2 shows, the increase in spectral energy is largest at 9 recursions, so the embodiment selects the shaping function with 9 recursions. Although the increase is small, it shows that adjusting to a suitable bell function changes the degree of spectral energy concentration, which helps optimize the bit allocation during coding.
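The quantity plotted in Fig. 2 can be expressed as a difference of two percentages. The sketch below is a hedged illustration that assumes the transform coefficients of a frame are already available as a list ordered from low to high frequency; the function names and the toy coefficient values are hypothetical, and the transform itself is not reproduced here.

```python
def low_band_energy_percent(coeffs):
    """Percentage of total spectral energy in the lower half of the coefficients."""
    half = len(coeffs) // 2
    total = sum(c * c for c in coeffs)
    if total == 0.0:
        return 0.0
    low = sum(c * c for c in coeffs[:half])
    return 100.0 * low / total


def low_band_gain(coeffs_shaped, coeffs_reference):
    """Increase, in percentage points, of low-half-band energy share.

    coeffs_shaped: coefficients obtained with the shaped bell function.
    coeffs_reference: coefficients obtained with the unshaped bell function.
    """
    return (low_band_energy_percent(coeffs_shaped)
            - low_band_energy_percent(coeffs_reference))


if __name__ == "__main__":
    # Toy coefficient vectors, not measured data.
    reference = [4.0, 3.0, 2.0, 1.0]
    shaped = [4.0, 3.0, 1.0, 1.0]
    print(low_band_gain(shaped, reference))
```

Repeating this measurement for each candidate number of recursions and keeping the one with the largest increase corresponds to the selection step described above.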

II. Split vector quantization method

Roughly speaking, the first four formants of adult speech lie at about 500 Hz, 1500 Hz, 2500 Hz and 3500 Hz. In effect this divides the speech signal into four important regions, whose spectra should be treated differently during coding. Transform-domain parameters are mostly coded by split vector quantization, so the encoder designed in this embodiment applies split vector quantization to the Local Cosine Transform coefficients. A codebook is trained separately for each sub-vector. After the codebooks have been generated with the LBG algorithm, a tree-structured codebook search is used to speed up the codebook search during encoding and decoding.
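The codebook design step can be illustrated with a compact version of the LBG (Linde-Buzo-Gray) splitting procedure. The sketch below is not the patent's training code: the perturbation size, the fixed number of refinement passes, and the plain full search in `nearest_index` are simplifications chosen here, and the tree-structured search mentioned in the text is not implemented.

```python
def nearest_index(codebook, vec):
    """Full-search index of the code vector closest to vec (squared error)."""
    best, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def lbg_codebook(training, bits, eps=0.01, passes=20):
    """Train a codebook of 2**bits vectors by repeated splitting."""
    dim = len(training[0])
    # Start from the centroid of the whole training set.
    codebook = [[sum(v[d] for v in training) / len(training) for d in range(dim)]]
    for _ in range(bits):
        # Split every code vector into two slightly perturbed copies.
        codebook = [[x + s for x in code] for code in codebook for s in (eps, -eps)]
        for _ in range(passes):
            # Assign each training vector to its nearest code vector.
            cells = [[] for _ in codebook]
            for v in training:
                cells[nearest_index(codebook, v)].append(v)
            # Move each code vector to the centroid of its cell;
            # an empty cell keeps its previous code vector.
            codebook = [
                [sum(v[d] for v in cell) / len(cell) for d in range(dim)]
                if cell else code
                for code, cell in zip(codebook, cells)
            ]
    return codebook


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
    print(lbg_codebook(data, bits=1))
```

In the encoder described here, one such codebook would be trained per sub-vector, with `bits` set to that sub-vector's allocation; the quantizer output for a sub-vector is then the index of the chosen code vector.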

For split vector quantization, the transform coefficients are divided from low frequency to high frequency into sub-vectors containing 40, 40, 40 and 20 coefficients. These four vectors are called the first, second, third and fourth sub-vectors. For a speech signal sampled at 8 kHz, keeping only the spectral components below 3500 Hz is enough to recover speech of satisfactory quality, so the fourth sub-vector uses only 20 coefficients in order to reduce computational complexity. When the decoder synthesizes the speech signal by the inverse transform, the remaining 20 high-frequency coefficients are filled with 0.
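The split and the zero-filling at the decoder can be sketched as follows. The figure of 160 coefficients per frame is inferred here from the 20 ms frame at 8 kHz and from the 140 coded plus 20 zero-filled coefficients; the constant and function names are illustrative.

```python
SPLIT_SIZES = (40, 40, 40, 20)   # sub-vector dimensions, low to high frequency
FRAME_COEFFS = 160               # assumed: 20 ms at 8 kHz


def split_coeffs(coeffs):
    """Cut the first 140 coefficients of a frame into the four sub-vectors."""
    parts, start = [], 0
    for size in SPLIT_SIZES:
        parts.append(list(coeffs[start:start + size]))
        start += size
    return parts


def merge_coeffs(parts):
    """Decoder side: concatenate the sub-vectors and zero-fill the top band."""
    flat = [c for part in parts for c in part]
    return flat + [0.0] * (FRAME_COEFFS - len(flat))


if __name__ == "__main__":
    frame = [float(k) for k in range(FRAME_COEFFS)]
    rebuilt = merge_coeffs(split_coeffs(frame))
    print(len(rebuilt), rebuilt[139], rebuilt[140])
```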

In this embodiment, the numbers of bits allocated to the sub-vectors from low frequency to high frequency are 12, 12, 8 and 8. The gain of the speech encoder is computed as the ratio of the spectral energy of the input signal to the sum of the spectral energies of the four code vectors found by the search during encoding. The gain is quantized with 8-bit scalar quantization. The total bit allocation per frame of the encoder designed in this embodiment is shown in Table 1.
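Packing the four codebook indices and the gain index into the 6-byte frame can be sketched as below. The field order follows the description (first to fourth sub-vector, then gain); placing the first field in the most significant bits and using big-endian byte order are assumptions of this sketch, as are the function names.

```python
FIELD_BITS = (12, 12, 8, 8, 8)   # four sub-vector indices, then the gain index


def pack_frame(i1, i2, i3, i4, gain):
    """Pack five quantizer indices into 6 bytes (48 bits)."""
    word = 0
    for value, bits in zip((i1, i2, i3, i4, gain), FIELD_BITS):
        if not 0 <= value < (1 << bits):
            raise ValueError("index %d does not fit in %d bits" % (value, bits))
        word = (word << bits) | value   # earlier fields end up in higher bits
    return word.to_bytes(6, "big")


def unpack_frame(data):
    """Recover the five indices from a 6-byte frame."""
    word = int.from_bytes(data, "big")
    fields = []
    for bits in reversed(FIELD_BITS):
        fields.append(word & ((1 << bits) - 1))
        word >>= bits
    return tuple(reversed(fields))


if __name__ == "__main__":
    frame = pack_frame(4095, 0, 255, 1, 128)
    print(frame.hex(), unpack_frame(frame))
```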

The input to the speech encoder is a speech signal sampled at 8 kHz in 16-bit PCM format. This embodiment uses speech data in WAV format, so the amplitude level is already normalized. The system places no special requirement on the kind of speech and is suitable for speech coding in various languages.

Evaluation of the encoder of the embodiment:

1. Objective evaluation

The other standardized coders used in the comparison tests with the encoder of the embodiment are G.729 Annex B (G.729B), GSM Half-Rate, FS1016 and FS1015 (LPC-10e). The parameters used for objective evaluation are the signal-to-noise ratio (SNR) and the peak signal-to-noise ratio (PSNR):

SNR = 10 \log_{10} \frac{σ_{x}^{2}}{σ_{e}^{2}} \qquad (9)

Here σ_x² is the mean square of the speech signal and σ_e² is the mean square of the difference between the original speech signal and the reconstructed speech signal.

PSNR = 10 \log_{10} \frac{N X^{2}}{\| x - \tilde{x} \|^{2}} \qquad (10)

Here N is the length of the reconstructed signal, X is the maximum absolute value in the signal x of length N, and \| x - \tilde{x} \|^{2} is the sum of the squared differences between the original signal and the reconstructed signal.
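Equations (9) and (10) translate directly into code. The following Python sketch assumes the original and reconstructed signals are equal-length sequences of samples; returning infinity for a perfect reconstruction is a convention chosen here, and the function names are illustrative.

```python
import math


def snr_db(original, reconstructed):
    """Signal-to-noise ratio of Eq. (9), in dB."""
    n = len(original)
    signal_ms = sum(a * a for a in original) / n
    error_ms = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / n
    if error_ms == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal_ms / error_ms)


def psnr_db(original, reconstructed):
    """Peak signal-to-noise ratio of Eq. (10), in dB."""
    n = len(original)
    peak = max(abs(a) for a in original)
    error_sum = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    if error_sum == 0.0:
        return float("inf")
    return 10.0 * math.log10(n * peak * peak / error_sum)


if __name__ == "__main__":
    x = [1.0, -1.0, 1.0, -1.0]
    y = [0.9, -0.9, 0.9, -0.9]
    print(snr_db(x, y), psnr_db(x, y))
```

For this toy signal both measures evaluate to about 20 dB, because the peak value equals the root-mean-square value; for real speech the PSNR is larger than the SNR.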

It is well known that objective evaluation of coded speech sometimes gives puzzling results. Speech coded by one encoder may have a high signal-to-noise ratio, yet its quality is not necessarily better than that of speech coded by another encoder with a lower signal-to-noise ratio, and the converse also holds. Objective parameters therefore cannot serve as the main indicator of speech coder performance; they can only serve as an auxiliary evaluation.

Table 2 compares the speech encoder of this embodiment (FBR-LCT) with the G.729B, GSM Half-Rate, FS1016 and FS1015 coding standards. The results also bear on how reliable objective evaluation methods are for assessing speech coder performance. G.729B, GSM Half-Rate and FS1016 are all low bit-rate coding standards whose coded speech quality far exceeds that of FS1015 and of the LCT coding method, yet judged by these two indices the LCT method compares quite favorably. Compared with the FS1015 coder at the same bit rate, the SNR and PSNR of the LCT coding method are clearly higher than those of the FS1015 standard, by up to nearly 5 dB.

The coding method used by the encoder of the embodiment operates in the transform domain and is in essence a form of waveform coding, so objective evaluation with the SNR and PSNR indices favors it. Objectively speaking, therefore, a few objective indices alone cannot settle the evaluation of an encoder; they can serve only as a reference.

2. Subjective evaluation:

The final recipient of the speech produced by a speech coder is the human ear, so the quality of coded speech is judged mainly by the listener's auditory perception. Informal listening tests are generally used to evaluate speech quality.

For clean, noise-free speech, the speech reconstructed by the LCT coding method of the embodiment (FBR-LCT) is slightly blurred, so it does not sound as loud and clear as speech reconstructed by LPC-10e. Its clarity is not as high as that produced by the G.729B, GSM Half-Rate and FS1016 coding standards, but its intelligibility and naturalness are good, and clearly better than those of the LPC-10e method at the same bit rate. The LCT coding method is highly robust: its coding distortion is insensitive to changes in the signal, and it remains stable even for signals on which the G.729B, GSM Half-Rate, FS1016 and LPC-10e methods fail. With background music or other non-speech audio, the FBR-LCT coding method is clearly better than the LPC-10e method. This is because the LCT coding method is waveform coding in the transform domain, so it does not depend at all on speech feature parameters such as pitch. By contrast, G.729B, GSM Half-Rate, FS1016 and LPC-10e are based on estimating a source-filter speech production model and linear prediction parameters, and are particularly sensitive to the accuracy of parameter estimation. The low bit-rate encoder based on the Local Cosine Transform of the invention can also be implemented by software simulation.

Table 1

	First sub-vector	Second sub-vector	Third sub-vector	Fourth sub-vector	Gain	Frame total
Bits	12	12	8	8	8	48
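As a check on Table 1, the per-frame bit budget and the resulting bit rate can be worked out from figures already given in the description (20 ms frames; 12, 12, 8 and 8 bits for the sub-vectors and 8 bits for the gain). The symbols B_frame and R are introduced here only for this calculation.

```latex
B_{frame} = 12 + 12 + 8 + 8 + 8 = 48 \text{ bits} = 6 \text{ bytes}

R = \frac{48 \text{ bits}}{20 \text{ ms}} = 2400 \text{ b/s} = 2.4 \text{ kb/s}
```

This agrees with the 2.4 kb/s bit rate listed for FBR-LCT in Table 2.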

Table 2

Encoder	English SNR (dB)	English PSNR (dB)	Chinese SNR (dB)	Chinese PSNR (dB)	Chinese + background music SNR (dB)	Chinese + background music PSNR (dB)	Bit rate (kb/s)
G.729 Annex B	-0.95	15.08	-1.46	18.32	-1.18	15.58	8
GSM Half-Rate	-1.24	14.81	-0.82	19.46	-0.74	16.09	5.6
FS1016	0.71	16.74	1.37	21.63	1.27	18.09	4.8
FS1015 (LPC-10e)	-3.59	12.47	-2.65	17.64	-1.80	15.02	2.4
FBR-LCT	-0.44	15.08	0.26	20.54	-1.07	15.75	2.4

Claims

1. A low bit-rate speech encoder based on the Local Cosine Transform, in which a high-pass filtering preprocessor processes the original speech signal input to the encoder and the Local Cosine Transform is then applied, characterized in that the bell function b_{new}(n) in the Local Cosine Transform satisfies the following condition:

b_{new}(n) = \begin{cases} \sin[πx(n)/2] \cdot ξ_{[n]}(n) & 1 \leq n \leq m \\ 1 & m + 1 \leq n \leq 3m \\ \{1 - [\sin(πx(n - 3m)/2) \cdot ξ_{[n]}(n - 3m)]\}^{1/2} & 3m + 1 \leq n \leq 4m \end{cases},

where m = 80,

and ξ_{[n]}(n) is the shaping function adopted, which satisfies

ξ_{[n+1]}(t) \overset{def}{=} ξ_{[n]}[\sin(πt/2)]

and

ξ_{[0]}(t) \overset{def}{=} ξ(t),

where:

ξ(t) = \begin{cases} 0 & t \leq -1 \\ \sin[π(1 + t)/4] & -1 < t < 1 \\ 1 & t \geq 1 \end{cases}

the subscript n is the number of iterations of the shaping function, and the bell function takes values over the width 1 to 4m;

the Local Cosine Transform coefficients obtained are split-vector quantized: the Local Cosine Transform coefficients of each frame are first divided, from low frequency to high frequency, into sub-vectors of dimension 40, 40, 40 and 20, and are then quantized using four different split vector quantization codebooks; the bits allocated to the first through fourth sub-vectors are 12, 12, 8 and 8 respectively; the gain of each frame is scalar quantized with 8 bits; the bits are output in the order first sub-vector through fourth sub-vector, followed by the gain, for a total of 48 bits, and the bit stream output for each frame is represented with 6 bytes.

2. The low bit-rate speech encoder according to claim 1, characterized in that the bell function b_{new}(n) is guaranteed to form a local cosine orthogonal basis when multiplied by the cosine function.

3. The low bit-rate speech encoder according to claim 1, characterized in that the number of iterations n of the shaping function is 8 to 10.

4. The low bit-rate speech encoder according to claim 1, characterized in that the speech encoder also has a speech decoder matched to it.