WO2009059564A1 - A multi-rate speech audio encoding method

Info

Publication number: WO2009059564A1
Application number: PCT/CN2008/072946
Authority: WO
Inventors: Zexin Liu; Fuwei Ma; Wei Xiao
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2007-11-05
Filing date: 2008-11-05
Publication date: 2009-05-14
Also published as: CN101430879B; CN101430879A

Abstract

A multi-rate speech and audio encoding method includes: calculating a difference signal between an input signal and the synthesized speech obtained by encoding and then decoding the input signal; calculating the index value of the lattice point nearest to each spectrum vector of the difference signal; calculating a first ratio of the spectrum of the perceptual weighting filter corresponding to the difference signal to the spectrum of the synthesized speech; and writing the index values of the lattice points into the code stream in descending order of the first ratio (212).

Description

A multi-rate speech and audio coding method. The present application claims priority to Chinese patent application No. 200710165110.3, filed on November 5, 2007 and entitled "A multi-rate speech and audio coding method", the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to coding techniques, and more particularly to a method of multi-rate speech and audio coding.

Background art

In current multi-rate speech and audio coding (for example, AMR-WB+, G.729.1, and G.VBR), transform-domain coding is mostly used when the input signal is more music-like or the code rate is higher. That is, the time-domain signal is transformed into the frequency domain by a transform such as the modified discrete cosine transform (MDCT) or the fast Fourier transform (FFT). When the parameters of the transform-domain coding are quantized, lattice vector quantization is used. FIG. 1 is a flow chart of a prior-art lattice vector quantization method. As shown in FIG. 1, lattice vector quantization generally includes the following steps:

Step 101: Find the grid point nearest to each spectrum vector of the frequency-domain signal.

That is, for each input spectrum vector, the nearest grid point Ck is found in the 8-dimensional Gosset lattice (called the RE8 lattice), the Z8 lattice, the Z16 lattice, or a similar lattice.

Step 102: Determine the index value of each grid point according to the spectral energy corresponding to the grid point and the total number of bits.

When Ck is in the base codebook, the index value includes the codebook index value nk and the corresponding codeword index value Ik. When Ck is not in the base codebook, a Voronoi extension is applied to Ck, and the index value includes, in addition to nk and Ik, an index kv of the extended codebook.

Step 103: When the number of bits is insufficient, forcibly set to zero all elements of the grid points corresponding to the frequencies with less energy.

Step 104: Write the index values of the grid points into the code stream in order from low frequency to high frequency.

Step 105: At the decoding end, decode the quantized spectrum sequentially from low frequency to high frequency according to the decoded index values.

According to the masking effect, a strong signal can mask nearby signals of lower sound intensity, so that the human auditory system cannot perceive the masked signals. The human auditory system also masks signals by itself: when the sound intensity of a signal is below a certain threshold, the auditory system does not perceive it even if no other signal masks it. Such masking is called absolute masking. Tests show that the absolute masking threshold decreases with increasing frequency in the range of 0 to 500 Hz and is almost constant in the range of 500 to 5000 Hz.

When lattice vector quantization is performed on the spectrum of the input signal, some elements of the grid points (that is, the quantized values corresponding to the spectrum of the input signal) may be forced to zero because the total number of bits is limited. If grid points corresponding to frequency modules that carry important information are set to 0, the coding quality is greatly reduced. A decision criterion is therefore needed to determine which grid points are set to 0 and which are retained.

In the above encoding method, the encoding side stores the lattice points of the transform domain in an array in order of spectral energy from largest to smallest, and solves the index value of each lattice point in that array from first to last. When the number of bits is insufficient, the elements of the grid points whose spectral energy is relatively small (that is, the grid points placed relatively late in the array) are all forcibly set to 0, and their index values are also 0. However, the encoding end writes all index values (both zero and non-zero) into the code stream in order from low frequency to high frequency. Consequently, when the number of bits is insufficient, the decoding end can recover only a small amount of low-frequency information from the code stream, and part of that low-frequency information corresponds to spectrum that was set to 0, while some grid points whose elements were not set to 0 lack the bits needed to code them. As a result, when some bits are added to increase the code rate, the quality of the output speech or audio signal does not improve significantly.

In addition, when the index values of the grid points are solved, the importance of a grid point is judged by the magnitude of its spectral energy, and when the number of bits is insufficient the elements of the grid points with smaller spectral energy are set to 0. However, spectral energy alone does not necessarily identify the important components, so this criterion may set to 0 the elements of grid points corresponding to some important components, which degrades the quality of the output signal. Moreover, according to the masking effect, whether a signal is masked depends not only on the magnitude of its spectral energy but also, to some extent, on the difference between the masking signal and the masked signal, and the above coding method does not take this difference into account.

Summary of the invention

In view of this, the main purpose of the embodiments of the present invention is to provide a multi-rate speech and audio coding method that improves the quality of the output speech (audio) signal when the bits available for transform-domain coding are insufficient.

To achieve the above objective, the technical solution in the embodiments of the present invention is implemented as follows:

A method for multi-rate speech and audio coding, the method comprising:

Solving a difference signal between the input signal and the synthesized speech obtained by encoding and then locally decoding the input signal;

Solving an index value corresponding to the lattice point nearest to each spectrum vector of the difference signal;

Solving a first ratio of a spectrum of the perceptual weighting filter corresponding to the difference signal to a spectrum of the synthesized speech;

Writing the index values corresponding to the grid points into the code stream in descending order of the first ratio.

In summary, an embodiment of the present invention provides a method for multi-rate speech and audio coding. In this method, the difference signal between the input signal and the synthesized speech obtained by encoding and then locally decoding the input signal is solved first. The index value of the grid point nearest to each spectrum vector of the difference signal is then obtained, together with a first ratio of the spectrum of the perceptual weighting filter corresponding to the difference signal to the spectrum of the synthesized speech. Finally, the index values of the grid points are written into the code stream in descending order of the first ratio. In accordance with the function of the perceptual weighting filter, the more important information is quantized finely and written into the code stream first, and the less important information is quantized coarsely, so that the decoding end can decode the more important information when the number of bits is insufficient, thereby improving the quality of the decoded speech.

Brief description of the drawings

FIG. 1 is a flow chart of a lattice vector quantization method in the prior art;

FIG. 2 is a flowchart of a multi-rate speech and audio encoding method according to an embodiment of the present invention;

FIG. 3 is a flowchart of a multi-rate speech and audio encoding method according to another embodiment of the present invention.

Detailed description

The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

FIG. 2 is a flowchart of a multi-rate speech and audio encoding method according to an embodiment of the present invention. Specifically, this embodiment includes the following steps:

Step 201: The encoder receives the input signal.

Step 202: The encoding end performs CELP (Code Excited Linear Prediction) encoding on the input signal. The coding method is not limited to CELP, and other methods can be used.

In the CELP encoding process, the input signal can be encoded in two layers (for example, an L1 layer and an L2 layer). In the present invention, the input signal may also be encoded without layering, or divided into one or more layers. The number of layers encoded determines how many different speech coding rates can be achieved, and it can be chosen according to actual needs.

Step 203: Perform local decoding on the CELP-encoded signal to obtain a decoded signal.

Step 204: Solve the spectrum from the decoded signal obtained in step 203. Specifically, the spectrum Freq_R2 of the synthesized speech (audio) of the first two layers in the decoded signal is solved. The number of layers solved in this step should equal the number of layers encoded in step 202.

Step 205: Solve the difference signal between the input signal and the synthesized speech signal of the first two layers of the decoded signal obtained in step 203, and then solve the MDCT coefficients of the difference signal, that is, the spectrum Freq_err.
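As an illustration of step 205, the following is a minimal sketch that forms the difference signal and computes its MDCT coefficients directly from the textbook MDCT definition. It assumes one frame of 2N time-domain samples, applies no analysis window, and uses a slow O(N^2) summation; the function names `mdct` and `difference_spectrum` are hypothetical and do not come from the original text.

```python
import math

def mdct(frame):
    """Direct O(N^2) MDCT of a 2N-sample frame, returning N coefficients.

    Assumption: no analysis window is applied; a real codec would window the frame.
    """
    two_n = len(frame)
    if two_n == 0 or two_n % 2:
        raise ValueError("frame length must be a positive even number")
    n_half = two_n // 2
    shift = 0.5 + n_half / 2.0
    return [
        sum(frame[n] * math.cos(math.pi / n_half * (n + shift) * (k + 0.5))
            for n in range(two_n))
        for k in range(n_half)
    ]

def difference_spectrum(input_frame, synth_frame):
    """Freq_err: MDCT of (input signal - locally decoded synthesized speech)."""
    if len(input_frame) != len(synth_frame):
        raise ValueError("frames must have the same length")
    diff = [x - y for x, y in zip(input_frame, synth_frame)]
    return mdct(diff)
```

With the RE8 example used later in the text, 35 spectrum vectors of 8 coefficients would correspond to 280 coefficients per frame.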

Step 206: Perform an operation (for example, a rounding operation) on the MDCT coefficients from step 205 to obtain, for each spectrum vector, the grid point nearest to it. With RE8, 35 grid points are obtained.
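The rounding-based search in step 206 can be sketched as follows. This is a minimal sketch that assumes the usual definition of RE8 as the union of 2·D8 and 2·D8 + (1, ..., 1), where D8 is the set of integer vectors with an even coordinate sum, and it uses the well-known two-coset rounding search. The function names are hypothetical, and any scaling of the spectrum before the search is left out.

```python
import math

def nearest_d8(x):
    """Nearest point of D8 (integer vectors whose coordinates sum to an even number)."""
    rounded = [math.floor(v + 0.5) for v in x]
    if sum(rounded) % 2 == 0:
        return rounded
    # Odd parity: re-round the coordinate with the largest rounding error the other way.
    errors = [v - r for v, r in zip(x, rounded)]
    j = max(range(len(x)), key=lambda i: abs(errors[i]))
    rounded[j] += 1 if errors[j] >= 0 else -1
    return rounded

def nearest_re8(x):
    """Nearest point of RE8, assumed here to be 2*D8 union (2*D8 + (1,...,1))."""
    cand_a = [2 * c for c in nearest_d8([v / 2.0 for v in x])]
    cand_b = [2 * c + 1 for c in nearest_d8([(v - 1.0) / 2.0 for v in x])]

    def dist(p):
        return sum((v - c) ** 2 for v, c in zip(x, p))

    return cand_a if dist(cand_a) <= dist(cand_b) else cand_b
```

Each 8-dimensional spectrum vector is searched independently, so a frame with 35 vectors would call `nearest_re8` 35 times.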

Step 207: Compute a ratio Ratio[k] from the difference between the spectrum Freq_R2 of step 204 and the spectrum Freq_err of step 205, and divide the grid points of step 206 into N regions according to their order. Taking RE8 as an example, the formula is as follows:

$$Ratio[k] = \sum_{i=0}^{7} \frac{Freq\_R2[l] - Freq\_err[l]}{Freq\_R2[l]} \qquad (1)$$

where l = 8*k+i, k = 0, 1, 2, ..., 34, and i = 0, 1, 2, ..., 7.

Solving formula (1) yields an array of ratios Ratio[k], in which each ratio uniquely corresponds to one grid point of step 206.

If the RE8 lattice is used, an array Ratio[k] consisting of 35 ratios is obtained. The grid points are divided into N regions according to their order in the array Ratio[k], where N is an integer greater than or equal to 1.
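The per-vector ratio of step 207 can be sketched as below. The sketch assumes that formula (1) sums, over the 8 coefficients of spectrum vector k, the quantity (Freq_R2[l] - Freq_err[l]) / Freq_R2[l] with l = 8*k+i. The function name `ratio_per_vector` and the small `eps` guard against division by zero are assumptions added for illustration and are not part of the original text.

```python
def ratio_per_vector(freq_r2, freq_err, dim=8, eps=1e-12):
    """Ratio[k] = sum_i (Freq_R2[l] - Freq_err[l]) / Freq_R2[l], with l = dim*k + i.

    Assumption: coefficients of Freq_R2 that are (almost) zero are replaced
    by eps so that the division is always defined.
    """
    if len(freq_r2) != len(freq_err) or len(freq_r2) % dim:
        raise ValueError("spectra must have equal length, a multiple of dim")
    ratios = []
    for k in range(len(freq_r2) // dim):
        total = 0.0
        for i in range(dim):
            l = dim * k + i
            denom = freq_r2[l] if abs(freq_r2[l]) > eps else eps
            total += (freq_r2[l] - freq_err[l]) / denom
        ratios.append(total)
    return ratios
```

For 280 coefficients and `dim=8` this returns 35 ratios, one per grid point, matching the RE8 example.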

Step 208: Sort the grid points.

The N regions are ordered according to the auditory characteristics of the human ear, so that the signals that are less likely to be masked are placed first. In this embodiment, the regions are arranged in reverse order: the grid points in the last region of the array Ratio[k] are placed first, the grid points in the first region are placed last, and the other regions are arranged accordingly. The grid points within each region are then arranged in order of Ratio[k] from largest to smallest, and the reordered grid points are placed in a new array in that order.

The following is a specific example. Ratio[k] is divided into two regions: the first n ratios form the first region and the remaining (35-n) ratios form the second region, where n is an integer greater than or equal to 1. The grid points of step 206 that belong to the second region are sorted in descending order of Ratio[k] and placed in the first (35-n) elements of a 35-element array R[k]. Similarly, the grid points of step 206 that belong to the first region are sorted in descending order of Ratio[k] and placed in the last n elements of the array R[k].

When Ratio[k] is divided into two regions as in this embodiment, the number of values assigned to the first region should be preset according to the actual application. When there are multiple regions, they are ordered according to the coding characteristics of the first few layers of the encoder or the characteristics of the MDCT coefficients of the difference signal: the region containing the grid points of signals that are less likely to be masked is ranked first, and the region containing the grid points of signals that are more likely to be masked is ranked last.
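The two-region example of step 208 can be sketched as follows. The function returns the grid-point numbers k in their new order: the second region first, then the first region, each sorted by Ratio[k] in descending order. The function name `order_grid_points` is hypothetical, and ties are resolved by the original position, which is an assumption because the text does not address ties.

```python
def order_grid_points(ratio, n):
    """Return grid-point numbers reordered as in the two-region example.

    ratio: list of Ratio[k] values, one per grid point.
    n:     number of grid points in the first region (1 <= n <= len(ratio)).
    """
    if not 1 <= n <= len(ratio):
        raise ValueError("n must satisfy 1 <= n <= len(ratio)")
    first_region = range(n)
    second_region = range(n, len(ratio))

    def descending(k):
        return -ratio[k]

    # sorted() is stable, so equal ratios keep their original relative order.
    return sorted(second_region, key=descending) + sorted(first_region, key=descending)
```

For example, with `ratio = [0.1, 0.9, 0.5, 0.7, 0.2]` and `n = 2`, the second region (points 2, 3, 4) is placed first in the order 3, 2, 4, followed by the first region in the order 1, 0.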

Step 209: Solve the spectrum W_Freq of the perceptual weighting filter corresponding to the difference signal. The perceptual weighting filter H(z) satisfies the formula:

$$H(z) = \frac{A(z/\gamma)}{1 - \beta z^{-1}}$$

where A represents the linear prediction coefficients, which reflect the spectral envelope of the high frequency band, z is the frequency-domain (z-transform) variable, and β and γ are weighting factors, which are generally constants.

The spectrum W_Freq is obtained by applying the MDCT to the perceptual weighting filter H(z).
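One possible way to obtain a time-domain representation of the perceptual weighting filter before transforming it is sketched below. The sketch assumes that H(z) has the common form A(z/γ) / (1 - β z^-1) with A(z) = a[0] + a[1] z^-1 + ... (so `a` includes the leading coefficient, normally 1), and it computes a truncated impulse response. The function name, the argument layout, and the truncation length are assumptions; the original text only states that W_Freq is obtained from H(z) by MDCT.

```python
def weighting_impulse_response(a, gamma, beta, length):
    """Truncated impulse response of H(z) = A(z/gamma) / (1 - beta * z^-1).

    a:      coefficients of A(z), a[0] first (assumed layout).
    gamma:  weighting factor applied as a[i] * gamma**i.
    beta:   coefficient of the one-pole denominator.
    length: number of output samples.
    """
    numerator = [c * gamma ** i for i, c in enumerate(a)]
    response = []
    previous = 0.0
    for n in range(length):
        x = numerator[n] if n < len(numerator) else 0.0
        previous = x + beta * previous  # y[n] = x[n] + beta * y[n-1]
        response.append(previous)
    return response
```

The resulting samples could then be transformed into the frequency domain to give W_Freq.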

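Step 210: Compute a ratio Rat[k] from the spectrum Freq_R2 of step 204 and the spectrum W_Freq of step 209. The formula is as follows:

$$Rat[k] = \frac{\sum_{i=0}^{7} \log\left((Freq\_R2[l])^2 + (Freq\_R2[l+1])^2\right)}{\sum_{i=0}^{7} \log\left((W\_Freq[l])^2 + (W\_Freq[l+1])^2\right)}$$

where l = 8*k+i, k = 0, 1, 2, ..., 34, and i = 0, 1, 2, ..., 7.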

Step 211: Solve the index value of each grid point. The index values of the grid points are obtained according to the data in the array of step 208 and the total number of bits that can be utilized. When the number of bits is insufficient, the elements of the last m grid points in the array of step 208 are set to 0, and the index values of these grid points are also 0. Here m is a predetermined integer greater than or equal to 1, and its value can be set in advance according to the total number of bits.
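The zeroing rule of step 211 can be sketched as follows, assuming the grid points are already stored in the order produced by step 208 and that m has been chosen in advance. The function name `zero_last_points` is hypothetical; computing the actual lattice index values is outside the scope of this sketch.

```python
def zero_last_points(ordered_points, m):
    """Return a copy of the ordered grid points with the last m points set to 0.

    ordered_points: list of grid points (each a list of integers), in the
                    order produced by the sorting step.
    m:              number of trailing grid points to force to zero.
    """
    if not 0 <= m <= len(ordered_points):
        raise ValueError("m must satisfy 0 <= m <= number of grid points")
    keep = len(ordered_points) - m
    return [
        list(point) if position < keep else [0] * len(point)
        for position, point in enumerate(ordered_points)
    ]
```

Because the least important grid points sit at the end of the array, only they lose their elements when bits run out.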

Steps 210 and 211 in this embodiment are not limited to the above sequence; step 211 may be performed first, followed by step 210.

Step 212: Write the index values of the grid points into the code stream in order of their Rat[k] values from largest to smallest. The index values of the grid points with large Rat[k] values (the more important signals) are written into the code stream first, and the index values of the grid points with small Rat[k] values are written later.
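The write order of step 212 can be sketched as follows. The sketch treats the code stream as a simple list of (grid-point number, index value) pairs, which is an assumption made for illustration; real bit packing is not shown, and the function names are hypothetical.

```python
def codestream_order(rat):
    """Grid-point numbers sorted by Rat[k] from largest to smallest."""
    return sorted(range(len(rat)), key=lambda k: -rat[k])

def write_indices(index_values, rat):
    """Return (k, index value) pairs in the order they would enter the code stream."""
    if len(index_values) != len(rat):
        raise ValueError("one index value is required per Rat[k]")
    return [(k, index_values[k]) for k in codestream_order(rat)]
```

With `rat = [0.2, 1.5, 0.9]`, grid point 1 is written first, then grid point 2, then grid point 0.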

Of course, the invention is not limited to the above embodiment. It is not limited to the RE8 lattice used in this embodiment; other lattices such as the Z8 lattice can be used. The grid points in step 208 may also be sorted according to other principles. For example, the global index order determining mode may be used, in which the grid points are not divided into regions and are arranged only by the magnitude of Ratio[k]. Which scheme is used to sort the grid points can be chosen according to actual needs. Steps 201 to 212 in this embodiment are also not limited to the above order.

From the above, the decision criterion represented by formula (1) in this embodiment is more consistent with the masking effect. According to the masking effect, the smaller the difference between Freq_R2 and Freq_err, the closer the two spectra are and the less likely Freq_err is to be masked. In addition, for the same difference, the larger the ratio of that difference to the locally decoded Freq_R2, the less likely Freq_err is to be masked. Therefore, the method in this embodiment ensures that the grid points corresponding to signals that are less likely to be masked are not forcibly set to 0, so that when the number of bits is insufficient, the more important information is quantized more finely and written into the code stream first, while the less important information is quantized coarsely.

In addition, in this embodiment the order of the grid-point index values in the code stream is determined by the ratio of the spectrum Freq_R2 of the locally decoded CELP-encoded synthesized speech (audio) to the spectrum W_Freq of the perceptual weighting filter. The reason is as follows. By the function of the perceptual weighting filter, larger distortion is allocated where the spectral energy of the input signal is larger, and smaller distortion is allocated where the spectral energy is smaller. For a CELP-encoded signal, this results in relatively coarse quantization where Rat is larger, and these are the places on which the lattice vector quantization in this embodiment focuses. Therefore, by placing the index values of the grid points with larger Rat earlier in the code stream and the index values of the grid points with smaller Rat later, more important information can be decoded when the number of bits at the decoding end is insufficient, thereby improving the quality of the decoded speech.

In another embodiment of the invention, the order of the lattice codebook indices in the code stream can also be determined by using a fixed spectral value as a demarcation point. FIG. 3 is a flowchart of a multi-rate speech and audio encoding method according to another embodiment of the present invention. As shown in FIG. 3, the multi-rate speech and audio encoding method in this embodiment includes the following steps:

Step 301: Solve the MDCT coefficients of the R3 and R4 layers, that is, the MDCT coefficients of the first three layers and the first four layers. For the specific solution method, refer to the previous embodiment; details are not repeated here. Taking the RE8 lattice as an example, the 35 frequency modules corresponding to the MDCT coefficients are divided into two regions, 0~2 kHz and 2~7 kHz, according to the spectrum range. For example, the first 10 frequency modules cover 0~2 kHz and the last 25 frequency modules cover 2~7 kHz. The number of frequency modules in each of the two regions may differ between embodiments.

Step 302: Take the grid points whose spectrum range is 2~7 kHz for processing.

Step 303: Determine whether the total number of bits is sufficient. If so, go to step 305; otherwise, go to step 304.

Step 304: Set to 0 the elements of the grid points with smaller spectral energy in the 2~7 kHz range. That is, sort the grid points in the 2~7 kHz range in order of spectral energy from largest to smallest, and set to 0 the elements of the n grid points with the smallest spectral energy, where n is an integer greater than or equal to 1 that can be set in advance according to the actual application.

Step 305: Solve the index values of the grid points in the 2~7 kHz range. The index value of any grid point whose elements were set to 0 is also set to 0.

Step 306: Take the grid points whose spectrum range is 0~2 kHz for processing.

Step 307: Determine whether the total number of bits is sufficient. If so, go to step 309; otherwise, go to step 308.

Step 308: If the total number of bits is insufficient, set to 0, according to the total number of available bits, the elements of the grid points with smaller spectral energy in the 0~2 kHz range. That is, sort the grid points in the 0~2 kHz range in order of spectral energy from largest to smallest, and set to 0 the elements of the m grid points with the smallest spectral energy, where m is an integer greater than or equal to 1 that can be set in advance according to the actual application.

Step 309: Solve the index values of the grid points in the 0~2 kHz range. The index value of any grid point whose elements were set to 0 is also set to 0.

Step 310: Write the obtained index values into the code stream according to the order of the grid points. Specifically, the grid points corresponding to the 2~7 kHz spectrum are placed first and the grid points corresponding to the 0~2 kHz spectrum are placed last. In this way, each index value is written into the code stream according to its importance in decoding.

Step 311: End the encoding.
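Steps 302 to 310 can be sketched as follows. The sketch assumes the grid points are numbered from low to high frequency, that the first `n_low` of them cover 0~2 kHz, and that the numbers of points to zero in each band have already been decided from the bit budget. Keeping the frequency order inside each band is an assumption, since the text only fixes the order of the two bands. All names are hypothetical.

```python
def smallest_energy(points, energies, count):
    """Numbers of the `count` grid points with the smallest spectral energy."""
    return set(sorted(points, key=lambda k: energies[k])[:count])

def band_split_order(energies, n_low, zero_high=0, zero_low=0):
    """Return (order, zeroed) for the band-split embodiment.

    energies:  spectral energy per grid point, low frequency first.
    n_low:     number of grid points in the 0~2 kHz band.
    zero_high: grid points to zero in the 2~7 kHz band (0 if bits suffice).
    zero_low:  grid points to zero in the 0~2 kHz band (0 if bits suffice).
    """
    if not 0 <= n_low <= len(energies):
        raise ValueError("n_low must lie between 0 and the number of grid points")
    low = list(range(n_low))
    high = list(range(n_low, len(energies)))
    zeroed = (smallest_energy(high, energies, zero_high)
              | smallest_energy(low, energies, zero_low))
    order = high + low  # 2~7 kHz band first, 0~2 kHz band last
    return order, zeroed
```

For the example in the text, `n_low` would be 10 out of 35 grid points, so grid points 10 to 34 are written before grid points 0 to 9.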

Through the above method, the index values of the grid points corresponding to the 2~7 kHz MDCT spectrum are placed at the front of the code stream, and the index values of the grid points corresponding to the 0~2 kHz MDCT spectrum are placed behind them, forming a complete code stream.

Step 301 in the present invention is not limited to solving the MDCT coefficients of the R3 and R4 layers; the corresponding first few layers can be selected according to actual needs. Likewise, the frequency ranges into which the 35 frequency modules corresponding to the MDCT coefficients are divided can be selected according to the actual situation.

The above sorting method is applicable to an embedded multi-rate speech coding algorithm with CELP coding in the lower layers and transform coding in the higher layers. In the above method, 2 kHz is selected as the demarcation point because CELP coding handles low-frequency signals of 0 to 2 kHz well. At the same time, the signal processed by the higher layer is the difference signal between the original input signal and the locally decoded lower-layer signal, so among the signals to be processed by the higher layer, the spectral signals above 2 kHz carry the more important information. Therefore, when the spectrum of the difference signal is encoded, priority should be given to the spectrum above 2 kHz, so that when the number of bits is insufficient at the decoding end, the more important information above 2 kHz is decoded first rather than the less important low-frequency information.

In the above method, the encoding of the CELP portion is the same as that of the CELP portion shown in FIG. 2 and is not repeated here.

In addition, the method of steps 301 to 311 can also be combined with the method in the first embodiment to better achieve the object of the present invention.

Through the method in this embodiment, it is possible to:

1) Switch modes according to the requirements of the actual encoder, choosing between the global index order determining mode and the block index order determining mode.

2) Switch the index sorting mode according to the requirements of the actual encoder, choosing between static-mode sorting and dynamic-mode sorting.

3) Use as a criterion the ratio of (a) the difference between the spectrum of the locally decoded lower-layer signal and the spectrum of the difference between the original input signal and the locally decoded lower-layer signal to (b) the spectrum of the locally decoded lower-layer signal.

4) Determine the order of the lattice code indices in the code stream from a signal-to-noise ratio or a non-zero value that can be obtained at both the encoding end and the decoding end.

The above is only the preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and scope of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for multi-rate speech and audio coding, comprising:

Solving a difference signal between an input signal and the synthesized speech obtained by encoding and then locally decoding the input signal;

Solving an index value corresponding to the grid point nearest to each of the spectrum vectors corresponding to the difference signal;

Solving a first ratio of a spectrum of the perceptual weighting filter corresponding to the difference signal to a spectrum of the synthesized speech;

Writing the index values corresponding to the grid points into the code stream in descending order of the first ratio.

2. The multi-rate speech and audio coding method according to claim 1, wherein solving the index value corresponding to the grid point nearest to each of the spectrum vectors corresponding to the difference signal comprises:

Solving the spectral signal corresponding to the difference signal;

Dividing the spectral signal into spectrum vectors according to the characteristics of the grid points;

Solving the grid point corresponding to each spectrum vector and the index value of the grid point.

3. The multi-rate speech and audio coding method according to claim 2, wherein solving the grid point corresponding to each spectrum vector and the index value of the grid point comprises:

Solving the grid point corresponding to the spectrum vector;

Calculating the index value of the grid point corresponding to the spectrum vector according to the total number of bits that can be utilized; and, if the total number of bits is insufficient, setting to zero the elements of the grid points corresponding to signals that are more likely to be masked.

4. The multi-rate speech and audio coding method according to claim 3, wherein, if the total number of bits is insufficient, setting to zero the elements of the grid points corresponding to signals that are more likely to be masked comprises:

Calculating a second ratio of a spectrum of the difference signal to a spectrum of the synthesized speech;

Dividing the grid points corresponding to the spectrum vectors into at least one part according to critical frequency bands, and, if they are divided into more than one part, sorting the parts according to the importance of the spectrum vectors;

Arranging the grid points in each part in descending order of the second ratio;

When the total number of bits is insufficient, setting to zero the elements and index values of a predetermined number of the last grid points among the rearranged grid points.

5. The method of multi-rate speech and audio coding according to claim 4, wherein sorting the parts according to the importance of the spectrum vectors comprises:

Sorting the parts according to the auditory characteristics of the human ear, the coding characteristics of the first few layers of the encoder used, or the characteristics of the spectral coefficients of the difference signal.

6. The method of multi-rate speech and audio coding according to claim 4 or 5, wherein the formula for calculating the second ratio Ratio[k] is:

$$Ratio[k] = \sum_{i=0}^{7} \frac{Freq\_R2[l] - Freq\_err[l]}{Freq\_R2[l]}$$

wherein Freq_R2 is the spectrum of the synthesized signal of the first few layers obtained by encoding and then locally decoding the input signal, Freq_err is the spectral coefficient of the difference signal, l = 8*k+i, k = 0, 1, 2, ..., 34, and i = 0, 1, 2, ..., 7.

7. The multi-rate speech and audio coding method according to claim 4, wherein, if the total number of bits is insufficient, setting to zero the elements of the grid points corresponding to signals that are more likely to be masked specifically comprises:

Dividing the spectrum vectors into at least one part according to critical frequency bands, and, if they are divided into more than one part, sorting the parts according to the importance of the spectrum vectors;

Arranging the grid points in each part in order of spectral energy from largest to smallest;

When the total number of bits is insufficient, setting to zero the elements and index values of a predetermined number of grid points in each part of the rearranged grid points.

8. The method according to claim 7, wherein, when the total number of bits is insufficient, setting to zero the elements and index values of a predetermined number of grid points among the rearranged grid points comprises:

When the total number of bits is found to be insufficient while a certain part of the grid points is being processed, setting to zero the elements and index values of a predetermined number of grid points in that part.

9. The method of multi-rate speech and audio coding according to claim 7, wherein sorting the parts according to the importance of the spectrum vectors comprises:

Sorting the parts according to the auditory characteristics of the human ear, the coding characteristics of the first few layers of the encoder, or the characteristics of the spectral coefficients of the difference signal.

10. The method of multi-rate speech and audio coding according to claim 1, wherein the formula of the first ratio Rat[k] is:

$$Rat[k] = \frac{\sum_{i=0}^{7} \log\left((Freq\_R2[l])^2 + (Freq\_R2[l+1])^2\right)}{\sum_{i=0}^{7} \log\left((W\_Freq[l])^2 + (W\_Freq[l+1])^2\right)}$$

where Freq_R2 is the spectrum of the synthesized signal of the first two layers obtained by encoding and then locally decoding the input signal, W_Freq is the spectrum of the perceptual weighting filter, l = 8*k+i, k = 0, 1, 2, ..., 34, and i = 0, 1, 2, ..., 7.