CN105957533B - Voice compression method, voice decompression method, audio encoder and audio decoder - Google Patents


Info

Publication number: CN105957533B (application CN201610260757.3A)
Authority: CN (China)
Prior art keywords: bit, frequency domain, bit allocation, quantization, bits
Legal status: Active (an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN105957533A
Inventors: 杨洋 (Yang Yang), 姚嘉 (Yao Jia), 任金平 (Ren Jinping), 高永泽 (Gao Yongze)
Current assignee: Hangzhou Nanosic Technology Co., Ltd.
Original assignee: Hangzhou Nanosic Technology Co., Ltd.
Application filed by Hangzhou Nanosic Technology Co., Ltd.
Priority to CN201610260757.3A
Publication of CN105957533A; application granted; publication of CN105957533B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032 - Quantisation or dequantisation of spectral components
    • G10L19/038 - Vector quantisation, e.g. TwinVQ audio


Abstract

The invention discloses a voice compression method, a voice decompression method, an audio encoder and an audio decoder. An MLT (modulated lapped transform) converts the time domain signal into a frequency domain signal; an RMS (root mean square) weight analysis method refines the quantization levels of the frequency domain signal; and vector quantization, Huffman coding and similar methods compress the quantization parameters (quantization weights and bit allocation numbers) and the frequency domain data respectively, so that the compression ratio is maximized while the spectral characteristics remain approximately lossless.

Description

Voice compression method, voice decompression method, audio encoder and audio decoder
Technical Field
The invention belongs to the field of wireless voice signal compression, and particularly relates to a voice compression method and a decompression method based on the MLT (modulated lapped transform) and vector entropy coding, together with an audio encoder and an audio decoder.
Background
Voice signal compression saves hardware memory space and eases storage and transmission. A wireless digital voice system differs from a common wired audio system in that it transmits voice signals over the air rather than over wired carriers, improving the practical user experience.
A wireless digital audio system based on embedded technology effectively combines embedded, audio codec and wireless transmission technologies, and offers small size, easy portability, highly specialized function, low cost, high stability and good real-time performance. It is, however, limited in bandwidth, delay and power consumption. A compression algorithm applied to wireless voice transmission is therefore required to offer high sound quality, a high compression ratio, low delay and low computational complexity at the same time.
Current frequency domain compression coding, such as the Bluetooth SBC voice algorithm, has relatively low sound quality, and time domain compression algorithms such as ADPCM and G.711 generally have low compression ratios. It is therefore very meaningful to design a codec with a high compression ratio, low delay and low computational complexity for wireless transmission, realizing higher-quality speech coding and decoding, and to apply it in wireless audio systems based on embedded technology.
Voice data compression exploits the redundancy of voice signals and the perceptual characteristics of the human auditory system. The redundancy of voice signals takes two main forms, time domain redundancy and frequency domain redundancy, and the currently known voice compression methods can accordingly be divided into two types by coding mode. The first type is time domain compression, in which the coder compresses by analyzing the correlation of the speech data in the time domain; the second type is frequency domain compression, in which the coder compresses the speech data by analyzing correlations across the frequency domain.
The first type of compression method mainly works by eliminating the time domain redundancy of the voice signal: the difference between the audio data and a predicted value is calculated, the quantization level of an adaptive quantizer is set accordingly, and the prediction for the next sample is updated. Since time domain prediction can hardly raise the subjective sound quality while maintaining a given compression ratio, it is characterized by low delay, low computation, medium sound quality and a low compression ratio. Mainstream time domain prediction methods include ADPCM, G.711 and the like, with compression ratios generally between 2:1 and 4:1.
The second type of compression method mainly works by eliminating the frequency domain redundancy of the voice signal, generally combining a transform domain with a psychoacoustic model. The transform converts the time domain voice data into frequency domain data, and the psychoacoustic model then quantizes the frequency domain signal hierarchically according to the auditory characteristics of the human ear: frequency regions where hearing is highly sensitive are quantized lightly, preserving high precision, while regions where hearing is less sensitive are quantized heavily, retaining less precision. Thanks to the psychoacoustic analysis, transform domain methods can compress the audio data stream maximally while preserving the subjective impression, so they are characterized by high delay, high complexity, high sound quality and a low bit rate. Mainstream transform domain methods include subband coding implemented with cosine modulated filter banks, such as SBC (medium sound quality, with a compression ratio of only about 5:1), and coding based on the modified discrete cosine transform (MDCT), such as CELT, SPEEX and the like (high sound quality, but requiring 50 ms to 100 ms of delay).
Because a voice code stream for wireless voice transmission requires high sound quality, a high compression ratio, low delay and low computational complexity, the mainstream time domain predictive coding of the first type cannot meet the requirements owing to its low compression ratio and sound quality, and the mainstream transform domain coding of the second type cannot meet the requirements of wireless transmission owing to its high delay and heavy computation.
Disclosure of Invention
In view of the problems in the prior art, an object of the present invention is to provide a speech compression method based on the MLT and vector entropy coding, which can simultaneously and effectively satisfy the wireless voice transmission requirements of high sound quality, low delay, a high compression ratio and low computational complexity. Another object of the present invention is to provide a corresponding speech decompression method based on the MLT and vector entropy coding.
In order to achieve the above object, the speech compression method based on MLT transform and vector entropy coding of the present invention specifically comprises:
1) MLT frequency domain transformation: converting a time domain digital voice signal collected by a digital microphone into a frequency domain spectral coefficient;
2) RMS quantization weight calculation: the frequency domain spectral coefficients are grouped, the root mean square (RMS) of each group of the signal is calculated, and the weight of each frequency domain component group is derived from its RMS;
3) optimal group bit allocation: the optimal group bits are obtained according to the frequency domain component weights of the grouped signal and the set bit rate parameter;
4) carrying out vector quantization on the grouped frequency domain voice signals to generate grouped vector quantization coefficients;
5) Huffman coding is performed on the grouped vector quantization coefficients to complete the data compression.
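Step 5) can be sketched with the classic heap-based Huffman construction over the grouped quantization indices; the index alphabet and frequencies below are illustrative, not the patent's actual code tables.

```python
import heapq
from collections import Counter

def build_huffman_code(symbols):
    """Build a prefix-free Huffman code for a sequence of quantization indices."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol alphabet
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, tiebreak, {symbol: partial codeword})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)      # two least-frequent subtrees
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (f0 + f1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def huffman_encode(symbols, code):
    return "".join(code[s] for s in symbols)

def huffman_decode(bits, code):
    rev = {w: s for s, w in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in rev:
            out.append(rev[cur])
            cur = ""
    return out
```

Because the resulting code is prefix-free, the decoder can recover the index sequence from the bit stream without any separators.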
Further, step 1) adopts the modulated lapped transform: the PCM time domain audio data of each short time frame is converted into MLT frequency domain spectral coefficients through the MLT, and the MLT frequency domain spectral coefficients are grouped according to frequency domain correlation.
Further, the PCM time domain audio data first undergoes 50% data overlap processing, then anti-aliasing filtering to prevent spectral aliasing, and then a DCT-IV transform that converts the time domain data into frequency domain spectral coefficients.
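The 50%-overlap, window-and-DCT-IV structure just described is, in effect, an MDCT. Since the patent's own equation images are not reproduced here, the direct O(N²) form below is only a sketch of one common MLT convention (sine window, sqrt(2/N) scaling); the exact phase and scaling are assumptions, not the patent's formula.

```python
import math

def mlt(x):
    """Direct modulated lapped transform of one 2N-sample block
    (sine analysis window folded into the modulation, DCT-IV phase)."""
    N = len(x) // 2
    return [
        math.sqrt(2.0 / N) * sum(
            math.sin(math.pi * (n + 0.5) / (2 * N))
            * math.cos(math.pi / N * (n + 0.5 + N / 2.0) * (m + 0.5))
            * x[n]
            for n in range(2 * N))
        for m in range(N)]

def imlt(X):
    """Inverse MLT: N coefficients back to a windowed 2N-sample block."""
    N = len(X)
    return [
        math.sqrt(2.0 / N) * math.sin(math.pi * (n + 0.5) / (2 * N)) * sum(
            X[m] * math.cos(math.pi / N * (n + 0.5 + N / 2.0) * (m + 0.5))
            for m in range(N))
        for n in range(2 * N)]
```

With 50% overlap, adding the second half of one inverse-transformed block to the first half of the next reconstructs the middle frame exactly (time domain aliasing cancellation), which is the "perfect signal reconstruction" property the description relies on.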
Further, the formula of the MLT frequency domain transform is as follows:
[equation images not reproduced]
Further, in step 2), the quantization weights are calculated from the frequency domain spectral coefficients after time-frequency conversion via the root mean square (RMS); the RMS calculation formula is as follows:
[equation image not reproduced]
calculate the quantization weight value for each set of RMS values:
[equation image not reproduced]
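With the equation images missing, one plausible reading of step 2) is sketched below: the RMS of each group of coefficients is its root mean square, and the quantization weight index is a logarithmic quantization of that RMS. The log base and rounding here are assumptions, not the patent's formula.

```python
import math

def region_rms(coeffs):
    """Root mean square of one group of MLT frequency-domain coefficients."""
    return math.sqrt(sum(c * c for c in coeffs) / len(coeffs))

def rms_index(coeffs):
    """Quantization-weight index: log2-quantized RMS (base and rounding assumed)."""
    r = region_rms(coeffs)
    return int(round(math.log2(r))) if r > 0 else 0
```

Representing the weight on a logarithmic scale gives more usable quantization levels than an absolute-value weight, which matches the precision argument made for RMS weights later in the description.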
further, in the step 3), the optimal grouping bit calculation method includes: and calculating the maximum bit and the minimum bit according to the quantization weight, and optimizing the grouping bits according to the bit rate parameters to ensure that the optimized bits meet the requirements of each grouping spectral coefficient under the bit limit.
Further, the bit allocation coefficient of each group is calculated from the quantization weight values:
category(r)=MAX{0,MIN{7,(offset-rms_index(r)/2)}};
(0≤r≤number_of_regions;-32≤offset≤31);
calculating the bit number required by the prediction quantization according to the bit distribution parameter:
[equation image not reproduced]
then, the number of available bits is calculated from the set bit rate parameter:
estimated_number_of_available_bits=320+((number_of_available_bits-320)*5/8);
and the bit allocation parameters of each group are adjusted so that each group's available bits are maximized within the range of the available bit number, determining the optimal group bits.
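The offset/category mechanism above can be sketched as a monotonic search: lowering the offset lowers the categories and spends more bits, so the finest offset whose predicted bit cost fits the budget is the optimum. The bits-per-category table below is a made-up placeholder (the patent gives no table); only its decreasing shape matters.

```python
# Hypothetical cost table: bits to code one group at category 0..7.
# A decreasing, invented sequence; category 7 spends no bits on the group.
BITS_PER_CATEGORY = [104, 92, 80, 68, 56, 44, 32, 0]

def categories(rms_index, offset):
    # category(r) = MAX{0, MIN{7, offset - rms_index(r)/2}}
    return [max(0, min(7, offset - ri // 2)) for ri in rms_index]

def allocate_bits(rms_index, available_bits):
    """Scan offsets from fine to coarse; return the finest allocation that fits."""
    for offset in range(-32, 32):   # smaller offset => lower categories => more bits
        cats = categories(rms_index, offset)
        cost = sum(BITS_PER_CATEGORY[c] for c in cats)
        if cost <= available_bits:
            return offset, cats, cost
    raise ValueError("budget too small even at the coarsest setting")
```

Because the predicted cost is non-increasing in the offset, the first offset that fits in the scan is the optimal one, which is why a simple linear search suffices in this sketch.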
Further, the processing procedures of the step 4) and the step 5) are as follows:
A) the frequency domain spectral coefficients are divided into sign bits and magnitudes, and the normalization index of each group of magnitudes is calculated:
k(i)=MIN{(x*magnitude_of(mlt(20r+i))+deadzone_rounding),kmax}
(0<i<20; x=1/(stepsize*magnitude_of_rms(r)));
B) the normalized indexes are grouped into a vector group bit stream:
[equation images not reproduced]
C) Huffman coding is performed on each group of vectors and sign bit groups to form the compressed bit stream.
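Steps A) and B) amount to a sign/magnitude deadzone quantizer per 20-coefficient group. The sketch below follows the k(i) formula above; the stepsize, kmax and deadzone_rounding values are illustrative placeholders, not the patent's tables.

```python
def quantize_group(mlt_group, rms, stepsize=1.0, kmax=13, deadzone_rounding=0.4):
    """k(i) = MIN{ x*|mlt(20r+i)| + deadzone_rounding, kmax },  x = 1/(stepsize*rms).

    Returns the sign bits and the clamped magnitude indices for one group.
    """
    x = 1.0 / (stepsize * rms)
    signs = [1 if c < 0 else 0 for c in mlt_group]                # sign bits
    k = [min(int(x * abs(c) + deadzone_rounding), kmax) for c in mlt_group]
    return signs, k
```

Note how a larger group RMS shrinks x, so loud groups are quantized with a coarser effective step; this is how the RMS weight steers precision toward each group.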
A speech decompression method based on the MLT and vector entropy coding, matching the above speech compression method, uses inverse vector quantization and the inverse MLT to decompress the compressed data, and specifically comprises the following steps:
1) parsing the compressed bit stream and performing Huffman decoding on it to obtain the vector groups and sign bit groups;
2) performing an inverse normalization operation on the vector groups to obtain the magnitudes of the frequency domain spectral coefficients, which are combined with the corresponding sign bits to recover the frequency domain spectral coefficients;
3) performing the inverse modulated lapped transform (IMLT) on the frequency domain spectral coefficients to obtain the time domain voice data and complete decoding.
Further, in step 1), the encoded and compressed code stream data is parsed to obtain the time domain PCM stream information: sampling rate, bit rate and frame length.
Further, the inverse normalization operation formula in step 2) is as follows:
[equation images not reproduced]
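Since the inverse-normalization equation images are not reproduced, the sketch below simply inverts the forward normalization k(i) = |mlt|/(stepsize*rms) + rounding by scaling the index back and reapplying the sign bit. Reconstructing at k*stepsize*rms rather than at a table centroid is an assumption of this sketch.

```python
def dequantize_group(signs, k, rms, stepsize=1.0):
    """Inverse normalization for one group: |mlt| ~= k * stepsize * rms,
    with the sign restored from the decoded sign bit."""
    scale = stepsize * rms
    return [(-1 if s else 1) * ki * scale for s, ki in zip(signs, k)]
```

With a deadzone rounding below 1, the reconstruction error of each coefficient stays within one step (stepsize * rms), so groups with small RMS are recovered more precisely, as the hierarchical quantization intends.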
Further, the IMLT transformation formula in step 3) is as follows:
[equation images not reproduced]
An audio encoder implementing the above voice compression method comprises an MLT frequency domain transformer, an RMS quantization weight calculator, an optimal group bit allocator and a Huffman encoder. The MLT transformer converts the time domain signal into a frequency domain signal, the RMS quantization weight calculator refines the quantization levels of the frequency domain signal, and the optimal group bit allocator and the Huffman encoder compress the quantization parameters and the frequency domain data respectively, so that the voice data compression ratio is maximized while the spectral characteristics remain approximately lossless.
An audio decoder implementing the above speech decompression method comprises a code stream analyzer, a Huffman decoder, an inverse vector quantizer, and an inverse MLT transform filter, wherein:
the code stream analyzer reads and parses the encoded and compressed code stream data, obtaining time domain PCM stream information such as the sampling rate, bit rate and frame length;
the Huffman decoder decodes the RMS weights, the bit allocation parameters and the quantized MLT frequency domain spectral vectors;
the inverse vector quantizer dequantizes the quantized MLT frequency domain spectral vectors using the RMS weights and the bit allocation parameters, obtaining the MLT frequency domain spectral coefficients;
the inverse MLT transform filter performs inverse MLT transform filtering on the MLT frequency domain spectral coefficients, obtaining time domain PCM data;
and the PCM data is assembled under control of the PCM stream information parsed from the code stream, reconstructing and integrating the PCM voice stream.
The invention has the following beneficial effects: it achieves a high compression ratio, low delay and medium computational complexity while ensuring high sound quality of the voice data, making it well suited to wireless voice applications.
Drawings
FIG. 1 is a compression flow diagram;
FIG. 2 is a decompression flow diagram;
FIG. 3 is a schematic diagram of an MLT transform;
FIG. 4 is a flow chart of optimal bit allocation;
FIG. 5 is a diagram of raw PCM waveform data in the time domain;
FIG. 6 is a graph of raw PCM waveform data spectrum data;
FIG. 7 is a time domain data plot of PCM waveform data after MLT transform;
FIG. 8 is a diagram of the data spectrum of the PCM waveform data after MLT transformation.
Detailed Description
The present invention now will be described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The invention relates to a voice compression method based on MLT transformation and vector entropy coding, which specifically comprises the following steps:
(1) An MLT (modulated lapped transform) frequency domain transformer. The MLT is a frequency domain transform that processes short time frames of time domain data independently, adopts 50% frame overlap so that the spectrum at frame boundaries is not distorted, and has the properties of linearity, perfect signal reconstruction and the like. The MLT transform formula is as follows:
[equation image not reproduced]
(2) An RMS quantization weight calculator. The calculator computes the root mean square (RMS) of each group of frequency domain spectral coefficients to represent the quantization weight; compared with weights represented by absolute values, RMS values provide more quantization levels and higher quantization precision. The RMS calculation formula is as follows:
[equation image not reproduced]
The quantization weight value is calculated for each group of RMS values:
[equation image not reproduced]
(3) An optimal group bit allocator, which calculates the bit allocation coefficient of each group from the quantization weight values:
category(r)=MAX{0,MIN{7,(offset-rms_index(r)/2)}},
(0≤r≤number_of_regions;-32≤offset≤31);
calculating the bit number required by the prediction quantization according to the bit distribution parameter:
[equation image not reproduced]
then, the number of available bits is calculated from the set bit rate parameter:
estimated_number_of_available_bits=320+((number_of_available_bits-320)*5/8),
the bit allocation parameters of each group are adjusted so that each group's available bits reach the maximum within the range of the available bit number, determining the optimal group bits;
(4) vector quantization of the frequency domain spectral coefficients to generate grouped vector quantization coefficients:
the frequency domain spectral coefficients are divided into sign bits and magnitudes, and the normalization index of each group of magnitudes is calculated:
k(i)=MIN{(x*magnitude_of(mlt(20r+i))+deadzone_rounding),kmax},
(0<i<20; x=1/(stepsize*magnitude_of_rms(r))),
the normalized indexes are grouped into a vector group bit stream:
[equation image not reproduced]
(5) Huffman coding is performed on each group of vectors and sign bit groups to form the compressed bit stream.
A speech decompression method based on the MLT and vector entropy coding, matching the above speech compression method, uses inverse vector quantization and the inverse MLT to decompress the compressed data, and specifically comprises the following steps:
(1) decoding and parsing the compressed code stream with a Huffman decoder to obtain the quantized MLT frequency domain spectral coefficient data;
(2) performing inverse quantization on the quantized MLT frequency domain spectral coefficient data with an inverse vector quantizer: an inverse normalization operation on the vector groups yields the magnitudes of the frequency domain spectral coefficients, which are combined with the corresponding sign bits to recover the frequency domain spectral coefficients;
[equation images not reproduced]
(3) performing the IMLT (inverse modulated lapped transform) on the frequency domain spectral coefficients to obtain the time domain voice data and complete decoding; the IMLT transformation formula is as follows:
[equation images not reproduced]
An audio encoder implementing the above voice compression method comprises an MLT frequency domain transformer, an RMS quantization weight calculator, an optimal group bit allocator and a Huffman encoder. The MLT transformer converts the time domain signal into a frequency domain signal, the RMS quantization weight calculator refines the quantization levels of the frequency domain signal, and the optimal group bit allocator and the Huffman encoder compress the quantization parameters and the frequency domain data respectively, so that the voice data compression ratio is maximized while the spectral characteristics remain approximately lossless.
An audio decoder implementing the above speech decompression method comprises a code stream analyzer, a Huffman decoder, an inverse vector quantizer, and an inverse MLT transform filter, wherein:
the code stream analyzer reads and parses the encoded and compressed code stream data, obtaining time domain PCM stream information such as the sampling rate, bit rate and frame length;
the Huffman decoder decodes the RMS weights, the bit allocation parameters and the quantized MLT frequency domain spectral vectors;
the inverse vector quantizer dequantizes the quantized MLT frequency domain spectral vectors using the RMS weights and the bit allocation parameters, obtaining the MLT frequency domain spectral coefficients;
the inverse MLT transform filter performs inverse MLT transform filtering on the MLT frequency domain spectral coefficients, obtaining time domain PCM data;
and the PCM data is assembled under control of the PCM stream information parsed from the code stream, reconstructing and integrating the PCM voice stream.
In the invention, the embodiment of the compression part is as shown in figure 1:
(1) Voice data is sampled with a digital microphone to acquire raw PCM digital voice data, which is divided into short time frames of 5 ms (80 samples), 10 ms (160 samples) or 20 ms (320 samples), and information such as the bit rate and sampling rate of the PCM configuration is written into the code stream.
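The frame sizes in step (1) imply a 16 kHz sampling rate (80 samples in 5 ms); the arithmetic can be checked as follows.

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of PCM samples in one short time frame."""
    return sample_rate_hz * frame_ms // 1000

# 16 kHz yields the 80/160/320-sample frames named in the embodiment.
FRAME_SIZES = {ms: samples_per_frame(16000, ms) for ms in (5, 10, 20)}
```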
(2) The time domain PCM data of each short time frame is converted into MLT frequency domain spectral coefficients by the MLT, and the coefficients are grouped by frequency domain correlation into 20 groups of MLT frequency domain spectral vectors.
(3) The RMS weight calculator computes the RMS of the grouped MLT frequency domain spectral vectors to obtain the quantization weight of each group of frequency domain spectral vectors; the quantization weights are written directly into the code stream.
(4) The optimal bit allocator uses the quantization weights (RMS) of the grouped frequency domain spectral coefficients to perform bit allocation for each grouped MLT frequency domain spectral vector, yielding the optimal bit allocation numbers, which are also written directly into the code stream.
(5) The vector quantizer set quantizes the spectral coefficients using the quantization weights and the optimal bit allocation, performing vector quantization on the grouped MLT frequency domain spectral vectors.
(6) The Huffman encoder encodes the quantization weights, the bit allocation parameters and the quantized grouped MLT frequency domain spectral vectors, producing the final compressed code stream.
In the present invention, the specific implementation of the decoding part is as shown in fig. 2:
(1) The code stream analyzer parses the encoded and compressed code stream data to obtain time domain PCM stream information such as the sampling rate, bit rate and frame length;
(2) the Huffman decoder decodes the RMS weights, the bit allocation parameters and the quantized MLT frequency domain spectral vectors;
(3) the inverse vector quantizer dequantizes the quantized MLT frequency domain spectral vectors using the RMS weights and the bit allocation parameters, obtaining the MLT frequency domain spectral coefficients;
(4) the inverse MLT transform filter performs inverse MLT transform filtering on the MLT frequency domain spectral coefficients to obtain time domain PCM data;
(5) the PCM data is assembled under control of the PCM stream information parsed from the code stream, reconstructing and integrating the PCM voice stream.
As shown in fig. 3, a schematic diagram of the MLT, the PCM time domain audio data first undergoes 50% data overlap processing, then anti-aliasing filtering to prevent spectral aliasing, and finally a DCT-IV transform that converts the time domain data into frequency domain spectral coefficients. As shown in figs. 5, 6, 7 and 8, comparing the PCM data before and after the MLT shows that the transform preserves both the time domain and the frequency domain information losslessly.
As shown in fig. 4, the optimal bit allocation process allocates bits per group of frequency domain spectral coefficients:
(1) first, the RMS quantization weight information of the group of frequency domain spectral coefficients is analyzed, the bit allocation parameters are set, and the bit allocation is calculated;
(2) then, the number of bits consumed by the predicted allocation is calculated from the bit allocation result, and it is checked whether the current predicted bit allocation number meets the limits imposed by the preset signal-to-noise ratio and the number of remaining bits. If not, the bit allocation parameters are reset and allocation is repeated; if so, bit allocation proceeds to the next group of frequency domain spectral coefficients, and the number of remaining bits is updated for the next group's allocation operation.
In this embodiment the psychoacoustic model, the bit allocation and the quantization mode are optimized to reduce the computational complexity of the psychoacoustic model: verified frequency domain hearing thresholds and masking thresholds are applied directly to analyze the subband data. Because the bit allocation unit adopts a symmetric quantization scheme, the bit allocation result is not transmitted to the decoding end in the code stream; instead, the decoding end computes the bit allocation numbers from the quantization factors through the same bit allocation mechanism, freeing a large part of the code stream for quantized audio data. By setting a code stream length adjustment parameter, the bit allocation numbers can further be adjusted at any time according to the wireless transmission environment.
As described above, for the characteristics of wireless voice transmission applications, the invention uses the perfectly reconstructing MLT for time-to-frequency conversion, ensuring high voice quality; the MLT transform length can be modified directly according to the system's delay requirement, ensuring low delay; optimal bit allocation keeps the compression ratio as high as possible without affecting voice quality; and finally Huffman coding further compresses the quantized data.

Claims (9)

1. A method of speech compression, the method comprising:
1) MLT frequency domain transformation: converting a time domain digital voice signal collected by a digital microphone into a frequency domain spectral coefficient;
2) RMS quantization weight calculation: the frequency domain spectral coefficients are grouped, the root mean square (RMS) of each group of the signal is calculated, and the weight of each frequency domain component group is derived from its RMS;
3) optimal group bit allocation: the optimal group bits are obtained according to the frequency domain component weights of the grouped signal and the set bit rate parameter;
4) carrying out vector quantization on the grouped frequency domain voice signals to generate grouped vector quantization coefficients;
5) performing Huffman coding on the grouped vector quantization coefficients to complete data compression;
wherein the bit allocation unit adopts a symmetric quantization scheme: from the result of bit allocation, the decoding end calculates the bit allocation numbers from the quantization factors through the same bit allocation mechanism, and a code stream length adjustment parameter is set so that the bit allocation numbers can be adjusted at any time according to the wireless transmission environment;
the method comprises the following steps of carrying out bit allocation according to a unit of a grouped frequency domain spectral coefficient:
(1) firstly, analyzing RMS quantization weight information of the group of frequency domain spectral coefficients, setting bit distribution parameters, and carrying out bit distribution calculation;
(2) then, the number of bits consumed by the predicted allocation is calculated from the bit allocation result, and it is checked whether the current predicted bit allocation number meets the limits imposed by the preset signal-to-noise ratio and the number of remaining bits; if not, the bit allocation parameters are reset and allocation is performed again; if so, bit allocation calculation proceeds to the next group of frequency domain spectral coefficients; at the same time the number of remaining bits is updated for the next group's allocation operation;
in the step 3), the optimal grouping bit calculation method comprises the following steps: calculating a maximum bit and a minimum bit according to the quantization weight, and optimizing grouping bits according to the bit rate parameters to ensure that the optimized bits meet the requirements of each grouping spectral coefficient under the bit limit; and calculating the distribution coefficient of each group of bits according to the quantization weight value:
category(r)=MAX{0,MIN{7,(offset-rms_index(r)/2)}};
0≤r≤number_of_regions;-32≤offset≤31;
calculating the bit number required by the prediction quantization according to the bit distribution parameter:
[equation image not reproduced]
then, the number of available bits is calculated from the set bit rate parameter:
estimated_number_of_available_bits=320+((number_of_available_bits-320)*5/8);
the bit allocation parameters of each group are then adjusted to maximize each group's available bits within the available-bit range, and the optimal grouping bits are determined.
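The available-bits rule grants a fixed 320-bit base plus five-eighths of the surplus beyond it (assuming the separator printed in the formula is a minus sign); numerically:

```python
def estimated_available_bits(number_of_available_bits):
    # 320-bit base plus 5/8 of the bits beyond 320
    return 320 + (number_of_available_bits - 320) * 5 // 8
```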
2. The speech compression method as recited in claim 1, wherein in step 1) the PCM time domain audio data of the short time frame is converted into MLT frequency domain spectral coefficients by the modulated lapped transform (MLT), and the MLT frequency domain spectral coefficients are grouped by frequency domain correlation; the PCM time domain audio data first undergoes 50% overlap between consecutive frames, then anti-aliasing filtering to prevent spectral aliasing, and finally a DCT-IV transform that converts the time domain data into frequency domain spectral coefficients.
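The frame chain of claim 2 (50% overlap, anti-aliasing window, DCT-IV-style modulated transform) can be sketched numerically; the sine window and the √(2/N) scaling are generic MLT/MDCT conventions assumed here, not the patent's exact coefficients:

```python
import numpy as np

def mlt_frame(prev_half, cur_half):
    """One MLT frame: overlap the previous frame's last N samples with the
    current N samples, window, and apply the modulated cosine transform
    that maps 2N time samples to N spectral coefficients."""
    N = len(cur_half)
    x = np.concatenate([prev_half, cur_half]).astype(float)  # 2N samples
    n = np.arange(2 * N)
    window = np.sin(np.pi * (n + 0.5) / (2 * N))             # sine window
    xw = x * window
    m = np.arange(N)
    # MDCT kernel: cos(pi/N * (n + 0.5 + N/2) * (m + 0.5))
    basis = np.cos(np.pi / N * np.outer(n + 0.5 + N / 2, m + 0.5))
    return np.sqrt(2.0 / N) * (xw @ basis)                   # N coefficients
```

In the codec, N would be one of the frame sizes listed in claim 3 (80, 160 or 320 samples).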
3. The speech compression method of claim 1, wherein the MLT frequency-domain transform is formulated as follows:
Figure FDA0002659038940000022
0 ≤ m < N, 0 ≤ n < 2N, N ∈ {80, 160, 320};
in the step 2), the quantization weights of the frequency domain spectral coefficients after time-frequency conversion are calculated by the root mean square (RMS); the RMS formula is as follows:
Figure FDA0002659038940000023
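The RMS weight of one group is simply the root mean square of its spectral coefficients; as a sanity check:

```python
import math

def region_rms(coeffs):
    # root-mean-square quantization weight of one group of spectral coefficients
    return math.sqrt(sum(c * c for c in coeffs) / len(coeffs))
```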
4. The speech compression method according to claim 1, wherein the processing of the step 4) and the step 5) is as follows:
A) the frequency domain spectral coefficients are divided into sign bits and magnitudes, and the normalization index of each group of magnitudes is calculated:
k(i) = MIN{(x * magnitude_of(mlt(20r+i)) + deadzone_rounding), kmax}
0 < i < 20; x = 1/(stepsize * magnitude_of_rms(r));
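Step A) amounts to a scalar dead-zone quantizer; a sketch with placeholder constants (in the codec, stepsize, deadzone_rounding and kmax depend on the group's category):

```python
def quant_index(mlt_coeff, rms, stepsize=1.0, deadzone_rounding=0.4, kmax=13):
    """Normalization index: scale the coefficient magnitude by
    x = 1/(stepsize * rms), add dead-zone rounding, truncate, clamp to kmax."""
    x = 1.0 / (stepsize * rms)
    return min(int(x * abs(mlt_coeff) + deadzone_rounding), kmax)
```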
B) the normalization indexes are packed into a vector-group bit stream:
Figure DEST_PATH_FDA0002643982680000031
j = index of the jth value of k(); vd = vector dimension;
C) Huffman coding is performed on each vector group and sign bit group to form the compressed bit stream.
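For step C), a minimal Huffman-code builder illustrates the principle; the codec itself would use fixed, pre-computed code tables per category rather than building trees per frame:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix-free code: repeatedly merge the two least frequent
    subtrees, prefixing their codewords with 0 and 1."""
    freq = Counter(symbols)
    if len(freq) == 1:                          # degenerate one-symbol case
        return {next(iter(freq)): "0"}
    # heap entries: (count, tiebreaker, {symbol: codeword-so-far})
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

The most frequent symbol always receives the shortest codeword, which is what compresses the heavily-skewed index distributions produced by the dead-zone quantizer.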
5. A speech decompression method, characterized in that inverse vector quantization and an inverse MLT transform are adopted to decompress the compressed data, specifically comprising the following steps:
1) the compressed bit stream is parsed and Huffman-decoded to obtain the vector groups and sign bit groups;
2) an inverse normalization operation is performed on the vector groups to obtain the magnitudes of the frequency domain spectral coefficients, which together with the corresponding sign bits yield the frequency domain spectral coefficients;
3) an inverse modulated lapped transform (IMLT) is performed on the frequency domain spectral coefficients to obtain the time domain voice data and complete the decoding;
the bit allocation unit adopts a symmetric quantization scheme: the decoding end calculates the number of allocated bits through the same bit allocation mechanism, from the quantization factors that carry the bit allocation result, and a code stream length adjustment parameter is set so that the number of allocated bits can be adjusted at any time according to the wireless transmission environment;
bit allocation is performed with a group of frequency domain spectral coefficients as the unit, in the following steps:
(1) first, the RMS quantization weight information of the group of frequency domain spectral coefficients is analyzed, the bit allocation parameters are set, and the bit allocation is calculated;
(2) then, the number of bits consumed by the predicted allocation is calculated from the bit allocation result, and it is checked whether the current predicted allocation satisfies the limits imposed by the preset signal-to-noise ratio and the number of remaining bits; if not, the bit allocation parameters are reset and the allocation is performed again; if so, the bit allocation calculation proceeds to the next group of frequency domain spectral coefficients, and the number of remaining bits is updated for the next group's allocation;
the optimal grouping bit calculation is as follows: the maximum and minimum bits are calculated from the quantization weights, and the grouping bits are optimized according to the bit rate parameters so that the optimized bits satisfy each group of spectral coefficients under the bit limit; the allocation coefficient of each group of bits is calculated from the quantization weights:
category(r) = MAX{0, MIN{7, offset - rms_index(r)/2}};
0 ≤ r ≤ number_of_regions; -32 ≤ offset ≤ 31;
the number of bits required by the predicted quantization is calculated from the bit allocation parameters:
Figure FDA0002659038940000041
then, the number of available bits is calculated from the set bit rate parameter:
estimated_number_of_available_bits = 320 + ((number_of_available_bits - 320) * 5/8);
the bit allocation parameters of each group are then adjusted to maximize each group's available bits within the available-bit range, and the optimal grouping bits are determined.
6. The speech decompression method according to claim 5, wherein in the step 1) the coded and compressed code stream data is parsed to obtain the time domain PCM stream information of sampling rate, bit rate and time division frame length; the inverse normalization formula in the step 2) is as follows:
Figure FDA0002659038940000042
Figure FDA0002659038940000043
where ⌊z⌋ denotes the greatest integer less than or equal to z,
i = (n+1)vd - j - 1; 0 ≤ j ≤ vd-1; 0 ≤ n ≤ vpr-1.
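Assuming the customary base-(kmax+1) packing of vd quantization indices into one vector index n (consistent with the floor expression above, though the exact radix is an assumption here), the unpacking side of the inverse normalization is:

```python
def decode_vector_index(n, vd, kmax):
    """Unpack a vector index n into its vd per-coefficient quantization
    indices, most significant digit first (base kmax+1)."""
    base = kmax + 1
    return [(n // base ** (vd - j - 1)) % base for j in range(vd)]
```

For example, with vd = 2 and kmax = 13, the indices (5, 7) pack to n = 5·14 + 7 = 77, and decoding 77 recovers them.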
7. The speech decompression method according to claim 5, wherein the IMLT transform formula in the step 3) is as follows:
Figure FDA0002659038940000051
Figure FDA0002659038940000052
Figure FDA0002659038940000053
wherein
Figure FDA0002659038940000054
8. An audio encoder, characterized by comprising an MLT frequency domain transformer, an RMS quantization weight calculator, an optimal grouping bit allocator and a Huffman encoder, wherein the time domain signal is converted into a frequency domain signal by the MLT transformer, the RMS quantization weight calculator refines the quantization levels of the frequency domain signal, and the optimal grouping bit allocator and the Huffman encoder compress the quantization parameters and the frequency domain data respectively, maximizing the compression ratio of the voice data while keeping the spectral characteristics approximately lossless; the bit allocation unit adopts a symmetric quantization scheme: the decoding end calculates the number of allocated bits through the same bit allocation mechanism, from the quantization factors that carry the bit allocation result, and a code stream length adjustment parameter is set so that the number of allocated bits can be adjusted at any time according to the wireless transmission environment;
bit allocation is performed with a group of frequency domain spectral coefficients as the unit, in the following steps:
(1) first, the RMS quantization weight information of the group of frequency domain spectral coefficients is analyzed, the bit allocation parameters are set, and the bit allocation is calculated;
(2) then, the number of bits consumed by the predicted allocation is calculated from the bit allocation result, and it is checked whether the current predicted allocation satisfies the limits imposed by the preset signal-to-noise ratio and the number of remaining bits; if not, the bit allocation parameters are reset and the allocation is performed again; if so, the bit allocation calculation proceeds to the next group of frequency domain spectral coefficients, and the number of remaining bits is updated for the next group's allocation;
the optimal grouping bit calculation is as follows: the maximum and minimum bits are calculated from the quantization weights, and the grouping bits are optimized according to the bit rate parameters so that the optimized bits satisfy each group of spectral coefficients under the bit limit; the allocation coefficient of each group of bits is calculated from the quantization weights:
category(r) = MAX{0, MIN{7, offset - rms_index(r)/2}};
0 ≤ r ≤ number_of_regions; -32 ≤ offset ≤ 31;
the number of bits required by the predicted quantization is calculated from the bit allocation parameters:
Figure FDA0002659038940000061
then, the number of available bits is calculated from the set bit rate parameter:
estimated_number_of_available_bits = 320 + ((number_of_available_bits - 320) * 5/8);
the bit allocation parameters of each group are then adjusted to maximize each group's available bits within the available-bit range, and the optimal grouping bits are determined.
9. An audio decoder, comprising a code stream analyzer, a Huffman decoder, an inverse vector quantizer and an inverse MLT transform filter, wherein:
the code stream analyzer reads and parses the coded and compressed code stream data, obtaining time domain PCM stream information such as the sampling rate, bit rate and time division frame length;
the Huffman decoder decodes and obtains the RMS weights, the bit allocation parameters and the quantized MLT frequency domain spectral vectors;
the inverse vector quantizer performs an inverse quantization operation on the quantized MLT frequency domain spectral vectors using the RMS weights and the bit allocation parameters to obtain the MLT frequency domain spectral coefficients;
the inverse MLT transform filter performs inverse MLT transform filtering on the MLT frequency domain spectral coefficients to obtain the time domain PCM data;
the PCM data is controlled by the PCM stream information parsed from the code stream, and the PCM voice code stream is reconstructed and integrated;
the bit allocation unit adopts a symmetric quantization scheme: the decoding end calculates the number of allocated bits through the same bit allocation mechanism, from the quantization factors that carry the bit allocation result, and a code stream length adjustment parameter is set so that the number of allocated bits can be adjusted at any time according to the wireless transmission environment;
bit allocation is performed with a group of frequency domain spectral coefficients as the unit, in the following steps:
(1) first, the RMS quantization weight information of the group of frequency domain spectral coefficients is analyzed, the bit allocation parameters are set, and the bit allocation is calculated;
(2) then, the number of bits consumed by the predicted allocation is calculated from the bit allocation result, and it is checked whether the current predicted allocation satisfies the limits imposed by the preset signal-to-noise ratio and the number of remaining bits; if not, the bit allocation parameters are reset and the allocation is performed again; if so, the bit allocation calculation proceeds to the next group of frequency domain spectral coefficients, and the number of remaining bits is updated for the next group's allocation;
the optimal grouping bit calculation is as follows: the maximum and minimum bits are calculated from the quantization weights, and the grouping bits are optimized according to the bit rate parameters so that the optimized bits satisfy each group of spectral coefficients under the bit limit; the allocation coefficient of each group of bits is calculated from the quantization weights:
category(r) = MAX{0, MIN{7, offset - rms_index(r)/2}};
0 ≤ r ≤ number_of_regions; -32 ≤ offset ≤ 31;
the number of bits required by the predicted quantization is calculated from the bit allocation parameters:
Figure FDA0002659038940000071
then, the number of available bits is calculated from the set bit rate parameter:
estimated_number_of_available_bits = 320 + ((number_of_available_bits - 320) * 5/8);
the bit allocation parameters of each group are then adjusted to maximize each group's available bits within the available-bit range, and the optimal grouping bits are determined.
CN201610260757.3A 2016-04-22 2016-04-22 Voice compression method, voice decompression method, audio encoder and audio decoder Active CN105957533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610260757.3A CN105957533B (en) 2016-04-22 2016-04-22 Voice compression method, voice decompression method, audio encoder and audio decoder

Publications (2)

Publication Number Publication Date
CN105957533A CN105957533A (en) 2016-09-21
CN105957533B true CN105957533B (en) 2020-11-10

Family

ID=56915027

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583056A (en) * 2018-11-16 2019-04-05 中国科学院信息工程研究所 A kind of network-combination yarn tool performance appraisal procedure and system based on emulation platform
CN111402907B (en) * 2020-03-13 2023-04-18 大连理工大学 G.722.1-based multi-description speech coding method
CN113612672A (en) * 2021-08-04 2021-11-05 杭州微纳科技股份有限公司 Asynchronous single-wire audio transmission circuit and audio transmission method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0684705A2 (en) * 1994-05-06 1995-11-29 Nippon Telegraph And Telephone Corporation Multichannel signal coding using weighted vector quantization
CN101165778A (en) * 2006-10-18 2008-04-23 宝利通公司 Dual-transform coding of audio signals
CN101206860A (en) * 2006-12-20 2008-06-25 华为技术有限公司 Method and apparatus for encoding and decoding layered audio
CN101572087A (en) * 2008-04-30 2009-11-04 北京工业大学 Method and device for encoding and decoding embedded voice or voice-frequency signal
CN101572586A (en) * 2008-04-30 2009-11-04 北京工业大学 Method, device and system for encoding and decoding
CN102081926A (en) * 2009-11-27 2011-06-01 中兴通讯股份有限公司 Method and system for encoding and decoding lattice vector quantization audio
CN102150202A (en) * 2008-07-14 2011-08-10 三星电子株式会社 Method and apparatus to encode and decode an audio/speech signal
CN102801427A (en) * 2012-08-08 2012-11-28 深圳广晟信源技术有限公司 Encoding and decoding method and system for variable-rate lattice vector quantization of source signal

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2836122C (en) * 2011-05-13 2020-06-23 Samsung Electronics Co., Ltd. Bit allocating, audio encoding and decoding
CN102436819B (en) * 2011-10-25 2013-02-13 杭州微纳科技有限公司 Wireless audio compression and decompression methods, audio coder and audio decoder

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant