CN113096672B - Multi-audio object coding and decoding method applied to low code rate - Google Patents

Multi-audio object coding and decoding method applied to low code rate

Info

Publication number
CN113096672B
CN113096672B · CN202110312781.8A
Authority
CN
China
Prior art keywords
side information
audio object
module
decoding
code stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110312781.8A
Other languages
Chinese (zh)
Other versions
CN113096672A (en)
Inventor
胡瑞敏
吴玉林
王晓晨
胡晨昊
柯善发
张灵鲲
刘文可
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202110312781.8A priority Critical patent/CN113096672B/en
Publication of CN113096672A publication Critical patent/CN113096672A/en
Application granted granted Critical
Publication of CN113096672B publication Critical patent/CN113096672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/0204 — Coding or decoding of speech or audio signals using spectral analysis, using subband decomposition
    • G10L19/032 — Quantisation or dequantisation of spectral components
    • G10L25/18 — Speech or voice analysis characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/30 — Speech or voice analysis characterised by the analysis technique, using neural networks

Abstract

The invention discloses a multi-audio object coding and decoding method applied at low code rate. In the encoding stage, a plurality of input audio objects are first transformed to the frequency domain; the frequency-domain audio object signals are then downmixed to obtain a mixed signal, and the side information matrix over the finely subdivided subbands of each single audio object is calculated; next, the encoding module of a convolutional auto-encoder performs dimensionality reduction on the side information matrix; finally, the mixed signal and the dimension-reduced side information are synthesized into a code stream. In the decoding stage, the received code stream is first decomposed to obtain the downmix signal and the side information; a dense connection module introduced into the decoder network of the convolutional auto-encoder then reconstructs the original high-dimensional side information data from the low-dimensional structure of the side information; finally, the reconstructed frequency-domain audio object signals are converted into time-domain signals. The invention can comprehensively improve the decoding quality of audio object signals at low code rates, so as to meet the user's requirement for personalized control of audio objects.

Description

Multi-audio object coding and decoding method applied to low code rate
Technical Field
The invention belongs to the technical field of digital audio signal processing and relates to an audio object coding and decoding method that compresses side information and reconstructs it with a hybrid network of a convolutional auto-encoder and dense connections. It is suitable for spatial audio personalized interaction systems at low code rates and allows the user to adjust audio objects as required.
Background
Three-dimensional (3D) audio represents an audio object with 3 degrees of freedom (e.g., azimuth, elevation, and distance), and can form sound images anywhere in 3D space. 3D audio technology is mainly used in entertainment systems to provide an immersive and personalized experience. Immersive spatial sound representations fall into three types: channel-based, higher-order-ambisonics-based, and object-based coding techniques. A channel-based sound representation feeds each channel signal to a loudspeaker fixed in position relative to the listener. Although channel-based coding techniques are well established, the audio content they produce is tied to a particular speaker configuration; the techniques are limited by the number of channels and do not meet the user's need for personalized manipulation of audio objects, especially in immersive scenes such as virtual reality and augmented reality somatosensory interactive games. Higher-order-ambisonics-based coding techniques use coefficient signals to reconstruct a 3D spatial sound field. Because the coefficient signals have no direct relation to any channel or object, ambisonics coding of either fundamental or higher order is not suitable for controlling a single object in a sound scene. In object-based coding, each audio object position is completely independent of the loudspeaker positions, and an object signal is rendered to its target position by a personalized rendering system. The object-based encoding method overcomes the dependency of the generated audio content on loudspeaker positions and achieves highly immersive effects in the sound scene, for example a bird or helicopter flying overhead, rain falling from the sky, or thunder arriving from any direction. The object-based coding framework has been successfully used in Dolby Atmos.
A typical representative of object-based coding is Spatial Audio Object Coding (SAOC). The core idea of SAOC is that only one downmix and the side-information parameters are required to transmit a plurality of object signals, so that many audio objects can be coded simultaneously at a low bit rate. However, as the number of audio objects increases and the bit rate drops, the audio objects reconstructed by SAOC suffer from spectral aliasing.
Disclosure of Invention
In order to solve the technical problems, the invention provides a multi-audio object coding and decoding method applied to low code rate, which can comprehensively improve the decoding quality of audio object signals and improve the coding efficiency under low code rate.
The invention provides a multi-audio object coding and decoding method applied to low code rate, which is used for the dimension reduction expression of audio object side information, wherein the dimension reduction expression of the audio object side information comprises the following steps:
step A1: performing time domain-frequency domain transformation on J input independent audio signals through Modified Discrete Cosine Transform (MDCT) to obtain frequency spectrums of object signals;
step A2: performing fine sub-band division on each frame of frequency spectrum data obtained in the step A1; determining the number of fine sub-band partitions according to the influence of the number of sub-bands on the frequency spectrum aliasing distortion;
step A3: calculating the downmix signals of all objects for the sub-band in the step A2 to obtain a downmix signal code stream;
step A4: calculating the side information of each object for the sub-band in the step A2 to obtain a side information matrix;
step A5: transmitting the side information matrix obtained from A4 into an encoder module of a convolution self-encoder to obtain a low-dimensional feature expression result R of the side information of the audio object, and then quantizing the side information value according to a table look-up method to obtain a side information code stream;
step A6: and D, synthesizing the code streams obtained in the step A3 and the step A5 into an output code stream, and transmitting the output code stream to a decoding end.
The invention provides a multi-audio object coding and decoding method applied to a low code rate, which is used for reconstructing original high-dimensional data from a low-dimensional structure and specifically comprises the following steps:
step B1: decomposing the received code stream to obtain a down-mixing signal code stream and a side information code stream;
step B2: decoding the down-mixing signal code stream obtained in the step B1 to obtain a down-mixing signal;
step B3: performing a dequantization operation on the side information code stream obtained in step B1 to obtain the side information;
step B4: inputting the side information obtained in the step B3 into a convolutional self-encoder decoder module with a dense connection module to obtain reconstructed audio object side information;
step B5: obtaining a reconstructed audio object spectrum according to the downmix signal obtained by the B2 and the object side information obtained by the B4;
step B6: and performing Inverse Modified Discrete Cosine Transform (IMDCT) processing according to the audio object frequency spectrum obtained by the B5 to obtain a reconstructed time domain signal of a single object.
Compared with existing audio object coding, the invention has the following advantages: the encoding module of a convolutional auto-encoder (CAE) extracts the effective features of the side information and reduces the dimensionality of the side information parameters to save bit rate, and dense connections (DenseNet) are introduced into the decoding module of the convolutional auto-encoder to enhance feature transfer between the layers of the decoding neural network, so that the audio objects are well reconstructed. The invention can therefore comprehensively improve the decoding quality of audio object signals at low code rates, so as to meet the user's requirement for personalized control of audio objects.
Drawings
FIG. 1 is a flow chart of encoding according to an embodiment of the present invention.
Fig. 2 is a decoding flow diagram of an embodiment of the present invention.
FIG. 3 is a block diagram of a convolutional autoencoder model structure according to an embodiment of the present invention.
Detailed Description
For the convenience of those skilled in the art in understanding and implementing the present invention, the technical solutions are further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments described herein are only used to illustrate and explain the present invention and are not used to limit it.
the invention develops research on the basis of the existing audio object coding method and provides a multi-audio object coding and decoding method applied to low code rate. The method comprises the steps of firstly utilizing an encoding module in a convolution self-encoder to carry out dimension reduction expression on side information, then introducing dense connection into a decoding module of the convolution self-encoder to enhance feature transfer among layers of a decoding neural network, and realizing reconstruction of original high-dimensional side information data from a low-dimensional structure of the side information, so that the low-dimensional features of the side information are fully utilized, and the aim of reducing code rate is fulfilled.
The invention provides a multi-audio object coding and decoding method applied to low code rate, which comprises a coding method and a decoding method;
referring to fig. 1, the encoding method of the present embodiment is specifically implemented by the following steps:
step A1: input as a time-domain signal S of a plurality of audio objects1,S2,…,SJFor different kinds of audio object signals, such as drum set, bass, human voice, etc., the sampling frequency is 44.1kHz, the bit depth is 16 bits, and the audio format is the wav format.
In this embodiment, J independent audio signals S are inputted1,S2,…,SJPerforming time-frequency domain transformation by improving discrete cosine transform (MDCT) to obtain frequency spectrum O of object signal1,O2,…,OJ
In this embodiment, time-domain-frequency-domain conversion is performed on the audio object signal in the time domain through 2048-point modified discrete cosine transform MDCT during time synchronization to obtain a spectrum matrix of a single object, where the number of rows (number of columns) of the matrix is equal to the number of frames, and the number of columns (number of rows) is equal to the number of frequency points.
It should be noted that the frame length, the type of window function, the transformation method, etc. specified herein are only for illustrating the specific implementation steps of the present invention, and are not used to limit the present invention.
Step A2: carrying out fine subband division on each frame of the spectra O_1, O_2, …, O_J obtained in step A1;
in the embodiment, according to the influence of the number of subbands on the aliasing distortion of the restored audio object frequency spectrum, the evaluation index SDR is used for determining the number of fine subband divisions.
In this embodiment, the ERB scale divides each frame signal into 28 subbands, and each subband is uniformly subdivided into 10 finer subbands on the basis of the 28 ERB subbands.
It should be noted that the number of sub-bands specified herein is only for illustrating the specific implementation flow of the present invention, and is not used to limit the present invention.
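The fine subband division of step A2 can be sketched as follows: a minimal NumPy sketch assuming the Glasberg-Moore ERB-rate formula and the embodiment's figures (28 coarse ERB bands at 44.1 kHz, each split uniformly into 10 finer subbands). The function names and the uniform-in-Hz splitting within each ERB band are illustrative assumptions, since the patent does not specify them.

```python
import numpy as np

def erb_scale(f_hz):
    """Glasberg-Moore ERB-rate scale: number of ERBs below f_hz."""
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def erb_inverse(erbs):
    """Inverse of erb_scale: frequency in Hz at a given ERB-rate value."""
    return (10.0 ** (erbs / 21.4) - 1.0) / 0.00437

def fine_subband_edges(fs=44100, n_erb_bands=28, splits_per_band=10):
    """Coarse ERB bands, each uniformly subdivided -> fine subband edges in Hz."""
    nyquist = fs / 2.0
    # 28 band edges equally spaced on the ERB-rate scale, mapped back to Hz
    coarse = erb_inverse(np.linspace(0.0, erb_scale(nyquist), n_erb_bands + 1))
    edges = []
    for lo, hi in zip(coarse[:-1], coarse[1:]):
        # split each coarse band into 10 equal-width fine subbands
        edges.extend(np.linspace(lo, hi, splits_per_band + 1)[:-1])
    edges.append(coarse[-1])
    return np.array(edges)

edges = fine_subband_edges()
print(len(edges) - 1)   # → 280
```

With the embodiment's numbers this yields 28 × 10 = 280 fine subbands per frame.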
Step A3: calculating the downmix signals of all objects for the sub-band in the step A2 to obtain a downmix signal code stream;
In this embodiment, the spectrum information of all objects is combined by matrix addition to obtain the downmix signal data, and the downmix signal is calculated as

X(i, m) = sign( Σ_{j=1}^{J} O_j(i, m) ) · sqrt( Σ_{j=1}^{J} O_j(i, m)² )

where sign(·) is the sign function, used to obtain the sign of a variable; O_j(i, m) is the spectrum information of the jth object, i is the frame index, j is the object index, and m is the frequency point index.
In this embodiment, the downmix signal is encoded by an AAC encoder, and the code rate is controlled to 128kbps, so as to obtain a downmix signal code stream;
it should be noted that the use of AAC 128kbps coding for the final downmix signal is merely to illustrate the specific implementation steps of the present invention and is not intended to limit the present invention.
Step A4: for the subbands in step A2, calculating the side information of each object to obtain the side information matrices G_1, G_2, …, G_J.
In this embodiment, the side information of the object is

G_j(i, b) = P_j(i, b) / Σ_{j'=1}^{J} P_{j'}(i, b)

where P_j(i, b) represents the energy of object j in subband (i, b); I is the total number of frames, J is the number of objects, and B is the total number of subbands; 1 ≤ i ≤ I, 1 ≤ j ≤ J, 1 ≤ b ≤ B.
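A minimal sketch of the side information computation of step A4, assuming (as the surrounding text suggests) that G_j(i, b) is the ratio of object j's subband energy P_j(i, b) to the total energy of all objects in that subband; the original formula is rendered as an image, and the function signature and array layout are illustrative.

```python
import numpy as np

def side_information(spectra, edges_bins):
    """Per-subband energy ratios: G_j(i, b) = P_j(i, b) / sum over all objects.

    spectra: (J, frames, bins) object spectra.
    edges_bins: subband start bin indices, last entry = total number of bins.
    """
    J, I, _ = spectra.shape
    B = len(edges_bins) - 1
    P = np.zeros((J, I, B))
    for b in range(B):
        lo, hi = edges_bins[b], edges_bins[b + 1]
        P[:, :, b] = (spectra[:, :, lo:hi] ** 2).sum(axis=2)   # subband energy
    total = P.sum(axis=0, keepdims=True)
    return P / np.maximum(total, 1e-12)    # guard against silent subbands
```

By construction the ratios of all J objects sum to 1 in every non-silent subband, which is what makes them compact side information relative to the full spectra.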
Step A5: side information matrix G obtained for A41,G2,…,GJTransmitting the audio object side information to an encoder module of a convolution self-encoder to obtain a low-dimensional feature expression result R of the audio object side information and obtain a side information code stream;
In this embodiment, the encoder module of the convolutional self-encoder performs dimension reduction on the side information, reducing the data volume of the original side information. The side information values are then quantized according to a table look-up method, and finally the corresponding quantization indices form the output code stream.
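The table look-up quantization of step A5 can be sketched as follows. The patent does not publish the actual table, so this sketch assumes a uniform 32-entry table on [0, 1] (side-information energy ratios lie in that range) purely for illustration.

```python
import numpy as np

# Hypothetical 32-entry look-up table (5 bits per side-information value);
# the real table used by the patent is not disclosed.
TABLE = np.linspace(0.0, 1.0, 32)

def quantize(g):
    """Map each side-information value to the index of the nearest table entry."""
    return np.argmin(np.abs(g[..., None] - TABLE), axis=-1).astype(np.uint8)

def dequantize(idx):
    """Inverse look-up, as performed at the decoding end in step B3."""
    return TABLE[idx]
```

The quantization indices, not the values themselves, are written into the side information code stream; nearest-entry rounding bounds the error by half a table step.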
Step A6: and (C) synthesizing the code streams obtained in the step (A3) and the step (A5) into an output code stream, and transmitting the output code stream to a decoding end.
Synthesizing the output code stream in this embodiment means integrating the downmix signal code stream with the side information code stream: the downmix signal code stream is the output code stream after AAC coding, and the side information code stream is the quantization index code stream output by the encoder module of the convolutional self-encoder.
Referring to fig. 2, the decoding method of the present embodiment includes the following steps:
step B1: decomposing the received code stream to obtain a down-mixing signal code stream and a side information code stream;
in this embodiment, the downmix signal code stream and the side information code stream are obtained by parsing the code stream according to the code stream received by the decoding end.
Step B2: b1, carrying out AAC decoding on the down-mixed signal code stream obtained in the step B to obtain a down-mixed signal;
step B3: performing a dequantization operation on the side information code stream obtained in step B1 to obtain the side information;
step B4: inputting the side information obtained in the step B3 into a convolutional self-encoder decoder module with a dense connection module to obtain reconstructed audio object side information;
In this embodiment, the side information obtained in step B3 is input into the decoder module of the convolutional self-encoder, in which a densely connected network is added to enhance feature transfer between the layers of the decoding neural network, to obtain the reconstructed audio object side information Ĝ_1, Ĝ_2, …, Ĝ_J. The original high-dimensional side information data is thus reconstructed from the low-dimensional structure of the side information, the low-dimensional features of the side information are fully utilized, and the aim of reducing the code rate is fulfilled.
Referring to fig. 3, in the embodiment of the present invention, a dense connection network is added to a convolutional autocoder decoding module, and the structure includes three modules: module 1, module 2, and module 3;
The module 1 consists of a convolution layer, a reshaping layer, a pooling layer, and a flattening layer, and is used to extract features from the input side information data through a convolutional neural network, compress the extracted features by pooling, and further reduce the features to a low-dimensional expression with the convolution layer;
the module 2 consists of a reshaping layer and two deconvolution layers, where the reshaping layer is densely connected with the two deconvolution layers; it decodes the low-dimensional expression of the side information data features, and the dense connections are introduced to enhance feature transfer between the layers of the decoding neural network;
the module 3 consists of a deconvolution layer, a reshaping layer, and a convolution layer, and further decodes the low-dimensional expression of the side information data features; it can be seen as the inverse operation of module 1.
In this embodiment, the decoded side information is input to a decoding portion of a convolutional auto-encoder that introduces dense connections, and high-dimensional side information data is reconstructed from a low-dimensional side information structure.
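The effect of the dense connections in module 2 can be illustrated with a NumPy toy: each layer receives the concatenation of all preceding feature maps rather than only the previous layer's output, which is what enhances feature transfer between layers. Plain random matrix layers stand in for the real deconvolution layers here; the sizes, layer count, and initialization are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_block(x, n_layers=2, growth=4):
    """Toy DenseNet-style block: every layer sees the concatenation of ALL
    previous feature maps (the dense connections of module 2), not just the
    last layer's output.
    """
    features = [x]                               # running list of all outputs
    for _ in range(n_layers):
        inp = np.concatenate(features, axis=-1)  # dense connection
        w = rng.standard_normal((inp.shape[-1], growth)) * 0.1
        features.append(np.maximum(inp @ w, 0.0))   # linear layer + ReLU
    return np.concatenate(features, axis=-1)
```

The output width grows by `growth` features per layer, so later layers can reuse early low-dimensional side information features directly instead of relying on them surviving every intermediate layer.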
Step B5: obtaining a reconstructed audio object spectrum according to the downmix signal obtained by the B2 and the object side information obtained by the B4;
In this embodiment, the reconstructed audio object spectrum is

Ô_j(i, m) = X̂(i, m) · Ĝ_j(i, b), A_{b−1} ≤ m ≤ A_b − 1

where Ô_j(i, m) is the frequency-domain representation of the reconstructed audio object j, X̂(i, m) is the coded-and-decoded downmix signal, and Ĝ_j(i, b) is the dequantized side information; m is the frequency point index, and A_{b−1} and A_b − 1 represent the start and end frequency points of subband b; 1 ≤ i ≤ I, 1 ≤ j ≤ J, 1 ≤ b ≤ B.
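Step B5 — applying the dequantized side information to the decoded downmix, subband by subband — can be sketched as follows. The original formula is rendered as an image in the source; this sketch follows the surrounding variable definitions, and the function name and array layout are assumptions.

```python
import numpy as np

def reconstruct_objects(downmix_spec, G, edges_bins):
    """Per-subband reconstruction: each object's spectrum is the decoded
    downmix scaled by that object's dequantized side information in the
    subband containing frequency point m.

    downmix_spec: (frames, bins) decoded downmix spectrum.
    G: (J, frames, B) dequantized side information.
    edges_bins: B + 1 subband start bin indices.
    """
    J, I, B = G.shape
    out = np.zeros((J, I, downmix_spec.shape[1]))
    for b in range(B):
        lo, hi = edges_bins[b], edges_bins[b + 1]
        out[:, :, lo:hi] = downmix_spec[None, :, lo:hi] * G[:, :, b:b + 1]
    return out
```

When a single object carries all the energy in a subband, its side information value is 1 there and the reconstruction returns the downmix unchanged in that band.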
Step B6: performing inverse modified discrete cosine transform (IMDCT) processing on the audio object spectra Ô_1, Ô_2, …, Ô_J obtained in step B5 to obtain the reconstructed time-domain signal Ŝ_j of each single object.
In this embodiment, frequency domain-time domain transform is performed by using inverse modified discrete cosine transform IMDCT, and finally a time domain signal of a reconstructed audio object is obtained.
The invention extracts the effective features of the side information with the encoding module of a convolutional auto-encoder (CAE) and reduces the dimensionality of the side information parameters to save bit rate, and it introduces dense connections into the decoding module of the convolutional auto-encoder to enhance feature transfer between the layers of the decoding neural network, so that the audio objects are well reconstructed. The invention can therefore comprehensively improve the decoding quality of audio object signals at low code rates, so as to meet the user's requirement for personalized control of audio objects.
It should be understood that the parts of the specification not described in detail belong to the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A multi-audio object coding and decoding method applied under low code rate is characterized in that: comprises an encoding method and a decoding method;
the coding method is specifically realized by the following steps:
step A1: performing time domain-frequency domain transformation on J input independent audio signals through Modified Discrete Cosine Transform (MDCT) to obtain frequency spectrums of object signals;
step A2: performing fine sub-band division on each frame of frequency spectrum data obtained in the step A1; determining the number of fine sub-band partitions according to the influence of the number of sub-bands on the aliasing distortion of the frequency spectrum;
step A3: calculating the downmix signals of all objects for the sub-band in the step A2 to obtain a downmix signal code stream;
step A4: calculating the side information of each object for the sub-band in the step A2 to obtain a side information matrix;
step A5: transmitting the side information matrix obtained from A4 into an encoder module of a convolution self-encoder to obtain a low-dimensional feature expression result R of the side information of the audio object, and then quantizing the side information value according to a table look-up method to obtain a side information code stream;
step A6: synthesizing the code streams obtained in the step A3 and the step A5 into an output code stream, and transmitting the output code stream to a decoding end;
the decoding method is specifically realized by the following steps:
step B1: decomposing the received code stream to obtain a down-mixing signal code stream and a side information code stream;
step B2: decoding the down-mixing signal code stream obtained in the step B1 to obtain a down-mixing signal;
step B3: performing a dequantization operation on the side information code stream obtained in step B1 to obtain the side information;
step B4: inputting the side information obtained in the step B3 into a convolutional self-encoder decoder module with a dense connection module to obtain the reconstructed audio object side information;
step B5: obtaining a reconstructed audio object spectrum according to the downmix signal obtained by the B2 and the object side information obtained by the B4;
step B6: carrying out Inverse Modified Discrete Cosine Transform (IMDCT) processing according to the audio object frequency spectrum obtained by the B5 to obtain a reconstructed time domain signal of a single object;
wherein, a dense connection network is added in a decoding module of the convolution self-encoder to reconstruct original high-dimensional side information data from a low-dimensional structure of the side information;
the convolutional self-encoder decoding module is added with a dense connection network, and the structure of the convolutional self-encoder decoding module comprises three modules: module 1, module 2, and module 3;
the module 1 consists of a convolution layer, a reshaping layer, a pooling layer and a flattening layer, and is used for extracting features from the input side information data through a convolutional neural network, compressing the extracted features by pooling, and further processing the features into a low-dimensional expression with the convolution layer;
the module 2 consists of a reshaping layer and two deconvolution layers, wherein the reshaping layer is densely connected with the two deconvolution layers, and is used for decoding the low-dimensional expression of the side information data features;
the module 3 consists of a deconvolution layer, a reshaping layer and a convolution layer, and is used for further decoding the low-dimensional expression of the side information data features; its operation is the reverse of module 1.
2. The method of claim 1, wherein the method comprises: in step a1, a time-frequency domain transform is performed on the audio object signal in the time domain by a 2048-point modified discrete cosine transform MDCT to obtain a spectrum of a single object.
3. The method of claim 1, wherein the method comprises: in step a2, the evaluation index SDR is used to determine the number of fine subband divisions according to the influence of the number of subbands on the aliasing distortion of the restored audio object spectrum.
4. The method of claim 1, wherein the method comprises: in step a3, the spectral information of all objects is matrix-added to obtain downmix signal data.
5. The method for multi-audio-object coding and decoding at low code rate according to any of claims 1-4, wherein: in step A4, the side information of the object is

G_j(i, b) = P_j(i, b) / Σ_{j'=1}^{J} P_{j'}(i, b)

where P_j(i, b) represents the energy of object j in subband (i, b); I is the total number of frames, J is the number of objects, and B is the total number of subbands; 1 ≤ i ≤ I, 1 ≤ j ≤ J, 1 ≤ b ≤ B.
6. The method of claim 1, wherein the method comprises: in step B2, AAC decoding is applied to the downmix signal code stream to obtain the downmix signal before encoding.
7. The method of claim 5, wherein the method comprises: in step B5, the reconstructed audio object spectrum is

Ô_j(i, m) = X̂(i, m) · Ĝ_j(i, b), A_{b−1} ≤ m ≤ A_b − 1

where Ô_j(i, m) is the frequency-domain representation of the reconstructed audio object j, X̂(i, m) is the coded-and-decoded downmix signal, and Ĝ_j(i, b) is the dequantized side information; m is the frequency point index, and A_{b−1} and A_b − 1 represent the start and end frequency points of subband b; 1 ≤ i ≤ I, 1 ≤ j ≤ J, 1 ≤ b ≤ B.
8. The method of claim 1, wherein the method comprises: in step B6, frequency domain-time domain transform is performed by using inverse modified discrete cosine transform IMDCT to finally obtain a time domain signal of the reconstructed audio object.
CN202110312781.8A 2021-03-24 2021-03-24 Multi-audio object coding and decoding method applied to low code rate Active CN113096672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110312781.8A CN113096672B (en) 2021-03-24 2021-03-24 Multi-audio object coding and decoding method applied to low code rate

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110312781.8A CN113096672B (en) 2021-03-24 2021-03-24 Multi-audio object coding and decoding method applied to low code rate

Publications (2)

Publication Number Publication Date
CN113096672A (en) 2021-07-09
CN113096672B (en) 2022-06-14

Family

ID=76669589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110312781.8A Active CN113096672B (en) 2021-03-24 2021-03-24 Multi-audio object coding and decoding method applied to low code rate

Country Status (1)

Country Link
CN (1) CN113096672B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610710A (en) * 2017-09-29 2018-01-19 Wuhan University Audio encoding and decoding method for multiple audio objects
CN108596213A (en) * 2018-04-03 2018-09-28 China University of Geosciences (Wuhan) Hyperspectral remote sensing image classification method and system based on convolutional neural networks
CN110739000A (en) * 2019-10-14 2020-01-31 Wuhan University Audio object coding method suitable for personalized interactive system
CN111476342A (en) * 2019-01-23 2020-07-31 StradVision, Inc. CNN method and device using 1xH convolution
CN111508524A (en) * 2020-03-05 2020-08-07 Hefei University of Technology Method and system for identifying voice source equipment
CN112365896A (en) * 2020-10-15 2021-02-12 Wuhan University Object-oriented encoding method based on stacked sparse autoencoder

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1691348A1 (en) * 2005-02-14 2006-08-16 Ecole Polytechnique Federale De Lausanne Parametric joint-coding of audio sources


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Gang et al., "Health factor construction method based on multi-scale AlexNet networks," Systems Engineering and Electronics, 2020, (01). *


Similar Documents

Publication Publication Date Title
US11798568B2 (en) Methods, apparatus and systems for encoding and decoding of multi-channel ambisonics audio data
CA2697830C (en) A method and an apparatus for processing a signal
JP6346278B2 (en) Audio encoder, audio decoder, method, and computer program using joint encoded residual signal
JP2022160597A (en) Apparatus and method for stereo filling in multichannel coding
EP2297728B1 (en) Apparatus and method for adjusting spatial cue information of a multichannel audio signal
CN107610710B (en) Audio coding and decoding method for multiple audio objects
CN109448741B (en) 3D audio coding and decoding method and device
CN110739000B (en) Audio object coding method suitable for personalized interactive system
WO2008100099A1 (en) Methods and apparatuses for encoding and decoding object-based audio signals
US20220139409A1 (en) Audio scene encoder, audio scene decoder and related methods using hybrid encoder-decoder spatial analysis
CN110660401B (en) Audio object coding and decoding method based on high-low frequency domain resolution switching
JP2022548038A (en) Determining Spatial Audio Parameter Encoding and Related Decoding
EP2489036B1 (en) Method, apparatus and computer program for processing multi-channel audio signals
CN108417219B (en) Audio object coding and decoding method suitable for streaming media
CN113314132B (en) Audio object coding method, decoding method and device in interactive audio system
CN113096672B (en) Multi-audio object coding and decoding method applied to low code rate
CN112365896B (en) Object-oriented encoding method based on stack type sparse self-encoder
CN113314131B (en) Multistep audio object coding and decoding method based on two-stage filtering
CN113314130B (en) Audio object coding and decoding method based on frequency spectrum movement
US20240127831A1 (en) Methods, apparatus and systems for encoding and decoding of multi-channel ambisonics audio data
CN117136406A (en) Combining spatial audio streams

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant