CN110739000A

CN110739000A - Audio object coding method suitable for personalized interactive system

Info

Publication number: CN110739000A
Application number: CN201910972165.8A
Authority: CN
Inventors: 胡瑞敏; 胡晨昊; 王晓晨; 武庭照; 吴玉林
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2019-10-14
Filing date: 2019-10-14
Publication date: 2020-01-31
Anticipated expiration: 2039-10-14
Also published as: CN110739000B

Abstract

The invention discloses an audio object coding method suitable for a personalized interactive system. First, a plurality of audio objects to be coded are framed and converted from the time domain to the frequency domain; the objects are sorted by energy to determine the coding order; the object coded in each step and the corresponding downmix signal are extracted iteratively, and the parameters and residual of each step are calculated from them; the large residual matrices are decomposed by singular value decomposition; and the final downmix signal, the parameters and the residual decomposition matrices are compressed into code streams. In the decoding stage, the residuals are reconstructed from the decomposition matrices, and the objects are then decoded and reconstructed step by step from the downmix signal according to the residual and parameters of each object.

Description

Audio object coding method suitable for personalized interactive system

Technical Field

The invention belongs to the technical field of digital audio signal processing, and particularly relates to an audio object coding and decoding method based on multi-step progressive downmixing and reconstruction, which is suitable for a personalized interactive spatial audio system and allows users to adjust audio objects according to their own requirements.

Background

Spatial audio technology based on channel coding, such as MPEG spatial audio coding and the NHK 22.2 speaker array, can encode and reconstruct three-dimensional audio scenes and provides a more immersive listening experience than mono or stereo audio, so it is becoming increasingly popular.

Many scholars and research institutes around the world have carried out research on audio object coding and proposed various audio object coding methods. The most representative is Spatial Audio Object Coding (SAOC), proposed by the well-known German research institute Fraunhofer [document 1], which encodes and transmits a downmix signal of a plurality of audio objects together with side information, and separates and reconstructs the audio objects from the downmix signal at the decoding end based on the side information. The SAOC method can transmit a large number of audio objects at a low bit rate, which greatly improves the coding efficiency of audio objects and enables users to make personalized adjustments and interact according to their own listening needs [document 2].

In the SAOC framework, in order to obtain a lower coding bit rate, the same parameters are used as side information for all frequency points within a subband. This causes aliasing distortion in the frequency domain and severely degrades the listening experience; for example, a reconstructed audio object signal may contain components of other objects in the mix when played [document 3]. This problem may further affect personalized spatial audio interactive services at the user end.

Document 1: Breebaart, J., Engdegård, J., Falch, C., et al.: Spatial audio object coding (SAOC) - the upcoming MPEG standard on parametric object based audio coding. In: Audio Engineering Society Convention 124. Audio Engineering Society (2008).

Document 2: Coleman, P., Franck, A., Francombe, J., et al.: An audio-visual system for object-based audio: From recording to listening. IEEE Transactions on Multimedia 20(8), 1919-.

Document 3: Wu, T., Hu, R., Wang, X., Ke, S.: Audio object coding based on optimal parameter frequency resolution. Multimedia Tools and Applications, pp. 1-16 (2019).

Document 4: Spatial audio object coding with two-step coding structure for interactive audio service. IEEE Transactions on Multimedia 13(6), 1208-1216 (2011).

Document 5: Lee, B., Kim, K., Hahn, M.: Effective residual coding method of spatial audio object coding with two-step coding structure for interactive audio services. IEICE Transactions on Information and Systems 99(7), 1949-.

Disclosure of Invention

To solve the above technical problems, the invention provides an audio object coding and decoding method based on multi-step progressive downmixing and reconstruction, which can perform high-quality audio coding and decoding at medium and low bit rates and ensures that all audio objects have good decoded sound quality.

An audio object coding method suitable for a personalized interactive system, characterized by comprising the following steps:

step A1: performing frame windowing on an input audio object sequence, converting a time domain signal into a frequency domain signal, and obtaining a time-frequency matrix of each audio object;

step A2: calculating the frequency domain energy of each object from its time-frequency matrix, sorting the objects by energy, and determining the object to be coded in each step of the multi-step progressive coding;

step A3: according to the determined coding order, down-mixing step by step and calculating the corresponding side information, wherein step-by-step down-mixing means adding the matrices of the objects input to the current processing step to obtain a sum matrix; the intermediate step-by-step downmix signals are not transmitted in the code stream; the side information comprises object residuals and object gain parameter matrices; and the object gain parameters are calculated from the energy ratio of the two input signals in an object pair;

step A4: decomposing each object residual in the side information into a left singular matrix, a right singular matrix and singular values by singular value decomposition;

step A5: quantizing the singular matrices, the singular values and the object gain parameters to obtain a side information code stream;

step A6: coding the final downmix signal of step A3 to obtain a downmix signal code stream;

step A7: synthesizing the code streams obtained in step A5 and step A6 into an output code stream, and transmitting the output code stream to the decoding end.

Compared with existing audio object coding technology, the invention has the following advantages: multi-step progressive encoding and decoding are used, and residuals compensate for decoding distortion to the greatest extent, so that each audio object has good listening quality; at the same time, singular value decomposition is introduced to decompose and compress the residual information, which reduces the bit rate. The invention can therefore ensure that high-quality audio objects are decoded at medium and low bit rates, meeting the requirements of personalized audio interaction systems.

Drawings

FIG. 1 is a diagram of the encoding principle of an embodiment of the present invention;

FIG. 2 is a diagram of the decoding principle of an embodiment of the present invention.

Detailed Description

To help those skilled in the art understand and practice the invention, the technical solution is described below with reference to the accompanying drawings and specific examples. It should be understood that the examples described herein only illustrate and explain the invention and are not intended to limit it.

First, the coding order is determined from the frequency domain energy of the objects, which fixes the object to be coded and the side information to be calculated in each step; the residual information of each object is finally obtained, which effectively reduces the signal distortion and aliasing of all reconstructed objects. The residual information is then decomposed into three low-dimensional matrices by singular value decomposition, thereby compressing the residual information and reducing the bit rate.

Referring to FIG. 1, the invention proposes a multi-audio-object coding method suitable for a personalized interactive system. This embodiment takes four input objects A, B, C and D as an example, and comprises the following steps:

step A1: inputting audio objects A, B, C, D (which may include various objects such as human voice, piano, guitar, etc.), framing and windowing each object, converting the time domain signal to the frequency domain signal, and obtaining a time-frequency matrix of each audio object;

in this embodiment, the original one-dimensional time domain sound signal is converted into a two-dimensional spectrogram in the frequency domain by framing, windowing and the modified discrete cosine transform (MDCT), and the resulting object data are output in matrix form.

The input audio object signals have a sample rate of 44.1 kHz and a bit depth of 16 bits, in WAV audio format.

It should be noted that the audio parameters and object types specified herein are only for illustrating the implementation process of the present invention, and are not used to limit the present invention.

In the framing and windowing, each frame is 1024 samples long, a Hanning window is used as the window function, and adjacent frames overlap by 50% in the time domain; the time-frequency transform is the modified discrete cosine transform (MDCT) with a transform length of 2048 points. Finally, a plurality of audio object signals are output in matrix form, where the number of rows equals the number of frames and the number of columns equals the number of frequency points (or, equivalently, rows index frequency points and columns index frames).

It should be noted that the frame length, the type of window function, the transformation method, etc. specified herein are only for illustrating the specific implementation steps of the present invention, and are not used to limit the present invention.
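The framing, windowing and MDCT of step A1 can be sketched as follows. This is a minimal illustration and not the patented implementation: it assumes that the 1024-sample frame is the hop between blocks and that each 2048-point MDCT block spans two consecutive frames, and the function name `mdct_frames` and the zero-padding at the signal edges are hypothetical choices made for this example.

```python
import numpy as np

def mdct_frames(signal, hop=1024):
    """Return a (frames x hop) matrix of MDCT coefficients.

    Assumption: each analysis block spans 2 * hop samples (the 2048-point
    transform length) and advances by hop samples, which gives 50% overlap.
    """
    n = 2 * hop
    window = np.hanning(n)  # Hanning window, as named in the embodiment
    x = np.asarray(signal, dtype=float)
    # Zero-pad one hop on each side, then up to a whole number of hops.
    tail = (-len(x)) % hop
    padded = np.concatenate([np.zeros(hop), x, np.zeros(hop + tail)])
    num_frames = len(padded) // hop - 1
    # MDCT basis: cos(pi / hop * (m + 0.5 + hop / 2) * (k + 0.5))
    m = np.arange(n)[:, None]
    k = np.arange(hop)[None, :]
    basis = np.cos(np.pi / hop * (m + 0.5 + hop / 2) * (k + 0.5))
    blocks = np.stack(
        [padded[t * hop : t * hop + n] * window for t in range(num_frames)]
    )
    return blocks @ basis  # rows index frames, columns index frequency points
```

Here the rows of the returned matrix index frames and the columns index frequency points; transposing it gives the P × Q layout (frequency points × frames) used for the residual matrices in step A4.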

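Step A2: according to the time-frequency matrix of each object, calculating the frequency domain energy of the objects, sorting them, and determining the object to be coded in each step of the multi-step progressive coding;

in this embodiment, the frequency domain energy of each object is calculated from the object data in matrix form, the objects are sorted by energy from largest to smallest, and the object to be coded in each step is determined accordingly; that is, audio objects with larger energy are coded first.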

The frequency domain energy share of each object is calculated as follows:

$$O_i = \frac{\|S_i\|}{\sum_{j=1}^{N} \|S_j\|}$$

where ||S_i|| represents the total energy of the i-th audio object, O_i represents the proportion of the i-th object in the total energy of all the objects, and N is the number of objects. The objects are sorted by their O_i values from largest to smallest; in this example the resulting order is D (S₁), B (S₂), A (S₃), C (S₄), and objects with larger O_i values are coded first. It should be noted that i ∈ [1, 4] and the largest-to-smallest order specified here are merely examples of the specific implementation steps of the present invention and are not intended to limit the present invention.
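The energy-based ordering of step A2 can be illustrated with a short sketch. It assumes that the total energy of an object is the sum of its squared time-frequency coefficients and that the total energy is non-zero; the function name `coding_order` and the dictionary input are hypothetical.

```python
import numpy as np

def coding_order(objects):
    """objects: dict mapping an object name to its time-frequency matrix.

    Returns the names sorted by descending energy share, plus the shares.
    Assumption: energy is the sum of squared coefficients.
    """
    energy = {name: float(np.sum(np.square(tf))) for name, tf in objects.items()}
    total = sum(energy.values())
    share = {name: e / total for name, e in energy.items()}  # O_i for each object
    order = sorted(share, key=share.get, reverse=True)       # largest share first
    return order, share
```

For four objects whose energies rank D > B > A > C, the returned order is D, B, A, C, matching the example above.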

Step A3: according to the coding order, down-mixing step by step and calculating the corresponding side information (object residual, singular matrices and singular values);

in this embodiment, step-by-step down-mixing means adding the matrices of the objects input to the current processing step to obtain a sum matrix; the intermediate step-by-step downmix signals are not transmitted in the code stream; the side information comprises the object residuals and the object gain parameter matrices, and the object gain parameters are calculated from the energy ratio of the two input signals in an object pair;

the calculation formula of the object residual and the object gain parameter is as follows:

where R(i) is the residual signal of the (i+1)-th object, G_o(i) is the gain parameter of the (i+1)-th object, and G_d(i) is the gain parameter of the i-th downmix signal; X_i represents the downmix signal obtained in step i, P_o(i) is the energy of object i, and P_d(i) is the energy of the downmix signal of the i-th step. In this embodiment N = 4, where N is the number of objects to be encoded.

It should be noted that the number N of objects defined herein is 4, which is merely an example of the implementation steps of the present invention and is not used to limit the present invention.

In this example, following the coding order determined in step A2 and the above formula, the multi-step down-mix calculation proceeds as follows. In the first step, objects D and B form the object pair for down-mixing and parameter extraction (in the first step, D is treated as the downmix signal for the calculation); the downmix signal X₁ of the two objects is obtained, and the gain parameter G_o(1) of the second object B and its residual R(1) are calculated. In the second step, the downmix signal X₁ and object A form the object pair for down-mixing and parameter extraction; the second-step downmix signal X₂ is obtained, and the gain parameter G_o(2) of the third object A and its residual R(2) are calculated. In the third step, the downmix signal X₂ and object C form the object pair for down-mixing and parameter extraction; the third-step downmix signal X₃ (i.e., the final downmix signal that needs to be transmitted to the decoding end) is obtained, and the gain parameter G_o(3) of the fourth object C and its residual R(3) are calculated. At this point, the four objects have completed down-mixing and parameter extraction in three steps.

It should be noted that the encoding sequence and the number of steps specified herein are only for illustrating the specific implementation steps of the present invention, and are not used to limit the present invention.
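The three-step procedure above can be sketched in code. Because the formulas of this section are not reproduced in the text, the expressions below are assumptions chosen only to be consistent with the description: the gains are taken as energy ratios G_o = P_o / (P_o + P_d) and G_d = P_d / (P_o + P_d), one scalar pair per step rather than a gain parameter matrix, and the residual is taken as R = S − G_o·X. The function name `progressive_downmix` is hypothetical.

```python
import numpy as np

def progressive_downmix(ordered_objects):
    """ordered_objects: time-frequency matrices sorted by descending energy.

    Returns the final downmix and per-step object gains, downmix gains
    and residuals. The gain and residual formulas are assumed, see lead-in.
    """
    downmix = np.array(ordered_objects[0], dtype=float)  # step 1 treats the first object as the downmix
    gains_o, gains_d, residuals = [], [], []
    for obj in ordered_objects[1:]:
        obj = np.asarray(obj, dtype=float)
        p_d = np.sum(downmix ** 2)   # energy of the current downmix
        p_o = np.sum(obj ** 2)       # energy of the object joining the mix
        mixed = downmix + obj        # matrix addition gives the next downmix
        g_o = p_o / (p_o + p_d)      # assumed object gain (energy ratio)
        g_d = p_d / (p_o + p_d)      # assumed downmix gain (energy ratio)
        gains_o.append(g_o)
        gains_d.append(g_d)
        residuals.append(obj - g_o * mixed)  # what the gain alone cannot recover
        downmix = mixed
    return downmix, gains_o, gains_d, residuals
```

With four objects in the order D, B, A, C this yields three gain pairs and three residuals, and only the last downmix is kept for transmission, as in the example.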

Step A4: decomposing the object residual in the side information into a left singular matrix, a right singular matrix and singular values by singular value decomposition;

in this embodiment, the residual matrices of the objects are reduced in dimension and compressed by singular value decomposition, which limits the increase in data volume caused by the residual information; each residual matrix is decomposed into three smaller matrices, namely a left singular matrix, a singular value matrix and a right singular matrix; for the singular value matrix, only the values on the diagonal are transmitted.

Singular value decomposition (SVD) is a matrix factorization, closely related to eigenvalue decomposition, that reduces a matrix to its constituent parts, so that a high-dimensional matrix can be represented by several low-dimensional matrices and the data can be compressed:

$$R(i)_{P \times Q} = U_{P \times P}\, \Lambda_{P \times Q}\, V^{T}_{Q \times Q}$$

where R(i)_{P×Q} is the residual signal of the (i+1)-th object, the number of rows P is half the MDCT transform length, the number of columns Q is the number of frames of the audio object, U is the left singular matrix, Λ is the singular value matrix, V is the right singular matrix, and the singular values on the diagonal of Λ are sorted from largest to smallest.

For dimensionality reduction, the first r singular values (r = 50) and the corresponding parts of the singular matrices may be selected to approximate R(i), as follows:

$$R(i)_{P \times Q} \approx U_{P \times r}\, \Lambda_{r \times r}\, V^{T}_{r \times Q}$$

where Λ_{r×r} is the leading part of the singular value matrix, and U_{P×r} and V^T_{r×Q} are the first 50 rows (or columns) of the original left and right singular matrices. The residual signal can be approximately represented by these three matrices, which reduces the matrix dimensions and compresses the side information data volume.

It should be noted that r = 50 is given only to illustrate the specific implementation steps of the present invention and is not used to limit the present invention.
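The rank-r approximation of step A4 can be sketched with NumPy's `numpy.linalg.svd`, which returns the singular values in descending order. The function names `compress_residual` and `restore_residual` are hypothetical, and the sketch leaves quantization aside.

```python
import numpy as np

def compress_residual(residual, r=50):
    """Keep the r largest singular values and the matching singular vectors."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    r = min(r, len(s))               # a matrix cannot have more than min(P, Q) singular values
    return u[:, :r], s[:r], vt[:r, :]

def restore_residual(u_r, s_r, vt_r):
    """Multiply the three factors back together (matrix synthesis)."""
    return (u_r * s_r) @ vt_r        # scales column j of u_r by s_r[j], i.e. U diag(s) V^T
```

For a P × Q residual, the three factors hold r(P + Q + 1) values instead of P·Q, which is where the reduction in side information comes from when r is much smaller than P and Q.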

Step A5: quantizing the singular value, the singular matrix and the object gain parameter to obtain a side information code stream;

in the quantization operation, the elements of the residual decomposition matrices and the gain parameters have different value ranges, so they are normalized before quantization so that a unified quantization table can be used; the closest quantization value is then looked up in the quantization table for each element value, and the corresponding quantization indices are output as the side information quantization code stream.
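The normalize-then-look-up quantization of step A5 can be sketched as follows. The text does not specify the normalization or the table, so this sketch assumes normalization by the largest absolute value and a uniform table on [-1, 1] with 2^bits entries; the names `quantize` and `dequantize` and the `bits` parameter are hypothetical, and the scale factor is assumed to be sent alongside the indices.

```python
import numpy as np

def quantize(values, bits=8):
    """Normalize to [-1, 1], then return the index of the nearest table entry."""
    values = np.asarray(values, dtype=float)
    scale = float(np.max(np.abs(values))) or 1.0   # fall back to 1.0 for an all-zero input
    table = np.linspace(-1.0, 1.0, 2 ** bits)      # assumed unified quantization table
    normalized = values / scale
    # Nearest-entry search: distance of every element to every table value.
    indices = np.abs(normalized[..., None] - table).argmin(axis=-1)
    return indices, scale, table

def dequantize(indices, scale, table):
    """Look the indices up in the table and undo the normalization."""
    return table[indices] * scale
```

The same pair of functions can be applied to the singular matrices, the singular values and the gain parameters, since normalization brings them all into the range covered by the one table.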

Step A6: coding the final downmix signal of step A3 to obtain a downmix signal code stream;

in this embodiment, the final downmix signal is the basis for reconstructing the object signals at the decoding end, and is encoded with AAC at 128 kbps.

It should be noted that the 128 kbps AAC coding of the final downmix signal only illustrates the specific implementation steps of the present invention and is not used to limit the present invention.

Referring to FIG. 2, the invention also provides a multi-audio-object decoding method suitable for a personalized interactive system. This embodiment again takes the four objects A, B, C and D as an example, and comprises the following steps:

step B1: analyzing the received code stream to obtain a side information code stream and a final downmix signal code stream;

in this embodiment, parsing the code stream means reversing the procedure used to synthesize the output code stream, so as to obtain the final downmix signal code stream and the side information code stream.

Step B2: carrying out AAC decoding on the down-mixed signal code stream to obtain a down-mixed signal;

in this embodiment, the final downmix signal code stream is a data stream obtained after AAC encoding and compressing, and the final downmix signal before transmission can be obtained after AAC decoding.

Step B3: the side information code stream is dequantized to obtain a left singular matrix, a right singular matrix, singular values and object gain parameters;

in this embodiment, the side information is normalized when it is quantized, and is correspondingly de-normalized when it is dequantized.

Step B4: performing matrix synthesis on the left singular matrix, the right singular matrix and the singular value to recover an object residual error;

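in this embodiment, matrix synthesis means multiplying the left singular matrix, the singular value matrix and the right singular matrix to obtain an approximate object residual, as shown in the formula:

$$\hat{R}(i) = U\, \Lambda\, V^{T}$$

where U, Λ and V are the dequantized left singular matrix, singular value matrix and right singular matrix, and R̂(i) is the approximate residual of the (i+1)-th object;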

step B5: decoding backward according to the coding order, and circularly reconstructing an audio object frequency domain signal from the transmission downmix signal by using the side information;

the object is separated from the corresponding downmix signal using the object gain parameter, and the residual signal is then applied to compensate for aliasing distortion, giving the reconstructed audio object frequency domain signal, as shown in the following formula:

where S'_i is the reconstructed frequency domain object signal, X'_i is the reconstructed progressive downmix signal, G_d(i) is the gain parameter of the downmix signal in each step, and R̂(i) is the residual information obtained by matrix synthesis at the decoding end, i.e. the result of step B4. The decoding order of the objects is the reverse of the encoding order, and each object is reconstructed from the step-by-step downmix signal in its corresponding decoding step.

In this example, following the decoding order determined in step B5 and equations (8), (9) and (10) above, the multi-step progressive reconstruction of the objects proceeds as follows. In the first step, the gain parameter G_o(3) and its residual R̂(3) are used to reconstruct object C (i.e., S'₄) from the final downmix signal X₃, and the gain parameter G_d(3) is used to reconstruct the progressive downmix signal X'₂ from the final downmix signal X₃. In the second step, the gain parameter G_o(2) and its residual R̂(2) are used to reconstruct object A (i.e., S'₃) from the progressive downmix signal X'₂, and the gain parameter G_d(2) is used to reconstruct the progressive downmix signal X'₁ from the progressive downmix signal X'₂. In the third step, the gain parameter G_o(1) and its residual R̂(1) are used to reconstruct object B (i.e., S'₂) from the progressive downmix signal X'₁, and the reconstructed object B is subtracted from the progressive downmix signal X'₁ to obtain the reconstructed object D (i.e., S'₁). In this way, the objects are restored in turn from the corresponding progressive downmix signals through three decoding steps, and the reconstructed signals are compensated with the residual information to reduce the loss of sound quality caused by aliasing distortion.

It should be noted that the four objects A, B, C and D and the number of decoding steps are only used to illustrate the implementation steps of the present invention and are not used to limit the present invention.
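The reverse-order reconstruction of step B5 can be sketched as follows. As the decoding formulas are not reproduced in the text, the sketch assumes S' = G_o·X' + R̂ for the object and X'_prev = G_d·X' − R̂ for the previous downmix, with scalar gains satisfying G_o + G_d = 1, so that the second expression equals X' − S'. The function name `progressive_decode` is hypothetical.

```python
import numpy as np

def progressive_decode(final_downmix, gains_o, gains_d, residuals):
    """Rebuild the objects from the final downmix, last-coded object first.

    gains_o, gains_d and residuals are listed in coding order (step 1 first).
    Returns the objects in coding order. The formulas are assumed, see lead-in.
    """
    x = np.asarray(final_downmix, dtype=float)
    decoded = []
    steps = zip(reversed(gains_o), reversed(gains_d), reversed(residuals))
    for g_o, g_d, r in steps:
        decoded.append(g_o * x + r)  # separate the object, then add its residual
        x = g_d * x - r              # previous-step downmix (equals x minus the object here)
    decoded.append(x)                # what remains is the first coded object
    return decoded[::-1]
```

With the example order D, B, A, C, the loop recovers C, then A, then B, and the remaining downmix is D. With unquantized residuals the reconstruction is exact under these assumptions; in practice the truncated and quantized residuals make it approximate.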

Step B6: and converting the audio object signal in the frequency domain into the time domain by using time-frequency inverse transformation.

In this embodiment, the progressively reconstructed object signals are still frequency domain signals and must be converted to the time domain by the inverse time-frequency transform before subsequent functions such as rendering, personalized interaction and playback can be performed. The inverse transform in the decoding method therefore applies windowing and the inverse modified discrete cosine transform (IMDCT) to the object frequency domain signals to obtain continuous time domain signals.

Compared with existing audio object coding methods, the method has the following advantages and characteristics:

multi-step progressive encoding and decoding are used, and residuals compensate for decoding distortion to the greatest extent, so that each audio object has good listening quality; at the same time, singular value decomposition is introduced to decompose and compress the residual information, which reduces the bit rate. The invention can therefore ensure that high-quality audio objects are decoded at medium and low bit rates, meeting the requirements of personalized audio interaction systems.

Claims

1. An audio object coding method adapted to a personalized interactive system, comprising the steps of:

2. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A1, the original one-dimensional time domain sound signal is converted into a two-dimensional spectrogram in the frequency domain by framing, windowing and the modified discrete cosine transform (MDCT), and the resulting object data are output in matrix form.

3. The audio object coding method adapted to a personalized interaction system according to claim 1, characterized in that: in step A2, the frequency domain energy of each object is calculated from the object data in matrix form, the objects are sorted by energy from largest to smallest, and the object to be coded in each step is determined accordingly; the coding order means that audio objects with larger energy are coded first;

the frequency domain energy share of each object is calculated as follows:

$$O_i = \frac{\|S_i\|}{\sum_{j=1}^{N} \|S_j\|}$$

where ||S_i|| represents the total energy of the i-th audio object, and O_i represents the proportion of the i-th object in the total energy of all the objects; the objects are sorted by their O_i values from largest to smallest in the order D (S₁), B (S₂), A (S₃), …, C (S_N), where N is the number of objects to be encoded, and objects with larger O_i values are coded first.

4. The audio object coding method adapted to a personalized interaction system according to claim 1, characterized in that: in step A3, the objects are down-mixed step by step and the side information of the coded objects is calculated, and the side information of only one object is calculated in each step;

where R(i) is the residual signal of the (i+1)-th object, G_o(i) is the gain parameter of the (i+1)-th object, and G_d(i) is the gain parameter of the i-th downmix signal; X_i represents the downmix signal obtained in step i, P_o(i) is the energy of object i, and P_d(i) is the energy of the downmix signal of the i-th step; N represents the number of objects to be encoded.

5. The audio object coding method adapted to a personalized interaction system according to claim 1, characterized in that: in the step A4, carrying out dimension reduction compression on residual error matrixes of a plurality of objects by a singular value decomposition method, and reducing data volume increase brought by residual error information; decomposing the residual matrix into three small matrixes, namely a left singular matrix, a singular value matrix and a right singular matrix; wherein the singular value matrix transmits only the values on the matrix diagonal.

6. The method of claim 1, wherein in the step A5, the side information is quantized by a table lookup method, the element values of the residual decomposition matrix and the gain parameter matrix are normalized before quantization, the closest quantization value is looked up in a quantization table according to the size of each element value, and the corresponding quantization index is outputted as the side information quantization code stream.

7. The audio object coding method adapted to a personalized interaction system according to claim 1, characterized in that: in step A6, the final downmix signal is encoded by an AAC encoder and a code stream is then output.

8. The audio object coding method adapted to a personalized interaction system according to claim 1, characterized in that: in step A7, synthesizing the output code stream means merging the final downmix signal code stream and the side information code stream and adding flag bits for identification and parsing; the final downmix signal code stream is the output code stream after AAC coding, and the side information code stream is the quantization index code stream output after the residual decomposition matrices and the gain parameters are quantized.

9. An audio object decoding method adapted to a personalized interactive system, characterized by decoding the code stream generated by the method of any one of claims 1-8;

the specific implementation comprises the following substeps:

step B1: analyzing the received code stream to obtain a side information code stream and a down-mixing signal code stream;

step B3: the side information is dequantized to obtain a left singular matrix, a right singular matrix, a singular value and an object gain parameter;

step B6: converting the audio object signals in the frequency domain to the time domain using the inverse time-frequency transform.

10. The audio object decoding method adapted to a personalized interactive system according to claim 9, characterized in that: in step B4, the matrix synthesis is to multiply the left singular matrix, the singular value matrix, and the right singular matrix to obtain an approximate object residual error.