CN106981292B

CN106981292B - Multi-channel spatial audio signal compression and recovery method based on tensor modeling

Info

Publication number: CN106981292B
Application number: CN201710342387.2A
Authority: CN
Inventors: 王晶; 谢湘; 刘敏; 单亚慧; 费泽松
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2017-05-16
Filing date: 2017-05-16
Publication date: 2020-04-14
Anticipated expiration: 2037-05-16
Also published as: CN106981292A

Abstract

The invention discloses a multi-channel spatial audio signal compression and recovery method based on tensor modeling, which belongs to the technical field of audio signal processing and, in particular, to spatial audio coding and decoding. Channel energy normalization is performed on the multi-channel spatial audio signals, yielding channel energy adjustment parameters, and the audio signal of each channel is framed and time-frequency transformed to obtain characteristic parameters in the frequency domain. For a training sample set, a fourth-order audio tensor is established and three low-rank factor matrices are obtained through tensor decomposition. A tensor operation between these factor matrices and the third-order audio tensor constructed from the test sample set yields a compressed core tensor, which is encoded and transmitted together with the channel energy adjustment parameters. At the decoding end, tensor reconstruction is performed with the transmitted core tensor and the trained low-rank factor matrices, and the reconstructed tensor signal is inverse transformed, overlap-added and energy-adjusted on each channel to recover the multi-channel spatial audio signals. By modeling the multi-channel spatial audio signals as tensors and training the factor matrices in this distinctive way, the method achieves higher compression efficiency.

Description

Multi-channel spatial audio signal compression and recovery method based on tensor modeling

Technical Field

The invention relates to a multi-channel spatial audio signal compression and recovery method, which belongs to the technical field of audio signal processing, in particular to the technical field of spatial audio coding and decoding.

Background

With the development of digital multimedia technology, spatial audio with more channels is gradually entering everyday life. Additional channels strengthen the sense of surround and improve the listening experience, but the amount of multi-channel audio data grows rapidly with the number of channels. Spatial audio coding and decoding techniques such as MPEG Surround, MPEG AAC and Dolby AC-3 were developed in response. Most of them down-mix the channels into fewer channels at the encoding end, extract spatial parameter information that is transmitted alongside the down-mix, and use that information to up-mix back to a multi-channel signal at the decoding end.

Traditional spatial parameter extraction generally vectorizes the data, which destroys the spatial structure and internal relations of the original data. Moreover, multi-channel spatial audio signals (for example, channel-based or scene-based recordings) need a higher degree of compression to meet low-rate transmission requirements. How to model such signals reasonably, compress them efficiently and reconstruct them effectively is therefore a key problem in spatial audio coding and decoding. A multi-channel spatial audio signal can be regarded as a signal influenced by several factors, so its signal space is well suited to representation by a high-order tensor, and tensor analysis can then be used to perform a low-rank tensor decomposition for compression and reconstruction.

Tensor analysis derives from multilinear analysis. A tensor is a high-order generalization of vectors and matrices; tensor algebra can be used to analyze and process the high-order tensor structure of an object, and it has been widely applied in many fields. In recent years tensors have been applied successfully in multimedia signal processing, for example in constructing tensor faces from four factors (person, expression, viewing angle and illumination), and in extending the traditional eigenvoice technique to a tensor-represented speaker space for speaker adaptation and speaker conversion. In multi-channel audio signal processing, Chinese patent publication CN 102982805A (published March 20, 2013), "A multi-channel audio signal compression method based on tensor decomposition", was the first to introduce tensors into the compression of multi-channel audio signals.

Disclosure of Invention

The main purpose of the invention is to build a high-order model of multi-channel spatial audio signals for efficient compression. It provides a compression and recovery method based on tensor modeling that both accounts for the multiple factors of channel, time and frequency and optimizes the training of the factor matrices.

To achieve this purpose, the basic idea of the method is as follows. For multi-channel spatial audio signals (such as multi-channel audio), time-domain channel energy normalization is first performed to obtain channel energy adjustment parameters, and the audio signal of each channel is energy-adjusted, windowed, framed and time-frequency transformed. The signals are then divided into a training sample set and a test sample set. For the training sample set, a fourth-order audio tensor space is established over sample, channel, time and frequency; a low-rank approximation is computed by tensor decomposition, yielding three low-rank factor matrices (the projection matrix on the sample space is set to the identity matrix) that are used to compress and recover the test sample set. For each signal of the test sample set, a third-order tensor signal over channel, time and frequency is established, and a low-rank core tensor is obtained through a tensor operation with the three trained low-rank factor matrices. The core tensor and the channel energy adjustment parameters are encoded and transmitted, and at the decoding end tensor reconstruction is performed with the core tensor and the trained factor matrices. Finally, the signal of each channel is inverse transformed, overlap-added and energy-adjusted to recover the original multi-channel spatial audio signal.

The invention provides a multi-channel spatial audio signal compression method based on tensor modeling, which comprises the following steps of:

Step one: obtain the channel energy adjustment parameters for time-domain channel energy normalization. For multi-channel spatial audio signals with N channels and M samples, compute the average energy value E_ch of each channel audio signal; then average the P = M × N energy values to obtain the standard normalization parameter E₀, and divide E₀ by the average energy value of each channel to obtain the channel energy adjustment parameter e_ch of that channel:

$$E_{ch} = \frac{1}{C}\sum_{i=1}^{C} x_i^{2} \qquad (1)$$

$$E_0 = \frac{1}{P}\sum_{p=1}^{P} E_{ch}^{(p)} \qquad (2)$$

$$e_{ch} = \frac{E_0}{E_{ch}} \qquad (3)$$

where x_i is the i-th sampling point of the channel audio signal, C is the number of sampling points per channel, and the sum in (2) runs over the P channel signals.
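The normalization of step one can be sketched in a few lines of NumPy. This is an illustrative sketch, not the patent's implementation: the function name `channel_energy_params`, the array layout (M, N, C) and the toy data are assumptions. The adjustment parameter is taken as the plain ratio E₀ / E_ch, following the wording of step one; an amplitude gain that equalizes channel energies exactly would be the square root of that ratio, and the text does not settle which form is intended.

```python
import numpy as np

def channel_energy_params(audio):
    """Return (E_ch, E0, e_ch) for audio of shape (M, N, C).

    M = samples, N = channels, C = sampling points per channel.
    Assumes no channel is entirely silent (E_ch > 0).
    """
    x = np.asarray(audio, dtype=np.float64)
    # Average energy of each channel signal: mean of the squared sample values.
    E_ch = np.mean(x ** 2, axis=-1)            # shape (M, N)
    # Standard normalization parameter: mean of all P = M * N energy values.
    E0 = E_ch.mean()
    # ASSUMPTION: plain ratio E0 / E_ch, following the wording of step one.
    e_ch = E0 / E_ch
    return E_ch, E0, e_ch

# Toy data: 3 samples, 4 channels, 1000 sampling points, unequal channel gains.
rng = np.random.default_rng(0)
gains = rng.uniform(0.1, 2.0, size=(3, 4, 1))
audio = gains * rng.standard_normal((3, 4, 1000))

E_ch, E0, e_ch = channel_energy_params(audio)
normalized = audio * e_ch[..., None]           # encoder: scale each channel by its e_ch
restored = normalized / e_ch[..., None]        # decoder: divide by the same parameter
```

Whatever the exact form of e_ch, the decoder undoes the scaling by dividing by the transmitted parameter, so the round trip is lossless apart from quantization of e_ch.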

Step two: multiply the audio data of each channel by the corresponding e_ch obtained in step one to obtain a normalized multi-channel spatial audio signal; then frame the audio signal of each channel with a Hamming window, with frame length L and frame shift K, so that the audio signal of each channel is divided into a sequence of T frames;

Step three: perform a time-frequency transformation on each frame of the audio signal in each channel to obtain characteristic parameters of length F in the frequency domain;

the time-frequency transformation is an orthogonal transformation, preferably the discrete cosine transform (DCT);
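Steps two and three (Hamming-window framing followed by an orthogonal transform) might look as follows. The helper names `frame_signal` and `dct_matrix` are hypothetical, the DCT is built as an explicit orthonormal DCT-II matrix so the orthogonality is visible, and the signal length of 432000 sampling points is an assumption chosen so that L = 960 and K = 480 from the embodiment give T = 899 frames.

```python
import numpy as np

def frame_signal(x, L, K):
    """Split a 1-D signal into Hamming-windowed frames of length L, hop K."""
    T = 1 + (len(x) - L) // K                       # number of full frames
    idx = np.arange(L)[None, :] + K * np.arange(T)[:, None]
    return x[idx] * np.hamming(L)                   # shape (T, L)

def dct_matrix(F):
    """Orthonormal DCT-II matrix D of size F x F, so that D @ D.T == I."""
    n = np.arange(F)
    D = np.sqrt(2.0 / F) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * F))
    D[0, :] /= np.sqrt(2.0)                         # scale the DC row
    return D

L, K = 960, 480                                     # frame length and shift (embodiment)
x = np.random.default_rng(0).standard_normal(432000)  # ASSUMED length, one channel
frames = frame_signal(x, L, K)                      # (T, L)
coeffs = frames @ dct_matrix(L).T                   # (T, F) with F = L: one DCT per row
```

Because the transform matrix is orthogonal, the decoder can invert it with the transpose, which is why the method only requires an orthogonal transform rather than the DCT specifically.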

Step four: take the F characteristic parameters of each frame of the audio signal of each channel as one row of a matrix, so that the characteristic parameters of the T frames of each channel form a coefficient matrix of size T × F;

the coefficient matrices of the N channels of each sample are arranged in sequence to form a third-order audio tensor space Z of size N × T × F, whose three dimensions are channel, frame sequence and frequency-domain coefficient;

Step five: randomly select M₁ of the M multi-channel samples as the training sample set; the remaining M₂ samples form the test sample set;

for the M₁ training samples, the third-order audio tensor signals constructed in step four are arranged in sequence to form a fourth-order audio tensor space X of size M₁ × N × T × F, whose four dimensions are sample, channel, frame sequence and frequency-domain coefficient;

the parameter M satisfies M = M₁ + M₂, and preferably M₁ ≥ 10;
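The tensor construction of steps four and five amounts to stacking arrays. In the sketch below the sizes are toy values (the embodiment uses M = 28, N = 16, T = 899, F = 960) and the random array `Z` stands in for real DCT coefficients; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T, F = 6, 4, 10, 8                  # toy sizes, not the embodiment's
# One third-order tensor Z (N x T x F) per sample, stacked along axis 0.
Z = rng.standard_normal((M, N, T, F))     # placeholder for real DCT coefficients

M1 = 4                                    # number of training samples (toy value)
perm = rng.permutation(M)                 # random selection of samples
train_idx, test_idx = perm[:M1], perm[M1:]

X = Z[train_idx]                          # fourth-order training tensor, M1 x N x T x F
Y_list = [Z[i] for i in test_idx]         # M2 third-order test tensors, each N x T x F
```

Only the training tensor carries the extra sample dimension; every test signal stays a third-order tensor and is compressed on its own.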

Step six: perform the following tensor decomposition on the fourth-order audio tensor space X constructed in step five:

$$X = S \times_1 U_s \times_2 U_c \times_3 U_t \times_4 U_f \qquad (4)$$

where S is a fourth-order low-rank core tensor whose dimensions on the sample, channel, frame-sequence and frequency-domain subspaces are M₁, R, Q and O, respectively, and U_s, U_c, U_t and U_f are the low-rank factor matrices obtained by projecting the audio tensor X onto those four subspaces:

U_s is the low-rank matrix obtained by projecting the audio tensor X onto the sample space through tensor decomposition, of size M₁ × M₁; since it has no specific physical meaning in the present algorithm, it is set to the identity matrix I;

U_c is the low-rank matrix obtained by projecting the audio tensor X onto the channel space through tensor decomposition, of size N × R, with 1 ≤ R ≤ N;

U_t is the low-rank matrix obtained by projecting the audio tensor X onto the frame-sequence space through tensor decomposition, of size T × Q, with 1 ≤ Q ≤ T;

U_f is the low-rank matrix obtained by projecting the audio tensor X onto the frequency-domain space through tensor decomposition, of size F × O, with 1 ≤ O ≤ F;

here ×₁, ×₂, ×₃ and ×₄ denote the tensor-matrix products along the first, second, third and fourth modes of the tensor, defined as follows: for an N-th order tensor W of size I₁ × … × I_N and a matrix A of size J × I_n, the product is written W ×_n A, and the result is an N-th order tensor of size I₁ × … × I_{n-1} × J × I_{n+1} × … × I_N.
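The tensor-matrix product defined above can be written with `np.tensordot`. This is a minimal sketch; the function name `mode_n_product` and the 0-based mode index are conventions chosen here, not taken from the source.

```python
import numpy as np

def mode_n_product(W, A, n):
    """Mode-n product W x_n A.

    W: tensor of shape (I_1, ..., I_N); A: matrix of shape (J, I_n);
    n: 0-based mode index. The result has I_n replaced by J.
    """
    # Contract axis n of W with axis 1 of A; tensordot appends the new axis last.
    out = np.tensordot(W, A, axes=(n, 1))
    # Move the new axis of length J back to position n.
    return np.moveaxis(out, -1, n)

W = np.arange(24, dtype=float).reshape(2, 3, 4)
A = np.ones((5, 3))                     # maps the mode of length 3 to length 5
B = mode_n_product(W, A, 1)             # shape (2, 5, 4)
```

With an all-ones A, every slice `B[:, j, :]` is simply the sum of W over its second mode, which makes the contraction easy to check by hand.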

for the low-rank factor matrix U_t, the parameter Q is preferably Q = T;

the tensor decomposition is computed by the alternating least squares (ALS) method;
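One common alternating scheme for this kind of decomposition is higher-order orthogonal iteration (HOOI), which refits one factor matrix at a time while the others are held fixed. The sketch below uses it as a stand-in for the ALS procedure named above; whether the patent's ALS matches HOOI in detail is an assumption, and the names `unfold` and `train_factors`, the iteration count and the toy sizes are all hypothetical. U_s is fixed to the identity, so the sample mode is never projected.

```python
import numpy as np

def mode_n_product(W, A, n):
    """W x_n A for a matrix A of shape (J, I_n); n is 0-based."""
    return np.moveaxis(np.tensordot(W, A, axes=(n, 1)), -1, n)

def unfold(W, n):
    """Mode-n unfolding: rows indexed by mode n, columns by all other modes."""
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def train_factors(X, ranks, n_iter=5):
    """X: (M1, N, T, F) training tensor; ranks = (R, Q, O).

    Returns [U_c, U_t, U_f] with orthonormal columns (U_s = identity).
    """
    modes = (1, 2, 3)
    # Initialise each factor with leading left singular vectors (HOSVD).
    U = {n: np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
         for n, r in zip(modes, ranks)}
    for _ in range(n_iter):
        for n, r in zip(modes, ranks):
            P = X
            for m in modes:                      # project onto the other factors
                if m != n:
                    P = mode_n_product(P, U[m].T, m)
            U[n] = np.linalg.svd(unfold(P, n), full_matrices=False)[0][:, :r]
    return [U[n] for n in modes]

# Toy check on a tensor that is exactly low-rank along channel and frequency.
rng = np.random.default_rng(0)
M1, N, T, F = 4, 5, 6, 7
R, Q, O = 2, 6, 3                                # Q = T: frame axis kept at full rank
X = rng.standard_normal((M1, R, Q, O))
for n, shape in zip((1, 2, 3), ((N, R), (T, Q), (F, O))):
    X = mode_n_product(X, rng.standard_normal(shape), n)

U_c, U_t, U_f = train_factors(X, (R, Q, O))
S = X                                            # core tensor: project onto the factors
for n, Un in zip((1, 2, 3), (U_c, U_t, U_f)):
    S = mode_n_product(S, Un.T, n)
X_hat = S                                        # reconstruction from core and factors
for n, Un in zip((1, 2, 3), (U_c, U_t, U_f)):
    X_hat = mode_n_product(X_hat, Un, n)
```

At the embodiment's sizes (20 × 16 × 899 × 960) the unfoldings are large, so a practical implementation would likely use truncated or randomized SVDs; the sketch keeps the full SVD for clarity.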

Step seven: for the M₂ test samples, construct M₂ different third-order tensor signals Y using the tensor modeling of step four, and apply the following tensor operation with the three low-rank factor matrices obtained in step six to obtain the compressed core tensor G of size R × Q × O:

$$G = Y \times_1 U_c^{\mathsf{T}} \times_2 U_t^{\mathsf{T}} \times_3 U_f^{\mathsf{T}} \qquad (5)$$

where U_c^T, U_t^T and U_f^T are the transposes of the three low-rank factor matrices U_c, U_t and U_f obtained in step six;
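The projection of a test tensor onto the trained factor matrices is a single contraction, shown here with `np.einsum`. The factor matrices below are random orthonormal stand-ins rather than trained ones, and the function name `compress` and the toy sizes are assumptions for illustration.

```python
import numpy as np

def compress(Y, U_c, U_t, U_f):
    """Core tensor G: Y multiplied along each mode by the transposed factors.

    G[r, q, o] = sum over n, t, f of Y[n, t, f] * U_c[n, r] * U_t[t, q] * U_f[f, o]
    """
    return np.einsum('ntf,nr,tq,fo->rqo', Y, U_c, U_t, U_f, optimize=True)

rng = np.random.default_rng(0)
N, T, F, R, O = 4, 6, 8, 2, 3
# Stand-in factor matrices with orthonormal columns (QR of random matrices).
U_c = np.linalg.qr(rng.standard_normal((N, R)))[0]
U_t = np.eye(T)                       # Q = T: no projection along the frame axis
U_f = np.linalg.qr(rng.standard_normal((F, O)))[0]

Y = rng.standard_normal((N, T, F))    # one test tensor
G = compress(Y, U_c, U_t, U_f)        # shape (R, Q, O) = (2, 6, 3)
```

The channel and frequency axes shrink from N to R and from F to O, while the frame axis is left untouched when Q = T.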

Step eight: convert the core tensor G obtained in step seven and the channel energy adjustment parameters obtained in step one into one-dimensional signals, then quantize and encode them for transmission to the decoding end; the three low-rank factor matrices do not need to be encoded and transmitted;
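The flattening to a one-dimensional signal and its reversal at the decoder can be illustrated as follows. The source does not specify the quantizer or the coder, so the uniform scalar quantizer with 16 bits per value used here is purely an assumption, as are the function names `quantize` and `dequantize`.

```python
import numpy as np

def quantize(a, n_bits=16):
    """Flatten an array and quantize it uniformly (ASSUMED quantizer)."""
    v = np.asarray(a, dtype=np.float64).ravel()      # one-dimensional signal
    lo, hi = float(v.min()), float(v.max())
    step = (hi - lo) / (2 ** n_bits - 1) or 1.0      # avoid a zero step for constant input
    codes = np.round((v - lo) / step).astype(np.int64)
    return codes, lo, step

def dequantize(codes, lo, step, shape):
    """Map integer codes back to values and restore the tensor shape."""
    return (codes * step + lo).reshape(shape)

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 6, 3))                   # toy core tensor
codes, lo, step = quantize(G)
G_hat = dequantize(codes, lo, step, G.shape)         # decoder side
```

With a uniform quantizer the reconstruction error per entry is bounded by half a quantization step; the channel energy adjustment parameters can be handled the same way with their own range.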

Corresponding to the above compression method, the invention also provides a multi-channel spatial audio signal recovery method based on tensor modeling, which comprises the following steps:

Step nine: at the decoding end, obtain the compressed core tensor G and the channel energy adjustment parameters through decoding and dimension raising (restoring the one-dimensional signals to their original dimensions), and perform tensor reconstruction with the core tensor G and the three trained low-rank factor matrices according to the following formula to recover the third-order tensor space Y':

$$Y' = G \times_1 U_c \times_2 U_t \times_3 U_f \qquad (6)$$

where U_c, U_t and U_f are the received low-rank factor matrices, which are not subjected to encoding; the third-order audio tensor space Y' reconstructed by the above formula has the same dimensions N × T × F as the original third-order audio tensor space Y;
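The reconstruction is the mirror image of the compression contraction. The sketch below again uses random orthonormal stand-ins for the trained factor matrices; `reconstruct` and the toy sizes are hypothetical names and values.

```python
import numpy as np

def reconstruct(G, U_c, U_t, U_f):
    """Y'[n, t, f] = sum over r, q, o of G[r, q, o] * U_c[n, r] * U_t[t, q] * U_f[f, o]."""
    return np.einsum('rqo,nr,tq,fo->ntf', G, U_c, U_t, U_f, optimize=True)

rng = np.random.default_rng(0)
N, T, F, R, O = 4, 6, 8, 2, 3
U_c = np.linalg.qr(rng.standard_normal((N, R)))[0]   # orthonormal columns
U_t = np.eye(T)                                      # Q = T
U_f = np.linalg.qr(rng.standard_normal((F, O)))[0]

Y = rng.standard_normal((N, T, F))
# Encoder side: project Y onto the factor subspaces to get the core tensor.
G = np.einsum('ntf,nr,tq,fo->rqo', Y, U_c, U_t, U_f, optimize=True)
# Decoder side: expand the core tensor back to N x T x F.
Y_rec = reconstruct(G, U_c, U_t, U_f)
# Compressing the reconstruction again returns the same core tensor.
G_again = np.einsum('ntf,nr,tq,fo->rqo', Y_rec, U_c, U_t, U_f, optimize=True)
```

Y_rec has the original dimensions but is only the projection of Y onto the retained subspaces, so it equals Y exactly only when Y already lies in them; this is where the loss of the scheme comes from.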

Step ten: the third-order audio tensor space Y' obtained in step nine has N channels, each channel has a sequence of T frames, and each frame has F characteristic parameters in the frequency domain; the time-domain representation of each frame of the audio signal is obtained by applying the inverse of the time-frequency transformation used in step three;

the inverse time-frequency transformation is the inverse of the transformation in step three: when the DCT is used for the time-frequency transformation, the inverse is the inverse discrete cosine transform (IDCT);

Step eleven: overlap-add the time-domain frames of each channel obtained in step ten, with frame length L and frame shift K, to restore the normalized multi-channel signal; finally, adjust the audio data of each channel using the transmitted channel energy adjustment parameters, that is, divide the normalized multi-channel spatial audio data by the corresponding channel energy adjustment parameter, to restore the original multi-channel spatial audio signal;
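Steps ten and eleven for a single channel can be sketched as an inverse DCT, an overlap-add and a division by e_ch. The function names are hypothetical, and one detail is an assumption not stated in the source: the overlap-added signal is divided by the summed window envelope so that the overlapping Hamming windows give unit gain.

```python
import numpy as np

def dct_matrix(F):
    """Orthonormal DCT-II matrix D (F x F); its inverse is D.T."""
    n = np.arange(F)
    D = np.sqrt(2.0 / F) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * F))
    D[0, :] /= np.sqrt(2.0)
    return D

def overlap_add(frames, K, window):
    """Overlap-add windowed frames of shape (T, L) with frame shift K."""
    T, L = frames.shape
    out = np.zeros((T - 1) * K + L)
    norm = np.zeros_like(out)
    for t in range(T):
        out[t * K:t * K + L] += frames[t]
        norm[t * K:t * K + L] += window
    # ASSUMPTION: compensate the summed Hamming windows to obtain unit gain.
    return out / norm

def decode_channel(coeffs, e_ch, K):
    """coeffs: (T, F) reconstructed DCT coefficients of one channel."""
    L = coeffs.shape[1]
    frames = coeffs @ dct_matrix(L)            # IDCT: inverse of frames @ D.T
    x_norm = overlap_add(frames, K, np.hamming(L))
    return x_norm / e_ch                       # undo the energy adjustment

# Round trip on a toy channel (no tensor compression in between).
L, K, T = 16, 8, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((T - 1) * K + L)
e_ch = 0.5                                     # toy adjustment parameter
idx = np.arange(L)[None, :] + K * np.arange(T)[:, None]
frames = (e_ch * x)[idx] * np.hamming(L)       # encoder: scale, frame, window
coeffs = frames @ dct_matrix(L).T              # encoder: DCT of each frame
x_rec = decode_channel(coeffs, e_ch, K)
```

Without the tensor projection in between, this chain reproduces the input exactly, which isolates the low-rank projection and the quantization as the only lossy stages.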

Compared with the prior art, the invention has the following beneficial effects. In modeling multi-channel spatial audio signals, the method fully considers four influencing factors (sample, channel, time domain and frequency domain). It establishes a fourth-order tensor model for the training samples and obtains the factor matrices in a single tensor decomposition, then establishes a third-order tensor model for each test sample and performs a tensor operation with the trained factor matrices to obtain a low-rank core tensor, thereby achieving efficient compression. This distinctive way of training the factor matrices strengthens the compression of inter-channel and intra-channel redundancy compared both with traditional multi-channel audio coding and decoding methods and with other ways of training the factor matrices.

Drawings

FIG. 1 is a flow diagram of encoding and decoding a multi-path spatial audio signal using tensor decomposition;

Detailed Description

The present invention will be described in detail with reference to the accompanying drawings and embodiments, and technical problems and advantages solved by the technical solutions of the present invention will be described, wherein the described embodiments are only intended to facilitate understanding of the present invention, and do not limit the present invention in any way.

The invention takes multi-channel spatial audio signals collected based on sound channels or scenes as an original database, and utilizes a compression algorithm shown in figure 1 to compress and reconstruct the original database:

Step one: obtain the channel energy adjustment parameters for time-domain channel energy normalization. For multi-channel spatial audio signals with N = 16 channels and M = 28 samples, compute the average energy value E_ch of each channel audio signal; then average the P = 16 × 28 energy values to obtain the standard normalization parameter E₀, and divide E₀ by the average energy value of each channel to obtain the channel energy adjustment parameter e_ch of that channel:

$$E_{ch} = \frac{1}{C}\sum_{i=1}^{C} x_i^{2} \qquad (7)$$

$$E_0 = \frac{1}{P}\sum_{p=1}^{P} E_{ch}^{(p)} \qquad (8)$$

$$e_{ch} = \frac{E_0}{E_{ch}} \qquad (9)$$

where x_i is the i-th sampling point of the channel audio signal, C is the number of sampling points per channel, and the sum in (8) runs over the P channel signals.

Step two: multiply the audio data of each channel by the corresponding one of the 16 × 28 channel energy adjustment parameters obtained in step one to obtain a normalized multi-channel spatial audio signal; then frame the audio signal of each channel with a Hamming window, with frame length L = 960 and frame shift K = 480, so that the audio signal of each channel is divided into T = 899 frames;

Step three: perform a discrete cosine transform (DCT) on each frame of the audio signal in each channel to obtain characteristic parameters of length F = 960 in the frequency domain;

Step four: take the F characteristic parameters of each frame of the audio signal of each channel as one row of a matrix, so that the characteristic parameters of the T frames of each channel form a coefficient matrix of size T × F, that is, 899 × 960;

the coefficient matrices of the N channels of each sample are arranged in sequence to form a third-order audio tensor space Z of size N × T × F, whose three dimensions are channel, frame sequence and frequency-domain coefficient;

the training sample set has M₁ = 20 samples and the test sample set has M₂ = 8 samples; the fourth-order audio tensor space X formed has size 20 × 16 × 899 × 960, and the third-order audio tensor space Y has size 16 × 899 × 960;

$$X \approx S \times_1 U_s \times_2 U_c \times_3 U_t \times_4 U_f \qquad (10)$$

where S is a fourth-order low-rank core tensor whose dimensions on the sample, channel, frame-sequence and frequency subspaces are M₁, R, Q and O, respectively, and U_s, U_c, U_t and U_f are the low-rank factor matrices obtained by projecting the audio tensor X onto those four subspaces;

here ×₁, ×₂, ×₃ and ×₄ denote the tensor-matrix products along the first, second, third and fourth modes of the tensor, defined as follows: for an N-th order tensor W of size I₁ × … × I_N and a matrix A of size J × I_n, the product W ×_n A is an N-th order tensor of size I₁ × … × I_{n-1} × J × I_{n+1} × … × I_N.

The low-rank factor matrix U_s has no physical meaning in the algorithm, so it is set to the identity matrix I, of size 20 × 20. For the factor matrix U_t projected in the time domain, a low-rank projection would seriously degrade the quality of the reconstructed audio, so no low-rank projection is applied along this dimension and Q = 899. By setting R and O to lower values, low-rank projection is carried out on the channel and frequency dimensions to obtain the low-rank factor matrices U_c and U_f, with 1 ≤ R ≤ N and 1 ≤ O ≤ F.

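The compressed core tensor G of each third-order test tensor Y is obtained as

$$G = Y \times_1 U_c^{\mathsf{T}} \times_2 U_t^{\mathsf{T}} \times_3 U_f^{\mathsf{T}} \qquad (11)$$

where U_c^T, U_t^T and U_f^T are the transposes of the three low-rank factor matrices. At the decoding end, the tensor is reconstructed as

$$Y' = G \times_1 U_c \times_2 U_t \times_3 U_f \qquad (12)$$

the third-order audio tensor space Y' reconstructed by the above formula has the same dimensions N × T × F as the original third-order audio tensor space Y, namely 16 × 899 × 960;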

Step ten: the third-order audio tensor space Y' obtained in step nine has 16 channels, each channel has a sequence of 899 frames, and each frame has 960 characteristic parameters in the frequency domain; the time-domain representation of each frame of the audio signal is obtained by the inverse discrete cosine transform (IDCT) corresponding to the DCT of step three;

Step eleven: overlap-add the time-domain frames of each channel obtained in step ten, with frame length L = 960 and frame shift K = 480, to restore the normalized multi-channel spatial audio signal; finally, adjust the audio data of each channel using the transmitted channel energy adjustment parameters, that is, divide the normalized multi-channel spatial audio data by the channel energy adjustment parameter of the corresponding channel, to restore the original multi-channel spatial audio signal;

To further illustrate the compression process, one compression case is described concretely. At the encoding end, a low-rank tensor decomposition of the fourth-order audio tensor X (of size 20 × 16 × 899 × 960) is performed over the channel, time and frequency dimensions with parameters R = 2, Q = 899 and O = 400, yielding a low-rank factor matrix U_c of size 16 × 2, a factor matrix U_t of size 899 × 899 and a low-rank factor matrix U_f of size 960 × 400. A tensor operation between these matrices and a third-order audio tensor Y (of size 16 × 899 × 960) gives a core tensor G of size 2 × 899 × 400. At the decoding end, tensor reconstruction with the core tensor G and the trained low-rank factor matrices recovers the third-order audio tensor Y'. The factor matrices are trained with a modeling scheme different from that of the test samples: the sample index is treated as an additional subspace to construct a fourth-order audio tensor, the factor matrix on that subspace is set to the identity matrix in the low-rank tensor decomposition, and low-rank projection in the other three subspaces gives the three factor matrices used for the subsequent compression and reconstruction of multi-channel spatial audio.

Different values of R and O give core tensors G of different sizes and therefore different compression efficiencies. Because only the channel energy adjustment parameters and the low-rank core tensor are transmitted, and the channel energy adjustment parameters occupy only a few bits compared with the core tensor, the compression effect can be approximated by the compression percentage

$$\text{compression percentage} = \left(1 - \frac{R \times Q \times O}{N \times T \times F}\right) \times 100\%$$

In the experiment, 8 multi-channel spatial audio samples with 16 channels each were used as the test set, and the resulting compression percentages are listed in Table 1. The compression percentage is 94.79% when R = 2 and O = 400, which is much higher than the compression efficiency of the tensor-decomposition-based multi-channel audio signal compression described in patent publication CN 102982805A. Numerous experiments show that the novel factor-matrix training scheme of the invention provides higher compression efficiency for multi-channel spatial audio signal compression.
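The reported figure can be checked with a few lines of arithmetic. The sketch assumes that the compression percentage is one minus the ratio of core-tensor entries to original tensor entries, ignoring the channel energy adjustment parameters; the function name `compression_percentage` is hypothetical. This assumption reproduces the 94.79% quoted for R = 2 and O = 400.

```python
def compression_percentage(N, T, F, R, Q, O):
    """Percentage of tensor entries saved by sending G (R x Q x O) instead of Y (N x T x F)."""
    return 100.0 * (1.0 - (R * Q * O) / (N * T * F))

# Embodiment values: 16 channels, 899 frames, 960 coefficients; R = 2, Q = 899, O = 400.
pct = compression_percentage(N=16, T=899, F=960, R=2, Q=899, O=400)
print(f"{pct:.2f}%")   # prints 94.79%
```

Since Q = T, the frame dimension cancels and the percentage depends only on (R × O) / (N × F) = 800 / 15360.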

TABLE 1 results of percentage compression of a multi-path spatial audio signal

Claims

1. A multi-channel spatial audio signal compression method based on tensor modeling, wherein the multi-channel spatial audio signal refers to an audio signal with N channels and M samples, characterized by comprising the following steps:

(S1) obtaining the average energy value E_ch of each channel audio signal, the normalization parameter E₀ and the channel energy adjustment parameter e_ch corresponding to each channel, according to the following formulas:

$$E_{ch} = \frac{1}{C}\sum_{i=1}^{C} x_i^{2}, \qquad E_0 = \frac{1}{P}\sum_{p=1}^{P} E_{ch}^{(p)}, \qquad e_{ch} = \frac{E_0}{E_{ch}}$$

wherein x_i is the i-th sampling point of the channel audio signal, C is the number of sampling points of each channel, and P = M × N;

(S2) multiplying the audio data of each channel by the corresponding e_ch obtained in step S1 to obtain a normalized multi-channel spatial audio signal, then framing the audio signal of each channel with a Hamming window, wherein the frame length is L and the frame shift is K, so that the audio signal of each channel is divided into a sequence of T frames;

(S3) for the sequence of T frames of each channel, performing a time-frequency transformation on each frame of the audio signal to obtain characteristic parameters of length F in the frequency domain;

(S4) taking the F characteristic parameters of each frame of the audio signal of each channel as one row of a matrix, so that the characteristic parameters of the T frames of each channel form a coefficient matrix of size T × F; and arranging the coefficient matrices of the N channels of each sample in sequence to form a third-order audio tensor space Z of size N × T × F, whose three dimensions are channel, frame sequence and frequency-domain coefficient;

(S5) randomly selecting M₁ samples from the multi-channel audio with M samples as a training sample set, and taking the remaining M₂ = M - M₁ samples as a test sample set; constructing audio tensor signals for the M₁ training samples and arranging them in sequence to form a fourth-order audio tensor space X of size M₁ × N × T × F, wherein M₁, N, T and F are the sizes of the four dimensions of sample, channel, frame sequence and frequency-domain coefficient;

(S6) performing the following tensor decomposition on X of step S5: X = S ×₁ U_s ×₂ U_c ×₃ U_t ×₄ U_f, wherein ×₁, ×₂, ×₃ and ×₄ denote the tensor-matrix products along the first, second, third and fourth modes of the tensor, U_s, U_c, U_t and U_f are the low-rank factor matrices of the projection of X onto the four subspaces of sample, channel, frame sequence and frequency domain, and S is a fourth-order low-rank core tensor whose dimensions on the sample, channel, frame-sequence and frequency-domain subspaces are M₁, R, Q and O, respectively;

(S7) for the M₂ test samples, constructing M₂ different third-order tensor signals Y using the tensor modeling of step S4, and applying the following tensor operation with the three low-rank factor matrices of step S6 to obtain a compressed core tensor G of size R × Q × O:

$$G = Y \times_1 U_c^{\mathsf{T}} \times_2 U_t^{\mathsf{T}} \times_3 U_f^{\mathsf{T}}$$

wherein U_c^T, U_t^T and U_f^T are the transposes of the three low-rank factor matrices U_c, U_t and U_f obtained in step S6; and

(S8) converting the core tensor G obtained in step S7 and the e_ch obtained in step S1 into one-dimensional signals, then quantizing, encoding and transmitting them, wherein the low-rank factor matrices do not require encoded transmission.

2. The method of claim 1, wherein the low-rank matrix U_t on the frame-sequence subspace has size T × Q, wherein 1 ≤ Q ≤ T.

3. The method of claim 2, wherein Q = T.

4. The method of compressing a multi-channel spatial audio signal according to claim 1, wherein said time-frequency transform is an orthogonal transform.

5. The method of claim 4, wherein the orthogonal transform is preferably a discrete cosine transform.

6. The method of claim 1, wherein the low-rank matrix U_s is an identity matrix I of size M₁ × M₁; the low-rank matrix U_c has size N × R, wherein 1 ≤ R ≤ N; and the low-rank matrix U_f has size F × O, wherein 1 ≤ O ≤ F.

7. A multi-path spatial audio signal restoration method based on tensor modeling, comprising the following steps in addition to the steps S1 to S8 recited in claim 1:

(S9) after decoding and dimension raising, performing tensor reconstruction according to the formula Y' = G ×₁ U_c ×₂ U_t ×₃ U_f to recover the third-order tensor space Y', wherein U_c, U_t and U_f are the received low-rank factor matrices, which are not subjected to encoding, and G is the received quantized and encoded core tensor;

(S10) applying to Y' the inverse of the time-frequency transformation used in encoding to obtain the time-domain representation of each frame of the audio signal; and

(S11) overlap-adding the time-domain frames of each channel obtained in step S10, with frame length L and frame shift K, to restore the normalized multi-channel signal, and, using the received quantized and encoded channel energy adjustment parameters, dividing the normalized multi-channel spatial audio data by the corresponding channel energy adjustment parameter to restore the original multi-channel spatial audio signal.

8. The tensor modeling-based multi-path spatial audio signal recovery method as recited in claim 7, further comprising a step of the multi-path spatial audio signal compression method as recited in any one of claims 2 to 6.