CN109448741B - 3D audio coding and decoding method and device - Google Patents

3D audio coding and decoding method and device

Info

Publication number
CN109448741B
Authority
CN
China
Prior art keywords
code stream
signal
sound channel
coding
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811395574.8A
Other languages
Chinese (zh)
Other versions
CN109448741A (en)
Inventor
闫建新
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Rise Technology Co Ltd
Original Assignee
Digital Rise Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Rise Technology Co Ltd
Priority to CN201811395574.8A
Publication of CN109448741A
Application granted
Publication of CN109448741B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a 3D audio coding and decoding method and device. The 3D audio coding method comprises: S110, inputting a channel signal, a target signal and metadata; S120, coding the channel signal through a channel core coder to obtain a channel code stream; S130, encoding the target signal through a target encoder to obtain a target code stream; S140, encoding the metadata through a metadata encoder to obtain a metadata code stream; S150, packing the channel code stream, the target code stream and the metadata code stream in a frame format according to a 3D audio data structure, and outputting the 3D audio code stream. The invention realizes efficient coding and decoding of the 3D audio code stream.

Description

3D audio coding and decoding method and device
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and an apparatus for 3D audio encoding and decoding.
Background
With the future development of applications such as ultra-high-definition television, the requirements on audio increase further in order to obtain an immersive auditory effect. To this end, the number of channels of the input audio signal grows significantly (for example, 5.1.4, 7.1.4 or 22.2), independent target (object) audio signals are added, and data information (metadata) related to the channels and the target signals is carried. Efficiently compressing this information yields a 3D audio code stream suitable for efficient transmission and storage.
Conventional DRA coding handles only channel signals, includes no enhancement coding tools such as bandwidth extension (BWE), and cannot efficiently code a 3D channel audio signal (it makes poor use of inter-channel correlation), for example the three-layer 22.2-channel case. In addition, it supports neither encoding target audio signals nor encoding the related data information (metadata).
CDR (China Digital Radio) coding supports only mono, stereo and 5.1-channel coding; it adds the SBR (Spectral Band Replication) coding tool on top of DRA and does not support coding 3D audio signals such as three-layer 22.2-channel signals.
Current 3D audio coding standards, such as MPEG-H 3D Audio, Dolby AC-4 and Auro, differ in coding architecture and are built from different technical modules, but the generated 3D audio code streams are inefficient, and the 3D audio code stream cannot be encoded and decoded efficiently.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a 3D audio coding and decoding method and device that realize efficient coding and decoding of a 3D audio code stream.
The technical scheme provided by the invention for the technical problem is as follows:
in one aspect, the present invention provides a 3D audio encoding method, including:
s110, inputting a channel signal, a target signal and metadata;
s120, coding the sound channel signal through a sound channel core coder to obtain a sound channel code stream;
s130, encoding the target signal through a target encoder to obtain a target code stream;
s140, encoding the metadata through a metadata encoder to obtain a metadata code stream;
s150, packing the sound channel code stream, the target code stream and the metadata code stream in a frame format according to a 3D audio data structure, and outputting a 3D audio code stream;
the 3D audio data structure comprises frame header information, sound channel coding information, target coding information and metadata coding information which are sequentially arranged; or, the 3D audio data structure includes frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal, which are sequentially arranged;
the data structure of the channel code stream comprises frame header information, middle-layer channel coding information, control information of middle-layer channel BWE information, other-layer channel coding information, control information of other-layer channel BWE information and other-layer channel BWE information which are sequentially arranged; or, the data structure of the channel code stream comprises frame header information, channel coding information, control information of channel BWE information, and channel BWE information, which are sequentially arranged;
the data structure of the target code stream comprises frame header information, target coding information, control information of target BWE information and target BWE information which are sequentially arranged;
the data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged.
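As a rough illustration of the frame packing in S150, the sketch below serializes the three code streams behind a small frame header. The sync word and 16-bit length fields are hypothetical stand-ins, not the patent's actual header layout:

```python
import struct

def pack_3d_audio_frame(channel_bits: bytes, object_bits: bytes,
                        metadata_bits: bytes, sync_word: int = 0x7FFF) -> bytes:
    """Pack one frame: header, then channel coding info, target (object)
    coding info and metadata coding info in sequence (first structure
    variant). Field widths here are illustrative only."""
    header = struct.pack(">HHHH", sync_word, len(channel_bits),
                         len(object_bits), len(metadata_bits))
    return header + channel_bits + object_bits + metadata_bits

def split_3d_audio_frame(frame: bytes):
    """Inverse of pack_3d_audio_frame: recover the three code streams."""
    sync, n_ch, n_obj, n_md = struct.unpack(">HHHH", frame[:8])
    body = frame[8:]
    return body[:n_ch], body[n_ch:n_ch + n_obj], body[n_ch + n_obj:]
```

The decoder-side split in S210 is then just the inverse header parse, which is what makes a fixed ordering of the three sub-streams attractive.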
In another aspect, the present invention provides a 3D audio decoding method, including:
s210, inputting a 3D audio code stream, and splitting the 3D audio code stream into a channel code stream, a target code stream and a metadata code stream;
s220, decoding the sound channel code stream through a sound channel core decoder to obtain a sound channel signal;
s230, decoding the target code stream through a target decoder to obtain a target signal;
s240, decoding the metadata code stream through a metadata decoder to obtain metadata;
s250, rendering the sound channel signal and the target signal according to the metadata, and outputting the rendered signal to a corresponding terminal for playing according to user interaction information;
the data structure of the 3D audio code stream comprises frame header information, sound channel coding information, target coding information and metadata coding information which are sequentially arranged; or, the 3D audio data structure includes frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal, which are sequentially arranged;
the data structure of the channel code stream comprises frame header information, middle-layer channel coding information, control information of middle-layer channel BWE information, other-layer channel coding information, control information of other-layer channel BWE information and other-layer channel BWE information which are sequentially arranged; or, the data structure of the channel code stream comprises frame header information, channel coding information, control information of channel BWE information, and channel BWE information, which are sequentially arranged;
the data structure of the target code stream comprises frame header information, target coding information, control information of target BWE information and target BWE information which are sequentially arranged;
the data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged.
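A minimal sketch of the rendering idea in S250, assuming the decoded target signal is mono and the metadata has already been resolved into per-channel gains. This is a deliberate simplification: the patent's metadata carries spatial position, motion trajectory and similar parameters, from which such gains would be derived:

```python
def render(channel_sig, object_sig, object_gains):
    """Mix a decoded mono object into the decoded channel bed using
    per-channel gains derived from metadata (hypothetical simplification
    of S250). channel_sig: list of per-channel sample lists."""
    out = []
    for ch, bed in enumerate(channel_sig):
        g = object_gains[ch]  # metadata-derived gain for this channel
        out.append([b + g * o for b, o in zip(bed, object_sig)])
    return out
```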
In another aspect, the present invention provides a 3D audio encoding apparatus capable of implementing all the processes of the 3D audio encoding method, where the 3D audio encoding apparatus includes:
a first input module for inputting a channel signal, a target signal and metadata;
the sound channel core encoder is used for encoding the sound channel signals by adopting a sound channel core encoding algorithm to obtain sound channel code streams;
the target encoder is used for encoding the target signal to obtain a target code stream;
the metadata encoder is used for encoding the metadata to obtain a metadata code stream; and
the output module is used for packing the sound track code stream, the target code stream and the metadata code stream in a frame format according to a 3D audio data structure and outputting the 3D audio code stream;
the 3D audio data structure comprises frame header information, sound channel coding information, target coding information and metadata coding information which are sequentially arranged; or, the 3D audio data structure includes frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal, which are sequentially arranged;
the data structure of the channel code stream comprises frame header information, middle-layer channel coding information, control information of middle-layer channel BWE information, other-layer channel coding information, control information of other-layer channel BWE information and other-layer channel BWE information which are sequentially arranged; or, the data structure of the channel code stream comprises frame header information, channel coding information, control information of channel BWE information, and channel BWE information, which are sequentially arranged;
the data structure of the target code stream comprises frame header information, target coding information, control information of target BWE information and target BWE information which are sequentially arranged;
the data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged.
In another aspect, the present invention provides a 3D audio decoding apparatus, which is capable of implementing all the processes of the 3D audio decoding method, where the 3D audio decoding apparatus includes:
the second input module is used for inputting a 3D audio code stream and splitting the 3D audio code stream into a channel code stream, a target code stream and a metadata code stream;
the sound channel core decoder is used for decoding the sound channel code stream to obtain a sound channel signal;
the target decoder is used for decoding the target code stream to obtain a target signal;
the metadata decoder is used for decoding the metadata code stream to obtain metadata; and
the renderer is used for rendering the sound channel signals and the target signals according to the metadata and outputting the rendered signals to a corresponding terminal for playing according to user interaction information;
the data structure of the 3D audio code stream comprises frame header information, sound channel coding information, target coding information and metadata coding information which are sequentially arranged; or, the 3D audio data structure includes frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal, which are sequentially arranged;
the data structure of the channel code stream comprises frame header information, middle-layer channel coding information, control information of middle-layer channel BWE information, other-layer channel coding information, control information of other-layer channel BWE information and other-layer channel BWE information which are sequentially arranged; or, the data structure of the channel code stream comprises frame header information, channel coding information, control information of channel BWE information, and channel BWE information, which are sequentially arranged;
the data structure of the target code stream comprises frame header information, target coding information, control information of target BWE information and target BWE information which are sequentially arranged;
the data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
During encoding, for an input channel signal, target signal and metadata, a channel core encoder encodes the channel signal, a target encoder encodes the target signal, and a metadata encoder encodes the metadata; the resulting channel code stream, target code stream and metadata code stream are combined into a 3D audio code stream, realizing efficient encoding of the 3D audio code stream. During decoding, the input 3D audio code stream is split into a channel code stream, a target code stream and a metadata code stream; the channel code stream is decoded by a channel core decoder, the target code stream by a target decoder and the metadata code stream by a metadata decoder; the channel signal and the target signal are then rendered according to the metadata, realizing efficient decoding of the 3D audio code stream.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart illustrating a 3D audio encoding method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a 3D audio encoding method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a channel core encoder in a 3D audio coding method according to an embodiment of the present invention;
fig. 4 is a schematic drawing illustrating the reconstruction of high frequency details in a 3D audio encoding method according to an embodiment of the present invention;
fig. 5 is a schematic drawing illustrating another method for reconstructing high frequency details in a 3D audio encoding method according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a first template in a template shape library in a 3D audio encoding method according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a second template in the template shape library in the 3D audio encoding method according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a third template in the template shape library in the 3D audio encoding method according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a fourth template in the template shape library in the 3D audio encoding method according to an embodiment of the present invention;
fig. 10 is a schematic diagram of a fifth template in the template shape library in the 3D audio encoding method according to an embodiment of the present invention;
fig. 11 is a diagram illustrating a sixth template in the template shape library in the 3D audio encoding method according to an embodiment of the present invention;
fig. 12 is a schematic diagram of a seventh template in the template shape library in the 3D audio encoding method according to the embodiment of the present invention;
fig. 13 is a schematic diagram of an eighth template in the template shape library in the 3D audio encoding method according to an embodiment of the present invention;
fig. 14 is a schematic diagram of a data structure of a channel code stream in a 3D audio encoding method according to an embodiment of the present invention;
fig. 15 is a schematic diagram of another data structure of a channel code stream in a 3D audio encoding method according to an embodiment of the present invention;
fig. 16 is a schematic diagram of a data structure of a 3D audio code stream in a 3D audio encoding method according to an embodiment of the present invention;
fig. 17 is a schematic diagram of another data structure of a 3D audio code stream in a 3D audio coding method according to an embodiment of the present invention;
fig. 18 is a schematic diagram of a data structure of a target code stream in a 3D audio encoding method according to an embodiment of the present invention;
fig. 19 is a schematic diagram of another data structure of a target code stream in a 3D audio coding method according to an embodiment of the present invention;
fig. 20 is a schematic diagram illustrating a data structure of a metadata stream in a 3D audio encoding method according to an embodiment of the present invention;
fig. 21 is a detailed schematic diagram of a 3D audio encoding method according to an embodiment of the present invention;
fig. 22 is a diagram illustrating a specific operation of a channel core encoder in a 3D audio coding method according to an embodiment of the present invention;
fig. 23 is a flowchart illustrating a 3D audio decoding method according to a second embodiment of the invention;
fig. 24 is a schematic diagram of a 3D audio decoding method according to a second embodiment of the present invention;
fig. 25 is a schematic diagram illustrating the operation of a channel core decoder in the 3D audio decoding method according to the second embodiment of the present invention;
fig. 26 is a specific schematic diagram of a 3D audio decoding method according to a second embodiment of the present invention;
fig. 27 is a diagram illustrating a specific operation of a channel core decoder in a 3D audio decoding method according to a second embodiment of the present invention;
fig. 28 is a schematic structural diagram of a 3D audio encoding apparatus according to a third embodiment of the present invention;
fig. 29 is a schematic structural diagram of a 3D audio decoding apparatus according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Example one
An embodiment of the present invention provides a 3D audio encoding method, and referring to fig. 1, the method includes the following steps:
s110, inputting a channel signal, a target signal and metadata;
s120, coding the sound channel signal through a sound channel core coder to obtain a sound channel code stream;
s130, encoding the target signal through a target encoder to obtain a target code stream;
s140, encoding the metadata through a metadata encoder to obtain a metadata code stream;
s150, packing the sound channel code stream, the target code stream and the metadata code stream in a frame format according to a 3D audio data structure, and outputting the 3D audio code stream.
It should be noted that the input of 3D audio coding includes conventional channel signals, target signals (also called object audio signals) and related metadata. Metadata refers to parameters describing the channel signals and target signals, such as the spatial position, presence, motion trajectory, type and loudness of a target signal. As shown in fig. 2, the channel signal (such as stereo, 5.1, 7.1, 10.1 or 22.2) is compressed by the channel core encoder into a channel code stream, the metadata forms a metadata code stream through the metadata encoder, and the target signal generates a target code stream through the target encoder; finally, the three code streams are combined into the final 3D audio code stream.
The step S120 specifically includes:
s121, dividing the input sound channel signal into an LFE sound channel signal, an independent sound channel signal and a sound channel pair signal;
s122, carrying out 2-time down-sampling on the LFE sound channel signal, and compressing by adopting sensory audio coding to obtain an LFE sound channel code stream;
s123, encoding the independent sound channel signal to obtain an independent sound channel code stream;
s124, coding the sound channel pair signal to obtain a sound channel pair code stream;
s125, packing the LFE sound channel code stream, the independent sound channel code stream and the sound channel pair code stream according to a sound channel coding data structure in a frame format, and outputting the sound channel code stream.
Note that, as shown in fig. 3, the channel signals comprise multi-channel audio signals, that is, LFE (Low Frequency Enhancement) channel signals entering the LFE channel, independent channel signals entering independent channels, and channel pair signals entering channel pairs. The LFE channel signal is first downsampled by a factor of 2, then compressed directly with a perceptual audio coder, and the LFE channel code stream is output. The independent channel signals and channel pair signals are coded differently according to certain parameters, such as the coding rate requirement (or sound quality requirement).
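The 2x downsampling of the LFE channel in S122 amounts to keeping every second sample. A real coder would apply an anti-aliasing low-pass filter first; for LFE content, which is already band-limited to low frequencies, the aggressive rate reduction is what makes the subsequent perceptual coding cheap:

```python
def downsample_2x(samples):
    """Factor-2 decimation of the LFE channel (S122). Sketch only:
    assumes the signal is already band-limited, so no explicit
    anti-aliasing filter is applied here."""
    return samples[::2]
```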
In a preferred embodiment, the step S123 specifically includes:
performing waveform coding on the low-frequency part of the independent channel signal, and waveform-parameter hybrid coding on the high-frequency part, to obtain an independent channel code stream;
the step S124 specifically includes:
performing waveform coding on the low-frequency part of the channel pair signal, and waveform-parameter hybrid coding on the high-frequency part, to obtain a channel pair code stream.
In this embodiment, the independent channel signal and channel pair signal are encoded as follows:
(1) The 2048 input PCM sample points are fed to a 32-band CQMF analysis module and output as 32 subbands, each subband represented by 64 CQMF sample points:
x[k][n], k = 0, 1, ..., 31, n = 0, 1, ..., 63
(2) According to the coding bit rate and other information, x[k][n] is divided into a low-frequency part LF-CQMF, denoted x_lf[k][n], and a high-frequency part HF-CQMF, denoted x_hf[k][n], where:
x_lf[k][n], k = 0, 1, ..., K-1, n = 0, 1, ..., 63
x_hf[k][n], k = K, K+1, ..., 31, n = 0, 1, ..., 63
The choice of K depends on information such as the coding bit rate: when the bit rate is high, K may be larger; when it is low, K may be smaller.
(3) x_lf[k][n] is input to the LF-CQMF synthesis module, which outputs a low-frequency time-domain signal.
(4) x_hf[k][n] is first modulated down to low frequency and then input to the HF-CQMF synthesis module, which outputs a high-frequency time-domain signal.
(5) The low-frequency time-domain signal is input to a low-frequency coding module to obtain a low-frequency code stream. The low-frequency coding module may use any existing waveform coding method, such as DRA, AAC or MP3.
(6) The high-frequency time-domain signal is input to a high-frequency coding module to obtain a high-frequency code stream. The high-frequency coding module may use any existing waveform-parameter coding method, such as HILN, MELP, ACELP or TCX hybrid coding.
(7) The low-frequency code stream and the high-frequency code stream are multiplexed.
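Step (2)'s split of the 32 CQMF subbands at index K can be sketched directly. The specific bitrate-to-K mapping below is an assumption for illustration; the text only says K grows with the coding bit rate:

```python
def split_lf_hf(subbands, bitrate_kbps):
    """Split 32 CQMF subbands into LF (k < K) and HF (k >= K), per
    step (2). The bitrate thresholds choosing K are illustrative
    assumptions, not values from the patent."""
    K = 8 if bitrate_kbps < 64 else 16 if bitrate_kbps < 128 else 24
    return subbands[:K], subbands[K:]
```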
In another preferred embodiment, the step S123 specifically includes:
s131, obtaining the coding rate requirement of the independent channel signal, if the coding rate requirement is high, executing the step S132, and if the coding rate requirement is low or medium, executing the step S133;
s132, carrying out sensory audio coding on the independent sound channel signal to obtain the independent sound channel code stream;
s133, performing bandwidth extension coding on a high-frequency part in the independent channel signal to obtain a bandwidth extension parameter and high-frequency coding information; carrying out sensory audio coding on a low-frequency part in the independent sound channel signal to obtain low-frequency coding information; and taking the bandwidth extension parameter, the high-frequency coding information and the low-frequency coding information as the independent sound channel code stream.
It should be noted that, when encoding an independent channel signal, as shown in fig. 3, whether to enable the bandwidth extension coding function in the channel core encoder is determined according to certain parameters, such as the coding rate requirement (or sound quality requirement). Generally, when the coding rate requirement is high, the independent channel signal is compressed directly with perceptual audio coding, and the independent channel code stream is output. When the coding rate requirement is low, bandwidth extension is enabled: the high-frequency part of the independent channel signal is first bandwidth-extension coded to obtain bandwidth extension parameters and high-frequency coding information, then the low-frequency part is perceptually coded to obtain low-frequency coding information, and the bandwidth extension parameters, high-frequency coding information and low-frequency coding information are output as the independent channel code stream.
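The rate-dependent tool selection for an independent channel (S131–S133) reduces to a two-way branch. The tool names below are descriptive stand-ins, not real codec APIs:

```python
def independent_channel_plan(rate_req: str) -> list:
    """Coding-tool selection for an independent channel (S131-S133).
    Rate labels mirror the text; tool names are descriptive only."""
    if rate_req == "high":
        # high rate: full-band perceptual coding, no BWE
        return ["perceptual_coding(full_band)"]
    # low or medium rate: BWE on the high band, perceptual coding below
    return ["bwe_coding(high_band)", "perceptual_coding(low_band)"]
```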
Further, the step S124 specifically includes:
s141, judging whether the channel pair signal has correlation with other channel pair signals; if yes, performing decorrelation processing on the signal of the channel with correlation, and executing step S142, otherwise, executing step S142;
s142, acquiring the coding rate requirement of the sound channel on the signal, if the coding rate requirement is low, executing the step S143, if the coding rate requirement is medium, executing the step S144, and if the coding rate requirement is high, executing the step S145;
s143, performing parameter stereo coding on the channel pair signal to obtain stereo parameters and a downmixed single-channel signal; performing bandwidth extension coding on a high-frequency part in the single sound channel signal to obtain bandwidth extension parameters and high-frequency coding information; carrying out perceptual audio coding on a low-frequency part in the single-sound-channel signal to obtain low-frequency coding information; taking the stereo parameters, the bandwidth expansion parameters, the high-frequency coding information and the low-frequency coding information as the sound channel pairing code stream;
s144, performing bandwidth extension coding on a high-frequency part in the sound channel pair signal to obtain a bandwidth extension parameter and high-frequency coding information; carrying out perceptual audio coding on a low-frequency part in the signal of the sound channel to obtain low-frequency coding information; taking the bandwidth extension parameter, the high-frequency coding information and the low-frequency coding information as the sound channel pairing code stream;
s145, carrying out sensory audio coding on the sound channel pair signal to obtain the sound channel pair code stream.
It should be noted that, when encoding a channel pair signal (stereo signal), as shown in fig. 3, it is first determined whether the channel pair can form a 4-channel group (or a larger channel group) with other channel pairs, that is, the correlation between channel pairs is determined. If it can, the multi-channel decorrelation function in the channel core encoder is enabled and multi-channel decorrelation processing is applied to the 4-channel (or larger) group to reduce inter-channel correlation; after processing there are still 4 (or more) channels, still organized as channel pairs, and control information is output at the same time. Otherwise, the multi-channel decorrelation function is not enabled.
The channel signal is coded differently according to certain parameters, such as coding rate requirements. If the coding code rate requirement is very low, a parameter stereo coding function and a channel pair broadband extension coding function in a channel core coder are started, the channel pair signal is subjected to parameter stereo coding to obtain a downmixed single-channel signal and output stereo parameters, then the high-frequency part in the downmixed single-channel signal is subjected to bandwidth extension coding to obtain high-frequency coding information and output bandwidth extension parameters, then the low-frequency part in the downmixed single-channel signal is subjected to certain perceptual audio coding to obtain low-frequency coding information, the high-frequency coding information and the low-frequency coding information are output, and the high-frequency coding information and the low-frequency coding information are used as a channel pair code stream together with the output stereo parameters and the bandwidth extension parameters.
If the required coding rate is medium, the parametric stereo coding function is disabled and the channel pair bandwidth extension coding function is enabled: bandwidth extension coding is applied to the high-frequency part of the channel pair signal, yielding high-frequency coding information and output bandwidth extension parameters; a perceptual audio coding is applied to the low-frequency part of the channel pair signal, yielding low-frequency coding information; and the high-frequency and low-frequency coding information, together with the output bandwidth extension parameters, is taken as the channel pair code stream.
If the required coding rate is high (or high sound quality is required), both the parametric stereo coding function and the channel pair bandwidth extension coding function are disabled, a perceptual audio coding is applied directly to the channel pair signal, and the channel pair code stream is output.
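The three rate regimes above can be summarized as a small tool-selection sketch. The kbps thresholds here are illustrative assumptions; the patent only speaks of "very low", "medium" and "higher" rate requirements:

```python
def select_coding_tools(rate_kbps, low_thresh=64, high_thresh=192):
    """Decide which channel-pair coding tools to enable for a given
    coding rate. Thresholds are hypothetical, not values from the patent."""
    if rate_kbps < low_thresh:
        # very low rate: parametric stereo + bandwidth extension
        return {"parametric_stereo": True, "bwe": True}
    if rate_kbps < high_thresh:
        # medium rate: bandwidth extension only
        return {"parametric_stereo": False, "bwe": True}
    # high rate (or high sound quality): direct perceptual coding
    return {"parametric_stereo": False, "bwe": False}
```

The decoder-side detection in steps S241-S244 mirrors this choice: the presence or absence of stereo and bandwidth extension parameters in the code stream tells the decoder which regime was used.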
Further, the method for generating the mid-to-high-frequency sinusoidal signal in bandwidth extension coding comprises the following steps:
performing complex quadrature mirror filter (CQMF) analysis filtering on the input mono audio signal to obtain a number of subband signals of equal bandwidth;
performing complex linear prediction (CLPC) analysis filtering on each subband signal to obtain the residual signal of each subband and the prediction coefficients, establishing in turn the correspondence between all high-frequency subband residual signals and low-frequency subband residual signals, and encoding and outputting the subband residual copy parameters;
quantizing, encoding and outputting the prediction coefficients.
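As a rough illustration of the per-subband prediction step, the sketch below computes low-order linear-prediction coefficients for one real-valued subband by solving the autocorrelation normal equations and returns the residual. The patent's CLPC operates on complex subband signals, so this is a simplified stand-in:

```python
import numpy as np

def lpc_residual(x, order=2):
    """Solve the autocorrelation normal equations for `order`
    prediction coefficients and return (coeffs, residual signal)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation r[0], r[1], ...
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])             # predictor coefficients
    pred = np.zeros_like(x)
    for n in range(order, len(x)):
        # predicted sample from the `order` previous samples
        pred[n] = np.dot(a, x[n - order:n][::-1])
    return a, x - pred
```

On a strongly predictable (e.g. near-sinusoidal or autoregressive) subband, the residual energy is far below the signal energy, which is why transmitting the coefficients plus a copied residual preserves the high-frequency envelope cheaply.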
Further, the sequentially completing the corresponding relationship between all the high-frequency subband residual signals and the low-frequency subband residual signals, and encoding and outputting the subband residual copy parameters specifically includes:
analyzing each high-frequency subband residual signal, selecting the best-matching low-frequency subband from the low-frequency subband residual signals, and encoding and outputting the subband numbers of all the low-frequency subbands so obtained;
or, for each continuous group of high-frequency subband residual signals, selecting an optimal continuous group of low-frequency subbands from the low-frequency subband residual signals, and encoding and outputting the start and end subband numbers of each group of low-frequency subbands so obtained.
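A minimal sketch of the per-band matching in the first variant. Normalized cross-correlation is assumed as the similarity measure; the patent does not fix the criterion for "best":

```python
import numpy as np

def best_low_band(hf_residual, lf_residuals):
    """Return the number (index) of the low-frequency subband residual
    most similar to one high-frequency subband residual."""
    def score(lf):
        num = abs(np.vdot(lf, hf_residual))                 # correlation magnitude
        den = np.linalg.norm(hf_residual) * np.linalg.norm(lf) + 1e-12
        return num / den
    return max(range(len(lf_residuals)), key=lambda i: score(lf_residuals[i]))
```

The selected subband numbers are what gets encoded as the subband residual copy parameters; the decoder uses them to copy the matching low-frequency residuals into the high band.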
It should be noted that the method of this embodiment for generating high-frequency sinusoidal signals in bandwidth extension coding performs CLPC analysis on the high-frequency subbands and transmits the prediction coefficients, thereby ensuring the accuracy of the high-frequency envelope and improving the sound quality of the high-frequency part of the audio signal.
Further, the method for generating high frequency details in bandwidth extension coding comprises:
determining, in the input mono audio signal, the bandwidth of the low-frequency part to be copied and the bandwidth of the high-frequency part to be reconstructed at decoding; if the bandwidth of the reconstructed high-frequency part is larger than the bandwidth of the low-frequency part to be copied, or the high-frequency part contains sinusoidal signals, taking the ratio of the bandwidth of the reconstructed high-frequency part to the bandwidth of the low-frequency part to be copied as the stretching factor and outputting it;
performing time-frequency grid division according to transient characteristics of an input single-channel audio signal, calculating the spectrum envelope of each grid, finding out the shape most similar to the spectrum envelope from a preset template shape library, and encoding and outputting the label of the shape in the template shape library.
It should be noted that, in general, the high-frequency detail spectral coefficients are generated by copying from the low-frequency part, then performing filtering or spectral envelope shape adjustment, and finally performing gain adjustment (reconstructing the total energy of the high-frequency part). The bandwidth (or number of spectral lines) of the low frequency part to be copied is usually the same as the bandwidth (or number of spectral lines) of the high frequency part details of the target to be replaced.
However, when the audio coding rate is low, the bandwidth of the low-frequency part coded by the core encoder (usually with perceptual audio coding such as AAC or DRA) is narrow, while the high-frequency part to be coded by the bandwidth extension (BWE) technique is wide; the low-frequency part may then have to be copied two or more times in succession, and the details of the reconstructed high-frequency spectral coefficients usually deviate considerably from those of the original high-frequency spectral coefficients, degrading the high-frequency reconstruction and ultimately the overall subjective sound quality.
A strongly harmonic audio signal contains, in addition to the fundamental, rich higher harmonic components (overtones), which make the whole signal sound full, smooth and bright (timbre). For BWE coding and decoding of such signals, the high-frequency part contains a large number of sinusoidal components; coding them as independent sinusoids requires a large amount of coding information, which cannot be afforded at low bit rates. How the low frequencies are copied to the high frequencies to reconstruct the high-frequency details is therefore important: simple copying usually cannot guarantee that the fundamental and low harmonics in the low-frequency spectral lines exactly replace the higher harmonics of the high-frequency part of the original signal, so the timbre changes and high-frequency distortion results.
Therefore, in order to avoid degrading the high-frequency reconstruction at decoding, a stretching factor α = BW_H / BW_L is defined at encoding, where BW_L is the bandwidth of the low-frequency part to be copied and BW_H is the bandwidth of the reconstructed high-frequency part. At decoding, i.e. when reconstructing the high-frequency details, if the high-frequency part is wide, the details of the high-frequency spectral coefficients can be obtained by a single copy-and-stretch operation, as shown in fig. 4. For strongly harmonic audio signals, since the higher harmonics usually lie at whole multiples of the fundamental frequency and of the low harmonics of the low-frequency part, copying the selected low-frequency part to the high frequencies and stretching by the factor α makes the copied fundamental (when present) and low harmonics fall exactly on (or near) the higher harmonics, as shown in fig. 5. The main higher harmonics of the high-frequency part are thus preserved without coding many independent sinusoids, giving a good high-frequency reconstruction and reducing high-frequency distortion at low bit rates. The stretching of the spectral bandwidth (or spectral coefficients) can be implemented by frequency-domain interpolation or by resampling by a factor of α.
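One possible realization of the copy-and-stretch step, using frequency-domain linear interpolation (the patent also allows resampling by α). Function and parameter names are illustrative:

```python
import numpy as np

def copy_and_stretch(low_coeffs, alpha):
    """Stretch copied low-frequency spectral coefficients by the factor
    alpha = BW_H / BW_L, so that len(output) ~= alpha * len(input)."""
    low_coeffs = np.asarray(low_coeffs, float)
    n_out = int(round(len(low_coeffs) * alpha))
    # positions in the source spectrum for each output spectral line
    src = np.linspace(0.0, len(low_coeffs) - 1, n_out)
    return np.interp(src, np.arange(len(low_coeffs)), low_coeffs)
```

Stretching by α = 1.5, for instance, turns 8 copied low-frequency lines into 12 reconstructed high-frequency lines, so a harmonic at source bin k lands near bin α·k, which is how the copied low harmonics align with the original higher harmonics.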
In addition, among bandwidth extension codecs, the SBR technique reconstructs the high-frequency signal details by copying the low-frequency part, which is then shaped by simple 2nd-order filtering; since the content of the high-frequency part being replaced is not considered, the envelope shape of the high-frequency details obtained this way is either the same as that of the low-frequency part or, after filtering, close to the flat spectrum of white noise. AMR-WB+ obtains the spectral envelope of the high-frequency part by LPC (linear prediction) of the high-frequency part, but computing the LPC adds computational complexity, and coding the prediction coefficients consumes extra bit rate (since BWE is generally applied to low-bit-rate audio coding, the bits spent on LPC coefficients may starve the low-frequency part and cause excessive low-frequency quantization distortion, degrading the overall subjective sound quality).
Therefore, this embodiment provides a universal high-frequency spectral envelope template shape library to model the spectral envelope of the high-frequency part, which yields a more accurate spectral envelope than simply moving (copying) the low-frequency part to obtain the high-frequency details. At low bit rates, the high-frequency spectral envelope can be described with less information than with the LPC method; at the same time, as the bit rate increases, a larger template shape library can provide high-frequency envelope restoration comparable to or better than LPC.
Specifically, during encoding, time-frequency grid division is performed according to transient characteristics of signals, then the spectrum envelope of each grid is calculated, a shape most similar to the spectrum envelope is found in a template shape library, and a label of the shape in the template shape library is encoded into an envelope parameter.
The high-frequency spectral envelope template shape library can be constructed by applying several algorithms to the divided time-frequency grids, for example: (1) construction from simple geometric figures, (2) fitting (linear or other methods) to the envelope of the high-frequency part, (3) vector quantization, or (4) LPC prediction filtering to obtain the envelope. Then N conventional spectral envelope shapes (N an integer power of 2, i.e. N = 2^M with M an integer) are obtained by statistical classification, and the shapes are labelled for easy retrieval and encoded transmission. The template shape library can also be designed hierarchically, with deeper layers giving finer spectral envelopes, so that different audio coding rates can describe the high-frequency spectral envelope of the current frame using different layers, achieving optimal rate-adaptive high-frequency envelope restoration. A simple embodiment constructs the template shape library from geometric figures: it contains 8 templates, which can be coded with 3 bits, as shown in figs. 6-13. The 8 templates can also be divided into 2 layers, the first layer containing 3 templates (one line segment) and the second layer 5 templates (two line segments); the first layer represents the high-frequency spectral envelope coarsely, while the second layer gives a finer high-frequency spectral envelope shape.
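Selecting a template during encoding reduces to a nearest-shape search; the sketch below uses mean-squared error, and with N = 2^M templates the chosen label codes in M bits. The template shapes here are toy stand-ins, not the geometric library of figs. 6-13:

```python
import numpy as np

def nearest_template_label(envelope, templates):
    """Return the label of the template shape closest (in MSE) to the
    measured grid envelope."""
    envelope = np.asarray(envelope, float)
    errors = [np.mean((envelope - np.asarray(t, float)) ** 2) for t in templates]
    return int(np.argmin(errors))

# toy 4-point template shapes: flat, rising, falling (illustrative only)
TEMPLATES = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.33, 0.66, 1.0],
    [1.0, 0.66, 0.33, 0.0],
]
```

In the hierarchical variant, the same search would be run first over the coarse layer and then, bit budget permitting, refined within the finer layer.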
Further, the step S125 specifically includes:
packing the LFE channel code stream, the independent channel code stream and the channel pair code stream in a frame format according to the channel coding data structure, and outputting the channel code stream;
the data structure of the channel code stream comprises, arranged in sequence: frame header information, middle-layer channel coding information, control information of the middle-layer channel BWE information, middle-layer channel BWE information, other-layer channel coding information, control information of the other-layer channel BWE information, and other-layer channel BWE information; or, the data structure of the channel code stream comprises, arranged in sequence: frame header information, channel coding information, control information of the channel BWE information, and channel BWE information.
It should be noted that there are two structures for the channel code stream. The first, shown in fig. 14, places the middle-layer channel coding information and the middle-layer BWE (Bandwidth Extension) information first, followed by the other-layer (upper-layer and bottom-layer) channel coding information and the other-layer BWE information. This structure can be made compatible with conventional 2D audio coding data structures; for example, when the channel signal in 3D audio is 5.1.4, the coding of the middle-layer 5.1 comes first and can be compatible with conventional 5.1 coding, i.e. a conventional 2D audio decoder can decode the 5.1 channels. Note that in this structure the adaptive multi-channel decorrelation function cannot be enabled, otherwise the compatibility would be broken. The second structure, shown in fig. 15, places the channel coding information first and the channel BWE information after it.
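The two layouts can be written out as ordered field lists; the field names below are descriptive labels, not identifiers from the patent:

```python
# Layout 1 (fig. 14): 2D-compatible, middle layer first
LAYOUT_COMPATIBLE = (
    "frame_header",
    "mid_layer_channel_coding", "mid_layer_bwe_control", "mid_layer_bwe",
    "other_layer_channel_coding", "other_layer_bwe_control", "other_layer_bwe",
)

# Layout 2 (fig. 15): plain, channel coding first
LAYOUT_PLAIN = ("frame_header", "channel_coding", "bwe_control", "bwe")
```

Keeping the middle-layer fields contiguous at the front of Layout 1 is what lets a legacy 2D decoder stop parsing after the middle-layer section.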
Further, the step S130 specifically includes:
detecting whether an input target signal needs to be encoded with reference to the associated metadata;
if yes, when the related metadata indicate that the target signal of the frame has a signal, coding the target signal as an independent sound channel signal in the sound channel signal by adopting a sound channel core coding algorithm to obtain the target code stream;
if not, coding the target signal as an independent sound channel signal in the sound channel signal by adopting a sound channel core coding algorithm to obtain the target code stream.
As shown in fig. 2, a target signal that does not require metadata input is encoded by directly using a target encoder. At this time, the target encoder directly adopts the channel core encoding algorithm for encoding, and the encoding method is the same as the method for the channel core encoder to encode the independent channel signal in the channel signal, and is not described in detail herein.
When the target signal needs the associated metadata as joint input to the target encoder, the target encoder can implement the encoding by modifying the channel core encoding algorithm. For example, the metadata may indicate the presence of the target signal (by a time-parameter description, or by 1 bit per frame, where '1' indicates the frame carries a signal and '0' indicates the target signal of the frame is muted); when the frame carries a signal, the channel core coding of an independent channel signal is applied, otherwise no coding is performed.
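The 1-bit-per-frame gating described above can be sketched as follows; `core_encode` is a placeholder for the channel core encoder, not an API from the patent:

```python
def encode_gated_target(frames, presence_bits):
    """Encode a target-signal frame only when its metadata presence
    bit is 1 ('frame has a signal'); muted frames produce no payload."""
    core_encode = lambda frame: bytes(frame)   # placeholder core encoder
    return [core_encode(f) if bit == 1 else b""
            for f, bit in zip(frames, presence_bits)]
```

Skipping muted frames entirely is what saves rate here: the 1-bit flag replaces a whole frame of coded data whenever the target is silent.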
In addition, when there is direct correlation between multiple target signals, the multiple target signals may be grouped into a group, the group of target signals is first decorrelated, and then the processed signals are compressed and encoded as channel signals by using a channel core encoding method.
Further, the step S140 specifically includes:
when the input metadata is represented by floating points, quantization with different precisions is carried out according to the coding rate requirement of the metadata part, and entropy coding is carried out on quantized integer parameters to obtain the metadata code stream.
It should be noted that, when the input metadata signal is represented by a floating point, such as the spatial position of a target signal, quantization with different precision needs to be performed according to the code rate requirement of the metadata portion, and then entropy coding is performed on the quantized integer parameters to remove redundant information, where the entropy coding includes Huffman coding, arithmetic coding, and the like.
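A hedged sketch of the quantize-then-entropy-code path for one floating-point metadata field. An azimuth in degrees is assumed as the example; the range and bit depth are illustrative, and the resulting integers would then be Huffman or arithmetic coded:

```python
import numpy as np

def quantize_azimuth(values, bits):
    """Uniformly quantize azimuth values in [-180, 180] degrees to
    `bits`-bit integer indices. Returns (indices, dequantized values)."""
    lo, hi = -180.0, 180.0
    levels = (1 << bits) - 1
    v = np.clip(np.asarray(values, float), lo, hi)
    idx = np.round((v - lo) / (hi - lo) * levels).astype(int)
    deq = lo + idx / levels * (hi - lo)      # decoder-side reconstruction
    return idx, deq
```

More bits (finer precision) are spent when the metadata rate budget allows, fewer when it does not, which is the "quantization with different precisions" the text refers to.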
Further, in the step S150, the 3D audio data structure includes frame header information, channel coding information, target coding information, and metadata coding information arranged in sequence; or, the 3D audio data structure includes frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal, which are sequentially arranged.
It should be noted that the 3D audio code stream has two data structures. One structure, shown in fig. 16, comprises frame header information containing the basic information of the whole 3D audio (part of the basic information of the target audio may instead be moved to the frame header of the target coding information), followed by the channel signal coding information, then the target signal coding information, and finally the metadata coding information. The other structure, shown in fig. 17, splits the metadata into two parts placed after the channel signal coding information and the target signal coding information respectively; this makes the whole data structure clearer, but adds a few bytes of redundancy.
In addition, the data structure of the target code stream includes frame header information, target coding information, control information of target BWE information, and target BWE information, which are sequentially arranged.
It should be noted that when the input target signal contains only a single target, the data structure of single-target coding is as shown in fig. 18, where the frame header information contains the basic information of the target signal. When the bit rate is low, the BWE of the single target is enabled: the coded-data part of the single target signal contains the compressed information of the low-frequency part of the current target signal, the single-target BWE part contains the parameter information of the high-frequency part, and the auxiliary information between them gives the control information of the single-target BWE. When the bit rate is high, only the frame header information and the coded data of the single target signal are present (in this case the full band of the single target is coded).
When the input target signal contains multiple targets, the data structure of multi-target coding is similar to that of single-target coding, as shown in fig. 19; the frame header information contains the basic information of the frame's target signals. When the bit rate is high, no BWE information is present, and the core coding information of the multiple target signals following the frame header contains the full-band coding of the multiple targets; this may be the sequential arrangement of the information of each target coded separately, or the sequential arrangement of the information of related target signals jointly coded as a whole followed by the information of the other targets coded individually. If the bit rate is low, BWE coding is enabled: the core coding information of the multiple target signals contains only the low-frequency compressed information of the target signals, the high-frequency part follows it after the high-frequency BWE coding of each target, and the control information between the parts indicates the type, length, etc. of the BWE part.
The data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged.
It should be noted that, as shown in fig. 20, the data structure of the metadata code stream starts with the metadata control information, which describes the type and length, followed by the metadata coding information.
The following describes the 3D audio coding method provided by the embodiment of the present invention in detail by taking DRA-3D audio coding as an example.
As shown in fig. 21, a channel signal, a target signal and metadata are input, where the input channel signal is compressed by a DRA + V2 core encoder in a DRA-3D encoder to generate a channel code stream; compressing the target signal through a DRA + V2 target encoder in the DRA-3D encoder to form a target code stream; and the metadata is compressed into a metadata code stream through a DRA + V2 metadata encoder, and finally, the information of the three code streams is packaged into a DRA-3D code stream through a DRA-3D multiplexer.
As shown in fig. 22, the specific steps of the DRA + V2 core encoder for encoding the channel signal are as follows:
dividing an input channel signal into an LFE (low frequency effect channel) channel, an independent mono channel, and a channel pair;
the LFE channel is first downsampled by a factor of 2, then DRA coded, and the LFE channel coding information is output;
for an independent channel, whether to enable the bandwidth extension coding function is determined according to parameters such as the required coding rate; if the rate is high, it is not enabled, DRA coding is performed directly, and the channel coding information is output; if the rate is low, the bandwidth extension coding function is enabled, the high-frequency part of the channel is coded with NELA-BWE and the low-frequency part with DRA, and the low- and high-frequency coding information is output;
for the channel pairs, NELA adaptive multi-channel decorrelation processing is first performed on all input stereo (channel pair) signals, and the processed channel pairs are output together with the adaptive multi-channel processing parameters; MCR (Maximum Correlation Rotation) parametric stereo coding is then performed on the channel pairs (if the MCR coding function is enabled), outputting the MCR parameter information and the downmixed channels; the high-frequency part of each downmixed channel is coded with NELA-BWE and the low-frequency part with DRA, and the low- and high-frequency coding information is output;
the various parameters and coding information output in the above steps are packed in a data structure (as shown in fig. 15) of the 3D audio channel coding.
In addition, the DRA + V2 target encoder encodes each target signal directly using a DRA + V2 channel encoder, and the DRA + V2 metadata encoder entropy encodes using Huffman. And finally, packing the sound channel code stream, the target code stream and the metadata code stream in a frame format according to a 3D audio data structure (as shown in FIG. 16), and outputting the 3D audio code stream.
A more specific example is the case of 5.1.4 (5.1 channels in the middle layer and 4 upper-layer channels) plus 4 target audio signals encoded at a total rate of 384 kbps; the encoding procedure is as follows:
(1) first, rate allocation is performed: the 4 target audio signals are given 24 kbps each (4 × 24 kbps = 96 kbps); the metadata is given 12 kbps; the 5.1.4 channel signals are given 276 kbps;
(2) the 276 kbps for the 5.1.4 channel signals can be allocated in three ways:
a) fixed rate allocation: according to the total rate of the channel signals, each channel's rate is the total rate multiplied by a coefficient, with all coefficients summing to 1;
b) adaptive rate allocation: the total rate is allocated adaptively according to the masking threshold calculated by each channel's psychoacoustic model, channels with complex signals receiving more rate;
c) hybrid rate allocation: on the basis of adaptive allocation, different channels are given different weighting coefficients; e.g. the center channel C is generally considered more important than LS & RS, the middle-layer L & R more important than the upper-layer channels TopL & TopR, etc.;
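The fixed scheme (a) can be sketched directly; the channel names and coefficients below are illustrative, not values from the patent:

```python
def allocate_fixed(total_kbps, coefficients):
    """Fixed rate allocation: each channel receives total * coefficient,
    with the coefficients summing to 1."""
    assert abs(sum(coefficients.values()) - 1.0) < 1e-9
    return {ch: total_kbps * c for ch, c in coefficients.items()}
```

For example, `allocate_fixed(276, {...})` would split the 276 kbps channel budget of the example above among the 5.1.4 channels according to the chosen coefficients.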
(3) the 5(L, C, R, LS, RS).1(LFE).4(TopL, TopR, TopLS, TopRS) channel signals are grouped;
(4) the '.1' (LFE) channel is DRA coded as a single low-frequency channel;
(5) the C channel (center channel of the middle layer) is treated as an independent full-band channel: its high-frequency part is NELA-BWE coded and its low-frequency part is DRA coded;
(6) the candidate channel pairs are: L & R, LS & RS, TopL & TopR, TopLS & TopRS, L & TopL, R & TopR, LS & TopLS, RS & TopRS; when coding, the pairing with the largest correlation between its channels is selected as a channel pair; then, based on the correlation between two channel pairs, two channel pairs are combined into a 4-channel group, for example L & R and TopL & TopR as one 4-channel group, and LS & RS and TopLS & TopRS as the other;
(7) NELA adaptive multi-channel decorrelation processing is carried out on the 2 channel groups, and 4 channel pairs are output;
(8) the high-frequency part of each channel pair output in (7) is NELA-BWE coded (MCR parametric stereo coding is not enabled at a rate of 384 kbps) and the low-frequency part is DRA coded;
(9) the 4 target audio signals are each coded as independent channels, i.e. the high-frequency part of each target audio is NELA-BWE coded and the low-frequency part is DRA coded;
(10) the metadata is Huffman coded;
(11) all the coding information is multiplexed according to the frame format of fig. 16 to form the DRA-3D audio code stream.
According to the embodiment of the invention, for the input sound channel signal, the target signal and the metadata, a sound channel core encoder is adopted to encode the sound channel signal, a target encoder is adopted to encode the target signal, a metadata encoder is adopted to encode the metadata, and the encoded sound channel code stream, the target code stream and the metadata code stream are combined into the 3D audio code stream, so that the high-efficiency encoding of the 3D audio code stream is realized.
Example two
An embodiment of the present invention provides a 3D audio decoding method, and referring to fig. 23, the method includes the following steps:
S210, inputting a 3D audio code stream, and splitting the 3D audio code stream into a channel code stream, a target code stream and a metadata code stream;
S220, decoding the channel code stream through a channel core decoder to obtain a channel signal;
S230, decoding the target code stream through a target decoder to obtain a target signal;
S240, decoding the metadata code stream through a metadata decoder to obtain metadata;
S250, rendering the channel signal and the target signal according to the metadata, and outputting the rendered signal to a corresponding terminal for playback according to user interaction information;
the data structure of the 3D audio code stream comprises frame header information, sound channel coding information, target coding information and metadata coding information which are sequentially arranged; or the data structure of the 3D audio code stream comprises frame header information, sound channel coding information, metadata coding information related to the sound channel signal, target coding information and metadata coding information related to the target signal which are sequentially arranged;
the data structure of the channel code stream comprises, arranged in sequence: frame header information, middle-layer channel coding information, control information of the middle-layer channel BWE information, middle-layer channel BWE information, other-layer channel coding information, control information of the other-layer channel BWE information, and other-layer channel BWE information; or the data structure of the channel code stream comprises, arranged in sequence: frame header information, channel coding information, control information of the channel BWE information, and channel BWE information;
the data structure of the target code stream comprises frame header information, target coding information, control information of target BWE information and target BWE information which are sequentially arranged;
the data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged.
It should be noted that the 3D audio code stream is split (demultiplexed) into a sound channel signal code stream, a target code stream, and a metadata code stream. As shown in fig. 24, the channel code stream is decoded by the channel core decoder to output the channel signal, the target code stream is decoded by the target decoder (where part of the metadata may be used) to obtain the target signal, the metadata code stream is decoded by the metadata decoder to obtain the metadata, and finally the channel signal, the target signal and the related metadata are output to the speaker or the earphone for playing after being processed by the renderer/mixer according to the user interaction information.
Further, the step S220 specifically includes:
S221, splitting the channel code stream into an LFE channel code stream, an independent channel code stream and a channel pair code stream;
S222, performing perceptual audio decoding on the LFE channel code stream and upsampling by a factor of 2 to obtain an LFE channel signal;
S223, decoding the independent channel code stream to obtain an independent channel signal;
S224, decoding the channel pair code stream to obtain a channel pair signal;
S225, outputting the LFE channel signal, the independent channel signal and the channel pair signal as the channel signal.
It should be noted that the decoding of the channel code stream is divided into independent channel decoding, channel pair decoding and LFE channel decoding. In LFE channel decoding, perceptual audio decoding is performed on the LFE channel code stream to obtain the LFE low-frequency signal, which is then directly upsampled by a factor of 2 to obtain the LFE channel signal, as shown in fig. 25. Channel pair decoding decodes the channel pair code stream, and independent channel decoding decodes the independent channel code stream; the decoded LFE channel signal, independent channel signals and channel pair signals together form the multi-channel audio signal, i.e. the channel signal output.
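The LFE path ends with a 2x upsampling; a minimal sketch using linear interpolation as a stand-in for the actual interpolation filter a real decoder would use:

```python
import numpy as np

def upsample_2x(lfe_low):
    """Upsample the decoded LFE low-frequency signal by a factor of 2
    via linear interpolation (illustrative; not a proper filter)."""
    lfe_low = np.asarray(lfe_low, float)
    n = len(lfe_low)
    t_out = np.arange(2 * n) / 2.0          # output sample times
    return np.interp(t_out, np.arange(n), lfe_low)
```

Because the LFE channel carries only low frequencies, the encoder can halve its sample rate (step S258's 2x downsampling) and the decoder restores it this cheaply.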
In a preferred embodiment, the step S223 specifically includes:
carrying out waveform decoding on a low-frequency code stream in the independent sound channel code stream, and carrying out waveform parameter decoding on a high-frequency code stream in the independent sound channel code stream to obtain an independent sound channel signal;
the step S224 specifically includes:
and performing waveform decoding on the low-frequency code stream in the sound channel pair code stream, and performing waveform parameter decoding on the high-frequency code stream in the sound channel pair code stream to obtain a sound channel pair signal.
In this embodiment, the decoding process is as follows:
(1) Demultiplex the code stream into a low-frequency code stream and a high-frequency code stream.
(2) Input the low-frequency code stream into the low-frequency decoding module to obtain a low-frequency time domain signal. The decoding method corresponds to the encoding method, i.e. any waveform decoding, such as perceptual audio decoding.
(3) Input the high-frequency code stream into the high-frequency decoding module to obtain a high-frequency time domain signal. The decoding method corresponds to the encoding method, i.e. any waveform parameter decoding.
(4) Input the low-frequency time domain signal into the LF-CQMF analysis module to obtain low-frequency CQMF samples x_lf[k][n].
(5) Input the high-frequency time domain signal into the HF-CQMF analysis module, which modulates it to obtain high-frequency CQMF samples x_hf[k][n].
(6) Merge the low-frequency CQMF samples x_lf[k][n] and the high-frequency CQMF samples x_hf[k][n] into full-band CQMF samples x[k][n].
(7) Input the full-band CQMF samples x[k][n] into the CQMF synthesis module to obtain full-band time domain samples.
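Steps (4) to (7) above amount to stitching the two CQMF sample arrays together along the subband axis k before synthesis filtering. A minimal numpy sketch; the 32-band and 16-slot sizes are illustrative assumptions, not values taken from the codec:

```python
import numpy as np

def merge_cqmf_bands(x_lf, x_hf):
    """Step (6): stack low-band CQMF samples x_lf[k][n] and high-band
    CQMF samples x_hf[k][n] along the subband axis k to form the
    full-band CQMF samples x[k][n]."""
    return np.concatenate([x_lf, x_hf], axis=0)

# hypothetical layout: 32 low + 32 high subbands, 16 time slots n each
x_lf = np.zeros((32, 16), dtype=complex)
x_hf = np.ones((32, 16), dtype=complex)
x = merge_cqmf_bands(x_lf, x_hf)   # shape (64, 16), ready for CQMF synthesis
```

The merged array is then handed to the CQMF synthesis bank as in step (7).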
In another preferred embodiment, the step S223 specifically includes:
S231, detecting whether the independent channel code stream has a bandwidth extension parameter; if so, executing step S232, and if not, executing step S233;
S232, performing perceptual audio decoding on the low-frequency code stream of the independent channel code stream to obtain a low-frequency signal; performing bandwidth extension decoding on the high-frequency code stream of the independent channel code stream according to the bandwidth extension parameter to obtain a high-frequency signal; taking the low-frequency signal and the high-frequency signal as the independent channel signal;
S233, performing perceptual audio decoding on the independent channel code stream to obtain the independent channel signal;
the step S224 specifically includes:
S241, detecting whether the channel pair code stream has a stereo parameter and a bandwidth extension parameter; if both the stereo parameter and the bandwidth extension parameter exist, performing step S242; if only the bandwidth extension parameter exists, performing step S243; if neither exists, performing step S244;
S242, performing perceptual audio decoding on the low-frequency code stream of the channel pair code stream to obtain a low-frequency signal; performing bandwidth extension decoding on the high-frequency code stream of the channel pair code stream according to the bandwidth extension parameter to obtain a high-frequency signal; performing parametric stereo decoding on the high-frequency signal and the low-frequency signal according to the stereo parameter to obtain a full-band audio signal;
S243, performing perceptual audio decoding on the low-frequency code stream of the channel pair code stream to obtain a low-frequency signal; performing bandwidth extension decoding on the high-frequency code stream of the channel pair code stream according to the bandwidth extension parameter to obtain a high-frequency signal; taking the low-frequency signal and the high-frequency signal as the full-band audio signal;
S244, performing perceptual audio decoding on the channel pair code stream to obtain a full-band audio signal;
S245, detecting whether the adaptive multi-channel decoding function in the channel core decoder is enabled; if so, performing adaptive multi-channel decoding on the full-band audio signal to obtain the channel pair signal; if not, taking the full-band audio signal as the channel pair signal.
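The S241-S245 branching can be summarized as a small dispatch function. The step names below are illustrative labels for the decoding tools, not API calls from the codec:

```python
def channel_pair_decode_path(has_stereo, has_bwe, adaptive_mc):
    """Return the ordered list of decoding steps that S241-S245 select,
    given which side information is present in the channel pair code
    stream (a control-flow sketch only; the source defines exactly
    three cases for the parameter combinations)."""
    steps = []
    if has_stereo and has_bwe:             # S242: both parameter sets present
        steps += ["perceptual_decode_low", "bwe_decode_high",
                  "parametric_stereo_decode"]
    elif has_bwe:                          # S243: bandwidth extension only
        steps += ["perceptual_decode_low", "bwe_decode_high"]
    else:                                  # S244: plain perceptual decoding
        steps += ["perceptual_decode_full_band"]
    if adaptive_mc:                        # S245: optional post-processing
        steps.append("adaptive_multichannel_decode")
    return steps
```

For example, a low-bit-rate stream carrying both stereo and bandwidth extension parameters ends with parametric stereo decoding, while a high-bit-rate stream goes straight through perceptual decoding.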
It should be noted that there are two decoding modes for the independent channel code stream, as shown in fig. 25. If the independent channel code stream carries a bandwidth extension parameter, the bandwidth extension decoding function in the channel core decoder is turned on: first the low-frequency code stream of the independent channel code stream is perceptually decoded to obtain a low-frequency signal, and then the high-frequency code stream is bandwidth-extension decoded to obtain a high-frequency signal, thereby implementing independent channel decoding. If the independent channel code stream carries no bandwidth extension parameter, the bandwidth extension decoding function in the channel core decoder is turned off and the independent channel code stream is perceptually decoded directly, implementing independent channel decoding.
As shown in fig. 25, if the channel pair code stream carries both stereo parameters and bandwidth extension parameters, the bandwidth extension decoding function and the parametric stereo decoding function in the channel core decoder are turned on: perceptual audio decoding yields the downmixed low-frequency signal, bandwidth extension decoding yields the high-frequency signal, and parametric stereo decoding finally yields the full-band audio signal. If the channel pair code stream carries only bandwidth extension parameters, the bandwidth extension decoding function is turned on and the parametric stereo decoding function is turned off: perceptual audio decoding first yields the low-frequency signal, and bandwidth extension decoding then yields the high-frequency signal. If the channel pair code stream carries neither stereo parameters nor bandwidth extension parameters, both functions are turned off and perceptual audio decoding directly yields the full-band audio signal. Finally, the full-band audio signal is fed into the adaptive multi-channel decoding module: if the adaptive multi-channel decoding function is turned off, the full-band audio signal passes through the module unchanged; if it is turned on, adaptive multi-channel decoding of the full-band audio signal yields the channel pair signal.
Further, the method for generating the high-frequency harmonic signal in bandwidth extension decoding comprises the following steps:
performing complex quadrature mirror filter bank analysis filtering on the decoded low-frequency signal to obtain low-frequency subband signals;
performing complex linear predictive analysis filtering on the low-frequency subband signals to obtain low-frequency subband residual signals;
decoding and inverse-quantizing the prediction coefficients;
copying the low-frequency subband residual signals to the high-frequency subband residual signals using the decoded subband residual copy parameters, and then performing linear predictive synthesis filtering on the high-frequency subbands according to the prediction coefficients to obtain high-frequency subband detail signals; the prediction coefficients and the subband residual copy parameters are the parameters output when the high-frequency harmonic signal is generated in bandwidth extension encoding;
adjusting the high-frequency subband detail signals with the high-frequency envelope and outputting the high-frequency subband signals.
It should be noted that, in the method for generating the high-frequency harmonic signal in bandwidth extension decoding of this embodiment, the linear predictive synthesis filtering of each high-frequency subband is excited by the most suitable low-frequency residual signal among the low-frequency subband signals rather than by the high-frequency subband residual signal, so that a better high-frequency harmonic signal is obtained and the sound quality of the high-frequency part of the audio signal is improved.
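The copy-and-resynthesize idea above can be roughly illustrated with a first-order real-valued predictor. This is not the codec's actual filtering (real CQMF subbands are complex-valued, and the predictor order and coefficients are carried in the code stream); it is only a sketch of the analysis/synthesis pairing:

```python
import numpy as np

def lpc_residual(x, a):
    """Linear-prediction analysis filter (order 1 for illustration):
    e[n] = x[n] - a * x[n-1]."""
    e = np.empty_like(x)
    e[0] = x[0]
    e[1:] = x[1:] - a * x[:-1]
    return e

def lpc_synth(e, a):
    """Matching synthesis filter: y[n] = e[n] + a * y[n-1]."""
    y = np.empty_like(e)
    y[0] = e[0]
    for n in range(1, len(e)):
        y[n] = e[n] + a * y[n - 1]
    return y

a_lf, a_hf = 0.9, 0.7                    # hypothetical prediction coefficients
x_lf = np.sin(0.3 * np.arange(64))       # hypothetical low-subband signal
residual = lpc_residual(x_lf, a_lf)      # low-subband residual (whitened)
hf_detail = lpc_synth(residual, a_hf)    # excite the high-subband predictor
hf_subband = 0.5 * hf_detail             # scale by the decoded high-frequency envelope
```

Feeding the synthesis filter with its own analysis residual reconstructs the input exactly; the codec instead feeds a copied low-frequency residual into the high-frequency predictor, which is what regenerates the harmonic structure.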
Further, the method for generating high-frequency details in bandwidth extension decoding comprises:
copying and stretching the low-frequency part to be copied according to the stretching factor to obtain a high-frequency detail spectral coefficient;
finding the spectral envelope template corresponding to the shape index in a preset template shape library, and performing envelope adjustment on the high-frequency detail spectral coefficients with the spectral envelope template to obtain the high-frequency detail signal; the stretching factor and the shape index are the parameters output when the high-frequency details are generated in bandwidth extension encoding.
It should be noted that, during decoding, the corresponding spectral envelope template is first found in the template shape library according to the envelope parameter, i.e. the shape index; the low-frequency spectrum is then copied to the high-frequency part and subjected to decorrelation processing (yielding a spectrally flat signal) and normalization processing (removing the gain); finally, the spectral coefficients are envelope-adjusted with the spectral envelope template, so as to reconstruct the details of the high-frequency part of the audio signal.
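A rough numpy sketch of the copy-stretch-and-envelope chain. The nearest-neighbour stretching and the linear template below are illustrative simplifications; the codec's templates come from its preset shape library:

```python
import numpy as np

def stretch_copy(low_spectrum, factor):
    """Copy the low-frequency spectral coefficients and stretch them by
    `factor` (nearest-neighbour resampling) to cover the wider
    reconstructed high band."""
    n_out = int(round(len(low_spectrum) * factor))
    idx = np.minimum((np.arange(n_out) / factor).astype(int),
                     len(low_spectrum) - 1)
    return low_spectrum[idx]

def apply_envelope(coeffs, template):
    """Adjust the stretched coefficients with a spectral envelope
    template selected by the decoded shape index."""
    return coeffs * np.resize(template, len(coeffs))

low = np.ones(8)                                # flattened low-band spectrum
detail = stretch_copy(low, 1.5)                 # stretch factor from the code stream
template = np.linspace(1.0, 0.25, len(detail))  # hypothetical template shape
hf = apply_envelope(detail, template)           # reconstructed high-band detail
```

With a stretch factor of 1.5, eight low-band coefficients cover twelve high-band bins before the envelope shaping is applied.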
Further, the step S230 specifically includes:
detecting whether the target code stream needs to be decoded by referring to the related metadata;
if yes, when the related metadata indicates that the target audio of the frame is present, decoding the target code stream as an independent channel code stream in the channel code stream to obtain the target signal;
if not, the target code stream is used as an independent sound channel code stream in the sound channel code stream for decoding, and the target signal is obtained.
It should be noted that, as shown in fig. 24, when no metadata is needed, the simple way to decode the target code stream is to decode each audio target directly as an independent channel. When metadata is needed, the independent-channel decoding method is slightly modified using the metadata related to the target audio to complete decoding of the target code stream. For example, the metadata may indicate whether the current target audio is present: if present, the current frame is decoded with the independent-channel decoding method, which is not repeated here; if not, a frame of silence (PCM samples of value 0) is output directly.
In addition, there is a more complicated target code stream decoding case: if correlation exists between some of the target audio signals, the encoding end can exploit the correlation of these targets for compression, and the decoding end must then decode these targets jointly.
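A minimal sketch of the per-frame presence decision described above. `FRAME` (the frame length in samples) and the decoder callable are hypothetical placeholders, not names from the codec:

```python
FRAME = 1024  # hypothetical frame length in PCM samples

def decode_target_frame(frame_bits, present, channel_decode):
    """Decode one target-audio frame. When the related metadata marks
    the target as absent in this frame, output a muted frame (PCM
    value 0) instead of running the independent-channel decoder."""
    if present:
        return channel_decode(frame_bits)
    return [0] * FRAME  # one frame of silence

# toy stand-in for the independent-channel decoder, for illustration only
decoded = decode_target_frame(b"\x01", True, lambda bits: [1] * FRAME)
muted = decode_target_frame(b"", False, lambda bits: [1] * FRAME)
```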
In step S240, when metadata such as the spatial position of the target signal is represented in floating point, it should first be quantized at the encoding end, and the resulting integer metadata should be entropy encoded (e.g. Huffman coding). Correspondingly, the decoding end should decode the metadata code stream to recover each metadata parameter for target code stream decoding and for the mixer/renderer.
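A toy illustration of the quantize-then-entropy-code idea for one floating-point metadata field. The 0.5-degree step is an assumed precision set by the bit-rate budget, not a value from the codec:

```python
def quantize(value, step):
    """Uniform scalar quantization of a floating-point metadata field
    (e.g. a target azimuth in degrees) to an integer index."""
    return round(value / step)

def dequantize(index, step):
    """Decoder-side reconstruction of the metadata value."""
    return index * step

step = 0.5                     # hypothetical quantization precision
azimuth = 33.26
idx = quantize(azimuth, step)  # integer index, then entropy coded (e.g. Huffman)
```

Only the integer index enters the entropy coder; the reconstruction error is bounded by half the quantization step.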
Further, in step S250, as shown in fig. 24, the mixer/renderer takes the channel signal, the target signal and the metadata as inputs, and may additionally take user information (such as the current speaker configuration) as input. The mixer/renderer may render the channel signal and the target signal to the actually used speakers (according to a user-given or standard configuration) with an algorithm such as VBAP (Vector Base Amplitude Panning) to obtain a better 3D sound field reconstruction, or may render the channel signal and the target signal to headphones with an algorithm such as HRTF (Head Related Transfer Function) to reconstruct the 3D sound field.
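For two speakers, VBAP amplitude panning reduces to solving a 2×2 linear system for the gain pair. A sketch under assumed speaker angles (the ±30° pair is illustrative; it is not a configuration mandated by the codec):

```python
import numpy as np

def vbap_pair_gains(speaker_az, source_az):
    """2-D VBAP for one speaker pair: express the source direction p as
    a gain-weighted sum of the speaker unit vectors, p = g0*l0 + g1*l1,
    then normalize the gains to unit energy."""
    L = np.array([[np.cos(np.radians(a)), np.sin(np.radians(a))]
                  for a in speaker_az])          # speaker unit vectors (rows)
    p = np.array([np.cos(np.radians(source_az)),
                  np.sin(np.radians(source_az))])
    g = np.linalg.solve(L.T, p)                  # solve L^T g = p
    return g / np.linalg.norm(g)

g = vbap_pair_gains([-30.0, 30.0], 0.0)          # source dead centre
```

A source midway between symmetric speakers receives equal gains, as expected of constant-energy panning.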
The following describes the 3D audio decoding method provided by the embodiment of the present invention in detail by taking DRA-3D audio decoding as an example.
As shown in fig. 26, the input signal is a DRA-3D code stream. After splitting it into three code streams, the channel code stream is processed by a DRA+V2 (DRA 2.0) core decoder to obtain the channel signal; the target code stream is processed by a DRA+V2 target decoder to obtain the target signal; and the metadata code stream is processed by a DRA+V2 metadata decoder to obtain the metadata. Finally, according to the user interaction information, rendering (using standard VBAP and HRTF techniques) is performed by the DRA-3D renderer/mixer to obtain the output signals: one is played directly over loudspeakers, and the other is fed to headphones for playback.
As shown in fig. 27, the specific operating principle of the DRA + V2 core decoder is as follows:
decoding the DRA-3D sound channel code stream, and obtaining each independent sound channel code stream, sound channel pair code stream, four sound channel pair code stream and LFE sound channel code stream by splitting the sound channel information part;
DRA decoding is carried out on the LFE sound channel code stream, then 2 times of upsampling is carried out, and an LFE sound channel PCM signal is output;
DRA decoding is performed on the independent channel code stream; if the bandwidth extension function is enabled, NELA-BWE decoding is further performed on its high-frequency part, and the independent channel PCM signal is output;
DRA decoding is performed on the channel pair code stream. If the channel pair bandwidth extension function is not enabled (generally at high bit rate or high quality), the channel pair PCM signal is output directly. If the channel pair bandwidth extension function is enabled but the parametric stereo function is not (usually at medium bit rate), channel pair NELA-BWE decoding follows the DRA decoding of the code stream to obtain the channel pair PCM signal. If both the channel pair bandwidth extension function and the parametric stereo function are enabled (usually at low bit rate), DRA decoding is performed on the downmixed mono low-frequency part, channel pair NELA-BWE decoding then yields the downmixed mono full-band signal, and MCR (Maximum Correlation Rotation) decoding yields the channel pair PCM signal. Finally, it is determined whether the NELA adaptive multi-channel decoding function is turned on: if not, the channel pair PCM signal is output directly; if so, NELA adaptive multi-channel decoding is performed on two channel pairs to output a 4-channel PCM signal (i.e. two channel pair signals).
Wherein, the DRA + V2 target decoder directly uses the DRA + V2 sound channel decoder to decode the target code stream. The DRA + V2 metadata decoder uses Huffman decoding on the metadata stream.
A more specific example is decoding 5.1.4 content (5.1 middle-layer channels and 4 upper-layer channels) plus 4 target audios at a total bit rate of 384 kbps, played in a standard 5.1.4 speaker configuration. The decoding process is as follows:
(1) splitting the 3D code stream to obtain a 5.1.4 sound channel code stream, 4 target audio code streams and a metadata code stream;
(2) performing Huffman decoding on the metadata code stream to obtain the original metadata information;
(3) DRA decoding LFE in 5(L C R LS RS).1(LFE).4(TopL, TopR, TopLS, TopRS), up-sampling by 2 times and outputting LFE channel PCM signal;
(4) DRA + V2 independent channel decoding is carried out on the C channel in 5(L C R LS RS) 1(LFE) 4(TopL, TopR, TopLS, TopRS) to obtain a C channel PCM signal;
(5) performing DRA+V2 channel pair decoding on L, R, LS, RS and TopL, TopR, TopLS, TopRS in 5(L C R LS RS).1(LFE).4(TopL, TopR, TopLS, TopRS), i.e. decoding the four channel pairs L&R, LS&RS, TopL&TopR and TopLS&TopRS, to obtain PCM signals of the L, R, LS, RS, TopL, TopR, TopLS and TopRS channels;
(6) respectively carrying out independent channel DRA + V2 decoding on the 4 target audio code streams to obtain 4 target audio PCM signals;
(7) in a DRA mixer/renderer, rendering 4 target signals to 5.1.4 channels by adopting a VBAP algorithm according to relevant metadata information of the 4 target signals, and then mixing the 4 target rendered signals to the original 5.1.4 channels;
(8) finally, the 5.1.4 channel audio PCM signal is fed to a standard 5.1.4 loudspeaker system for playing.
If the audio is to be played through headphones, after step (7) the 5.1.4 channels are processed (according to the spatial position of each channel) with HRTFs (or Binaural Room Impulse Responses, BRIRs) to obtain a binaural signal for headphone playback; alternatively, after step (6), HRTF (or BRIR) processing can be applied to the 5.1.4 channels and the targets separately to obtain a binaural signal for headphone playback.
According to the embodiment of the invention, the input 3D audio code stream is divided into the sound channel code stream, the target code stream and the metadata code stream, the sound channel core decoder is used for decoding the sound channel code stream, the target decoder is used for decoding the target code stream, the metadata code stream is decoded by the metadata decoder, the sound channel signal, the target signal and the metadata are rendered, and the high-efficiency decoding of the 3D audio code stream is realized.
EXAMPLE III
An embodiment of the present invention provides a 3D audio encoding apparatus, which is capable of implementing all the processes of the 3D audio encoding method according to the first embodiment, and referring to fig. 28, the 3D audio encoding apparatus includes:
a first input module 301 for inputting a channel signal, a target signal and metadata;
a sound channel core encoder 302, configured to encode the sound channel signal by using a sound channel core encoding algorithm, so as to obtain a sound channel code stream;
the target encoder 303 is configured to encode the target signal to obtain a target code stream;
a metadata encoder 304, configured to encode the metadata to obtain a metadata code stream; and the number of the first and second groups,
an output module 305, configured to perform frame format packing on the sound track code stream, the target code stream, and the metadata code stream according to a 3D audio data structure, and output the 3D audio code stream;
the data structure of the 3D audio code stream comprises frame header information, sound channel coding information, target coding information and metadata coding information which are sequentially arranged; or the data structure of the 3D audio code stream comprises frame header information, sound channel coding information, metadata coding information related to the sound channel signal, target coding information and metadata coding information related to the target signal which are sequentially arranged;
the data structure of the channel code stream comprises frame header information, middle-layer channel coding information, control information of middle-layer channel BWE information, other-layer channel coding information, control information of other-layer channel BWE information and other-layer channel BWE information which are sequentially arranged; or the data structure of the channel code stream comprises frame header information, channel coding information, control information of channel BWE information and channel BWE information which are sequentially arranged;
the data structure of the target code stream comprises frame header information, target coding information, control information of target BWE information and target BWE information which are sequentially arranged;
the data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged.
According to the embodiment of the invention, for the input sound channel signal, the target signal and the metadata, a sound channel core encoder is adopted to encode the sound channel signal, a target encoder is adopted to encode the target signal, a metadata encoder is adopted to encode the metadata, and the encoded sound channel code stream, the target code stream and the metadata code stream are combined into the 3D audio code stream, so that the high-efficiency encoding of the 3D audio code stream is realized.
Example four
An embodiment of the present invention provides a 3D audio decoding apparatus, which is capable of implementing all the processes of the 3D audio decoding method according to the second embodiment, with reference to fig. 29, where the 3D audio decoding apparatus includes:
a second input module 401, configured to input a 3D audio code stream, and split the 3D audio code stream into a channel code stream, a target code stream, and a metadata code stream;
a sound channel core decoder 402, configured to decode the sound channel code stream to obtain a sound channel signal;
a target decoder 403, configured to decode the target code stream to obtain a target signal;
a metadata decoder 404, configured to decode the metadata code stream to obtain metadata; and the number of the first and second groups,
a renderer 405, configured to render the channel signal and the target signal according to the metadata, and output the rendered signals to a corresponding terminal according to user interaction information for playing;
the data structure of the 3D audio code stream comprises frame header information, sound channel coding information, target coding information and metadata coding information which are sequentially arranged; or the data structure of the 3D audio code stream comprises frame header information, sound channel coding information, metadata coding information related to the sound channel signal, target coding information and metadata coding information related to the target signal which are sequentially arranged;
the data structure of the channel code stream comprises frame header information, middle-layer channel coding information, control information of middle-layer channel BWE information, other-layer channel coding information, control information of other-layer channel BWE information and other-layer channel BWE information which are sequentially arranged; or the data structure of the channel code stream comprises frame header information, channel coding information, control information of channel BWE information and channel BWE information which are sequentially arranged;
the data structure of the target code stream comprises frame header information, target coding information, control information of target BWE information and target BWE information which are sequentially arranged;
the data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged.
According to the embodiment of the invention, the input 3D audio code stream is divided into the sound channel code stream, the target code stream and the metadata code stream, the sound channel core decoder is used for decoding the sound channel code stream, the target decoder is used for decoding the target code stream, the metadata code stream is decoded by the metadata decoder, the sound channel signal, the target signal and the metadata are rendered, and the high-efficiency decoding of the 3D audio code stream is realized.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (12)

1. A method for 3D audio coding, comprising the steps of:
s110, inputting a channel signal, a target signal and metadata;
s120, coding the sound channel signal through a sound channel core coder to obtain a sound channel code stream;
s130, encoding the target signal through a target encoder to obtain a target code stream;
s140, encoding the metadata through a metadata encoder to obtain a metadata code stream;
s150, packing the sound channel code stream, the target code stream and the metadata code stream in a frame format according to a 3D audio data structure, and outputting a 3D audio code stream;
the 3D audio data structure comprises frame header information, sound channel coding information, target coding information and metadata coding information which are sequentially arranged; or, the 3D audio data structure includes frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal, which are sequentially arranged;
the data structure of the channel code stream comprises frame header information, middle-layer channel coding information, control information of middle-layer channel BWE information, other-layer channel coding information, control information of other-layer channel BWE information and other-layer channel BWE information which are sequentially arranged; or the data structure of the channel code stream comprises frame header information, channel coding information, control information of channel BWE information and channel BWE information which are sequentially arranged;
the data structure of the target code stream comprises frame header information, target coding information, control information of target BWE information and target BWE information which are sequentially arranged;
the data structure of the metadata code stream comprises metadata control information and metadata coding information which are sequentially arranged;
wherein, the step S120 specifically includes:
s121, dividing the input sound channel signal into an LFE sound channel signal, an independent sound channel signal and a sound channel pair signal;
S122, performing 2× downsampling on the LFE channel signal and compressing it with perceptual audio coding to obtain the LFE channel code stream;
s123, encoding the independent sound channel signal to obtain an independent sound channel code stream;
s124, coding the sound channel pair signal to obtain a sound channel pair code stream;
s125, packing the LFE sound channel code stream, the independent sound channel code stream and the sound channel pair code stream according to a sound channel coding data structure in a frame format, and outputting the sound channel code stream;
the step S130 specifically includes:
detecting whether an input target signal needs to be encoded with reference to the associated metadata;
if yes, when the related metadata indicates that the target signal of the frame is present, coding the target signal as an independent channel signal in the channel signal using the channel core coding algorithm to obtain the target code stream;
if not, coding the target signal as an independent sound channel signal in the sound channel signal by adopting a sound channel core coding algorithm to obtain the target code stream;
the step S140 specifically includes:
when the input metadata is represented by floating points, quantization with different precisions is carried out according to the coding rate requirement of the metadata part, and entropy coding is carried out on quantized integer parameters to obtain the metadata code stream.
2. The 3D audio encoding method of claim 1, wherein the step S123 specifically includes:
carrying out waveform coding on a low-frequency part in the independent sound channel signal, and carrying out waveform parameter mixed coding on a high-frequency part in the independent sound channel signal to obtain an independent sound channel code stream;
the step S124 specifically includes:
and carrying out waveform coding on the low-frequency part in the sound channel pair signal, and carrying out waveform parameter mixed coding on the high-frequency part in the sound channel pair signal to obtain a sound channel pair code stream.
3. The 3D audio encoding method of claim 1, wherein the step S123 specifically includes:
S131, obtaining the coding rate requirement of the independent channel signal; if the coding rate requirement is high, executing step S132, and if the coding rate requirement is low or medium, executing step S133;
S132, performing perceptual audio coding on the independent channel signal to obtain the independent channel code stream;
S133, performing bandwidth extension coding on the high-frequency part of the independent channel signal to obtain a bandwidth extension parameter and high-frequency coding information; performing perceptual audio coding on the low-frequency part of the independent channel signal to obtain low-frequency coding information; taking the bandwidth extension parameter, the high-frequency coding information and the low-frequency coding information as the independent channel code stream;
the step S124 specifically includes:
s141, judging whether the channel pair signal has correlation with other channel pair signals; if yes, performing decorrelation processing on the signal of the channel with correlation, and executing step S142, otherwise, executing step S142;
s142, acquiring the coding rate requirement of the sound channel on the signal, if the coding rate requirement is low, executing the step S143, if the coding rate requirement is medium, executing the step S144, and if the coding rate requirement is high, executing the step S145;
S143, performing parametric stereo coding on the channel pair signal to obtain stereo parameters and a downmixed mono signal; performing bandwidth extension coding on the high-frequency part of the mono signal to obtain bandwidth extension parameters and high-frequency coding information; performing perceptual audio coding on the low-frequency part of the mono signal to obtain low-frequency coding information; taking the stereo parameters, the bandwidth extension parameters, the high-frequency coding information and the low-frequency coding information as the channel pair code stream;
S144, performing bandwidth extension coding on the high-frequency part of the channel pair signal to obtain bandwidth extension parameters and high-frequency coding information; performing perceptual audio coding on the low-frequency part of the channel pair signal to obtain low-frequency coding information; taking the bandwidth extension parameters, the high-frequency coding information and the low-frequency coding information as the channel pair code stream;
S145, performing perceptual audio coding on the channel pair signal to obtain the channel pair code stream.
4. The 3D audio coding method according to claim 3, wherein the method for generating the high-frequency harmonic signal in bandwidth extension coding comprises:
performing complex quadrature mirror filter bank analysis filtering on the input mono audio signal to obtain a plurality of subband signals of equal bandwidth;
performing complex linear predictive analysis filtering on each obtained subband signal to obtain the residual signal and the prediction coefficients of each subband; successively establishing the correspondence between all high-frequency subband residual signals and low-frequency subband residual signals, and encoding and outputting the subband residual copy parameters;
quantizing, encoding and outputting the prediction coefficients.
5. The 3D audio coding method according to claim 3 or 4, wherein the method for generating high-frequency detail in bandwidth extension coding comprises:
determining, for the input single-channel audio signal, the bandwidth of the low-frequency part to be copied and the bandwidth of the high-frequency part to be reconstructed at decoding; if the bandwidth of the reconstructed high-frequency part is larger than the bandwidth of the low-frequency part to be copied, or the high-frequency part contains a sinusoidal signal, taking the ratio of the reconstructed high-frequency bandwidth to the copied low-frequency bandwidth as a stretch factor and outputting the stretch factor; and
dividing the signal into time-frequency grids according to the transient characteristics of the input single-channel audio signal, calculating the spectral envelope of each grid, finding the most similar shape in a preset template shape library, and encoding and outputting the index of that shape in the library.
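A minimal sketch of the two encoder-side decisions in this claim, under assumed simplifications: the stretch-factor rule is taken literally from the claim, and template matching is reduced to a least-squares nearest shape (the actual similarity measure is not specified in the claim).

```python
def stretch_factor(hf_bw, lf_bw, has_sinusoid=False):
    """Emit a stretch factor only when the reconstructed high-frequency
    bandwidth exceeds the low-frequency copy bandwidth, or a sinusoidal
    component is present; otherwise None (no stretching is signaled)."""
    if hf_bw > lf_bw or has_sinusoid:
        return hf_bw / lf_bw
    return None

def match_envelope(envelope, template_library):
    """Return the index of the template shape closest to the measured grid
    envelope, using squared-error distance as a toy similarity measure."""
    def dist(t):
        return sum((e - v) ** 2 for e, v in zip(envelope, t))
    return min(range(len(template_library)), key=lambda i: dist(template_library[i]))
```

Only the stretch factor and the template index are transmitted, which is what keeps the high-frequency detail side information small.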
6. A 3D audio decoding method, comprising the steps of:
S210, inputting a 3D audio code stream and splitting it into a channel code stream, a target code stream, and a metadata code stream;
S220, decoding the channel code stream with a channel core decoder to obtain a channel signal;
S230, decoding the target code stream with a target decoder to obtain a target signal;
S240, decoding the metadata code stream with a metadata decoder to obtain metadata;
S250, rendering the channel signal and the target signal according to the metadata, and outputting the rendered signal to the corresponding terminal for playback according to user interaction information;
wherein the data structure of the 3D audio code stream comprises, arranged in sequence, frame header information, channel coding information, target coding information, and metadata coding information; or frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal;
the data structure of the channel code stream comprises, arranged in sequence, frame header information, middle-layer channel coding information, control information of the middle-layer channel BWE information, other-layer channel coding information, control information of the other-layer channel BWE information, and the other-layer channel BWE information; or frame header information, channel coding information, control information of the channel BWE information, and channel BWE information, arranged in sequence;
the data structure of the target code stream comprises, arranged in sequence, frame header information, target coding information, control information of the target BWE information, and target BWE information;
the data structure of the metadata code stream comprises metadata control information and metadata coding information, arranged in sequence;
wherein step S220 specifically comprises:
S221, splitting the channel code stream into an LFE channel code stream, an independent channel code stream, and a channel pair code stream;
S222, performing perceptual audio decoding on the LFE channel code stream and upsampling by a factor of 2 to obtain an LFE channel signal;
S223, decoding the independent channel code stream to obtain an independent channel signal;
S224, decoding the channel pair code stream to obtain a channel pair signal;
S225, outputting the LFE channel signal, the independent channel signal, and the channel pair signal together as the channel signal;
and step S230 specifically comprises:
detecting, with reference to the related metadata, whether the target code stream needs to be decoded;
if so, when the related metadata indicate audio, decoding the target code stream as an independent channel code stream within the channel code stream to obtain the target signal;
if not, decoding the target code stream as an independent channel code stream within the channel code stream to obtain the target signal.
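The top-level split in step S210 follows the frame data structure above (header first, then the channel, target, and metadata sub-streams in order). The sketch below demonstrates the idea with a toy length-prefixed container; the field widths and layout are assumptions for illustration, not the patent's actual frame syntax.

```python
def split_3d_stream(frame: bytes):
    """Split one frame into a header and the channel/target/metadata
    sub-streams. Toy layout: 2-byte header, then three payloads each
    prefixed by a 1-byte length (hypothetical, not the claimed syntax)."""
    header, rest = frame[:2], frame[2:]
    parts = []
    while rest:
        n = rest[0]
        parts.append(rest[1:1 + n])
        rest = rest[1 + n:]
    channel, target, metadata = parts
    return header, channel, target, metadata
```

Each sub-stream is then handed to its own decoder (steps S220, S230, S240) before rendering.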
7. The 3D audio decoding method according to claim 6, wherein step S223 specifically comprises:
performing waveform decoding on the low-frequency code stream within the independent channel code stream, and waveform-parameter decoding on the high-frequency code stream within the independent channel code stream, to obtain the independent channel signal;
and step S224 specifically comprises:
performing waveform decoding on the low-frequency code stream within the channel pair code stream, and waveform-parameter decoding on the high-frequency code stream within the channel pair code stream, to obtain the channel pair signal.
8. The 3D audio decoding method according to claim 6, wherein step S223 specifically comprises:
S231, detecting whether the independent channel code stream carries a bandwidth extension parameter; if so, performing step S232; if not, performing step S233;
S232, performing perceptual audio decoding on the low-frequency code stream of the independent channel code stream to obtain a low-frequency signal; performing bandwidth extension decoding on the high-frequency code stream of the independent channel code stream according to the bandwidth extension parameter to obtain a high-frequency signal; and taking the low-frequency signal and the high-frequency signal as the independent channel signal;
S233, performing perceptual audio decoding on the independent channel code stream to obtain the independent channel signal;
and step S224 specifically comprises:
S241, detecting whether the channel pair code stream carries a stereo parameter and a bandwidth extension parameter; if both are present, performing step S242; if only the bandwidth extension parameter is present, performing step S243; if neither is present, performing step S244;
S242, performing perceptual audio decoding on the low-frequency code stream of the channel pair code stream to obtain a low-frequency signal; performing bandwidth extension decoding on the high-frequency code stream of the channel pair code stream according to the bandwidth extension parameter to obtain a high-frequency signal; and performing parametric stereo decoding on the high-frequency and low-frequency signals according to the stereo parameter to obtain a full-band audio signal;
S243, performing perceptual audio decoding on the low-frequency code stream of the channel pair code stream to obtain a low-frequency signal; performing bandwidth extension decoding on the high-frequency code stream of the channel pair code stream according to the bandwidth extension parameter to obtain a high-frequency signal; and taking the low-frequency signal and the high-frequency signal as the full-band audio signal;
S244, performing perceptual audio decoding on the channel pair code stream to obtain the full-band audio signal;
S245, detecting whether the adaptive multi-channel decoding function of the channel core decoder is enabled; if so, performing adaptive multi-channel decoding on the full-band audio signal to obtain the channel pair signal; if not, taking the full-band audio signal as the channel pair signal.
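The branching in steps S241-S244 is a dispatch on which parameters the channel pair code stream carries. A sketch (the dictionary keys and path labels are hypothetical stand-ins for the real parameter flags and decoder calls):

```python
def decode_channel_pair(stream: dict) -> str:
    """Select the decode path for a channel pair code stream per steps
    S241-S244; returns a label naming the chosen path."""
    has_ps = "stereo_params" in stream
    has_bwe = "bwe_params" in stream
    if has_ps and has_bwe:           # S242: PS + BWE + perceptual low band
        return "perceptual_low + bwe_high + parametric_stereo"
    if has_bwe:                      # S243: BWE + perceptual low band
        return "perceptual_low + bwe_high"
    return "perceptual_fullband"     # S244: plain perceptual decode
```

Step S245 (adaptive multi-channel decoding) then applies to the full-band signal regardless of which path produced it.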
9. The 3D audio decoding method according to claim 8, wherein the method for generating the high-frequency sinusoidal signal in bandwidth extension decoding comprises:
performing complex QMF bank analysis filtering on the decoded low-frequency signal to obtain low-frequency sub-band signals;
performing complex linear predictive analysis filtering on the low-frequency sub-band signals to obtain low-frequency sub-band residual signals;
decoding and inverse-quantizing the prediction coefficients;
copying the low-frequency sub-band residual signals to the high-frequency sub-band residual signals using the decoded sub-band residual copy parameters, and then performing linear predictive synthesis filtering on each high-frequency sub-band according to its prediction coefficient to obtain a high-frequency sub-band detail signal, the prediction coefficients and the sub-band residual copy parameters being the parameters output when the high-frequency sinusoidal signal is generated in bandwidth extension coding; and
adjusting the high-frequency sub-band detail signal with the high-frequency envelope, and outputting the high-frequency sub-band signal.
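The synthesis filtering here is the inverse of the encoder-side analysis: the residual is pushed back through the predictor to regenerate the tonal sub-band. A first-order sketch (the order and function names are assumptions; the claim does not fix a predictor order):

```python
def lpc1_analyze(x, a):
    """First-order complex LPC analysis: e[n] = x[n] - a*x[n-1]."""
    prev, out = 0j, []
    for v in x:
        out.append(v - a * prev)
        prev = v
    return out

def lpc1_synthesize(e, a):
    """First-order complex LPC synthesis (the inverse filter used at the
    decoder): y[n] = e[n] + a*y[n-1]."""
    prev, out = 0j, []
    for v in e:
        prev = v + a * prev
        out.append(prev)
    return out
```

Synthesis exactly inverts analysis for the same coefficient, which is why transmitting the residual copy plus the quantized coefficient suffices to rebuild the high-frequency sub-band.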
10. The 3D audio decoding method according to claim 8 or 9, wherein the method for generating high-frequency detail in bandwidth extension decoding comprises:
copying and stretching the low-frequency part to be copied according to the stretch factor to obtain high-frequency detail spectral coefficients; and
finding, in the preset template shape library, the spectral envelope template corresponding to the shape index, and performing envelope adjustment on the high-frequency detail spectral coefficients with that template to obtain the high-frequency detail signal, the stretch factor and the shape index being the parameters output when high-frequency detail is generated in bandwidth extension coding.
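The decoder side of the high-frequency detail path can be sketched in a few lines; the sample-and-hold stretch and per-bin gain multiply are illustrative simplifications of the claimed copy/stretch and envelope adjustment.

```python
def generate_hf_detail(low_coeffs, stretch, template):
    """Copy and stretch the low-band spectral coefficients by the
    transmitted stretch factor (here: repetition of source bins), then
    shape the result with the envelope template selected by the shape
    index. A toy stand-in for the claimed patching scheme."""
    n_out = int(len(low_coeffs) * stretch)
    copied = [low_coeffs[int(i / stretch)] for i in range(n_out)]
    return [c * g for c, g in zip(copied, template)]
```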
11. A 3D audio encoding apparatus implementing the 3D audio encoding method of any one of claims 1 to 5, wherein the 3D audio encoding apparatus comprises:
a first input module for inputting a channel signal, a target signal, and metadata;
a channel core encoder for encoding the channel signal with a channel core encoding algorithm to obtain a channel code stream;
a target encoder for encoding the target signal to obtain a target code stream;
a metadata encoder for encoding the metadata to obtain a metadata code stream; and
an output module for packing the channel code stream, the target code stream, and the metadata code stream into a frame format according to a 3D audio data structure and outputting a 3D audio code stream;
wherein the 3D audio data structure comprises, arranged in sequence, frame header information, channel coding information, target coding information, and metadata coding information; or frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal;
the data structure of the channel code stream comprises, arranged in sequence, frame header information, middle-layer channel coding information, control information of the middle-layer channel BWE information, other-layer channel coding information, control information of the other-layer channel BWE information, and the other-layer channel BWE information; or frame header information, channel coding information, control information of the channel BWE information, and channel BWE information, arranged in sequence;
the data structure of the target code stream comprises, arranged in sequence, frame header information, target coding information, control information of the target BWE information, and target BWE information;
the data structure of the metadata code stream comprises metadata control information and metadata coding information, arranged in sequence;
wherein the channel core encoder is further configured to: divide the input channel signal into an LFE channel signal, an independent channel signal, and a channel pair signal; downsample the LFE channel signal by a factor of 2 and compress it with perceptual audio coding to obtain an LFE channel code stream; encode the independent channel signal to obtain an independent channel code stream; encode the channel pair signal to obtain a channel pair code stream; and pack the LFE channel code stream, the independent channel code stream, and the channel pair code stream into a frame format according to the channel coding data structure and output the channel code stream;
the target encoder is further configured to: detect, with reference to the related metadata, whether the input target signal needs to be encoded; if so, when the related metadata indicate that the target signal of the frame is present, encode the target signal as an independent channel signal within the channel signal using the channel core encoding algorithm to obtain the target code stream; if not, encode the target signal as an independent channel signal within the channel signal using the channel core encoding algorithm to obtain the target code stream;
the metadata encoder is further configured to: when the input metadata are represented in floating point, quantize them at a precision determined by the coding-rate requirement of the metadata part, and entropy-code the quantized integer parameters to obtain the metadata code stream.
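The metadata encoder's rate-dependent quantization can be sketched as uniform quantization whose bit depth follows the metadata-part rate budget; the subsequent entropy coding of the resulting integers is omitted here. The value range and rounding are assumptions for illustration.

```python
def quantize(values, bits):
    """Uniform quantization of floating-point metadata (nominally in
    [-1, 1]) to signed integers; `bits` varies with the target coding
    rate, giving coarser or finer precision. Out-of-range input clamps."""
    scale = (1 << (bits - 1)) - 1
    return [max(-scale, min(scale, round(v * scale))) for v in values]

def dequantize(q, bits):
    """Inverse mapping applied by the metadata decoder."""
    scale = (1 << (bits - 1)) - 1
    return [x / scale for x in q]
```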
12. A 3D audio decoding apparatus implementing the 3D audio decoding method of any one of claims 6 to 10, the 3D audio decoding apparatus comprising:
a second input module for inputting a 3D audio code stream and splitting it into a channel code stream, a target code stream, and a metadata code stream;
a channel core decoder for decoding the channel code stream to obtain a channel signal;
a target decoder for decoding the target code stream to obtain a target signal;
a metadata decoder for decoding the metadata code stream to obtain metadata; and
a renderer for rendering the channel signal and the target signal according to the metadata and outputting the rendered signal to the corresponding terminal for playback according to user interaction information;
wherein the data structure of the 3D audio code stream comprises, arranged in sequence, frame header information, channel coding information, target coding information, and metadata coding information; or frame header information, channel coding information, metadata coding information related to the channel signal, target coding information, and metadata coding information related to the target signal;
the data structure of the channel code stream comprises, arranged in sequence, frame header information, middle-layer channel coding information, control information of the middle-layer channel BWE information, other-layer channel coding information, control information of the other-layer channel BWE information, and the other-layer channel BWE information; or frame header information, channel coding information, control information of the channel BWE information, and channel BWE information, arranged in sequence;
the data structure of the target code stream comprises, arranged in sequence, frame header information, target coding information, control information of the target BWE information, and target BWE information;
the data structure of the metadata code stream comprises metadata control information and metadata coding information, arranged in sequence;
wherein the channel core decoder is further configured to: split the channel code stream into an LFE channel code stream, an independent channel code stream, and a channel pair code stream; perform perceptual audio decoding on the LFE channel code stream and upsample it by a factor of 2 to obtain an LFE channel signal; decode the independent channel code stream to obtain an independent channel signal; decode the channel pair code stream to obtain a channel pair signal; and output the LFE channel signal, the independent channel signal, and the channel pair signal together as the channel signal;
the target decoder is further configured to: detect, with reference to the related metadata, whether the target code stream needs to be decoded; if so, when the related metadata indicate audio, decode the target code stream as an independent channel code stream within the channel code stream to obtain the target signal; if not, decode the target code stream as an independent channel code stream within the channel code stream to obtain the target signal.
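The LFE path mentioned in claims 11 and 12 is symmetric: the encoder downsamples the LFE channel by 2 before perceptual coding, and the decoder upsamples by 2 afterward. A naive sketch of the resampling pair (a real codec would apply proper anti-aliasing and interpolation filters, omitted here):

```python
def downsample2(x):
    """2:1 decimation for the LFE encoder path (keep every other sample).
    No anti-aliasing low-pass filter is applied in this sketch."""
    return x[::2]

def upsample2(x):
    """2x upsampling for the LFE decoder path by linear interpolation,
    repeating the last sample to keep the output length at exactly 2x."""
    out = []
    for a, b in zip(x, x[1:]):
        out += [a, (a + b) / 2]
    out += [x[-1], x[-1]]
    return out
```

Halving the LFE sample rate before coding is cheap because the LFE channel is band-limited to low frequencies by definition.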
CN201811395574.8A 2018-11-22 2018-11-22 3D audio coding and decoding method and device Active CN109448741B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811395574.8A CN109448741B (en) 2018-11-22 2018-11-22 3D audio coding and decoding method and device


Publications (2)

Publication Number Publication Date
CN109448741A CN109448741A (en) 2019-03-08
CN109448741B true CN109448741B (en) 2021-05-11

Family

ID=65553431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811395574.8A Active CN109448741B (en) 2018-11-22 2018-11-22 3D audio coding and decoding method and device

Country Status (1)

Country Link
CN (1) CN109448741B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192521B (en) * 2020-01-13 2024-07-05 华为技术有限公司 Audio encoding and decoding method and audio encoding and decoding equipment
CN111768793B (en) * 2020-07-11 2023-09-01 北京百瑞互联技术有限公司 LC3 audio encoder coding optimization method, system and storage medium
MX2023001152A (en) * 2020-07-30 2023-04-05 Fraunhofer Ges Forschung Apparatus, method and computer program for encoding an audio signal or for decoding an encoded audio scene.
CN113411663B (en) * 2021-04-30 2023-02-21 成都东方盛行电子有限责任公司 Music beat extraction method for non-woven engineering
CN115497485A (en) * 2021-06-18 2022-12-20 华为技术有限公司 Three-dimensional audio signal coding method, device, coder and system
WO2023077284A1 (en) * 2021-11-02 2023-05-11 北京小米移动软件有限公司 Signal encoding and decoding method and apparatus, and user equipment, network side device and storage medium
CN117831546A (en) * 2022-09-29 2024-04-05 抖音视界有限公司 Encoding method, decoding method, encoder, decoder, electronic device, and storage medium
CN116368460A (en) * 2023-02-14 2023-06-30 北京小米移动软件有限公司 Audio processing method and device
CN116830193A (en) * 2023-04-11 2023-09-29 北京小米移动软件有限公司 Audio code stream signal processing method, device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2003222397A1 (en) * 2003-04-30 2004-11-23 Nokia Corporation Support of a multichannel audio extension
KR100636145B1 (en) * 2004-06-04 2006-10-18 삼성전자주식회사 Exednded high resolution audio signal encoder and decoder thereof
CN101202042A (en) * 2006-12-14 2008-06-18 中兴通讯股份有限公司 Expandable digital audio encoding frame and expansion method thereof
CA2929800C (en) * 2010-12-29 2017-12-19 Samsung Electronics Co., Ltd. Apparatus and method for encoding/decoding for high-frequency bandwidth extension
AR085445A1 (en) * 2011-03-18 2013-10-02 Fraunhofer Ges Forschung ENCODER AND DECODER THAT HAS FLEXIBLE CONFIGURATION FUNCTIONALITY
EP2830045A1 (en) * 2013-07-22 2015-01-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Concept for audio encoding and decoding for audio channels and audio objects
KR102243395B1 (en) * 2013-09-05 2021-04-22 한국전자통신연구원 Apparatus for encoding audio signal, apparatus for decoding audio signal, and apparatus for replaying audio signal
CN105280190B (en) * 2015-09-16 2018-11-23 深圳广晟信源技术有限公司 Bandwidth extension encoding and decoding method and device

Similar Documents

Publication Publication Date Title
CN109448741B (en) 3D audio coding and decoding method and device
JP7228607B2 (en) Audio encoder and decoder using frequency domain processor and time domain processor with full-band gap filling
KR102083200B1 (en) Apparatus and method for encoding or decoding multi-channel signals using spectrum-domain resampling
US9361896B2 (en) Temporal and spatial shaping of multi-channel audio signal
US7840411B2 (en) Audio encoding and decoding
CN109509478B (en) audio processing device
TWI550598B (en) Audio encoder, audio decoder, methods and computer program using jointly encoded residual signals
CN105378832B (en) Decoder, encoder, decoding method, encoding method, and storage medium
MX2007009887A (en) Near-transparent or transparent multi-channel encoder/decoder scheme.
JP7261807B2 (en) Acoustic scene encoder, acoustic scene decoder and method using hybrid encoder/decoder spatial analysis
CN105766002A (en) Method and device for compressing and decompressing sound field data of an area
KR20150073180A (en) Encoder, decoder and methods for backward compatible multi-resolution spatial-audio-object-coding
TW201732780A (en) Apparatus and method for MDCT M/S stereo with global ILD with improved mid/side decision
RU2804032C1 (en) Audio signal processing device for stereo signal encoding into bitstream signal and method for bitstream signal decoding into stereo signal implemented by using audio signal processing device
CN105336334B (en) Multi-channel sound signal coding method, decoding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant