CN101258538B - Method of encoding and decoding an audio signal - Google Patents
Abstract
An apparatus and method for encoding and decoding an audio signal are disclosed, which provide compatibility with players of ordinary mono or stereo audio signals when the audio signal is coded, and which allow spatial information for a multi-channel audio signal to be stored or transmitted even when no auxiliary data area is present. The present invention includes extracting side information embedded in an imperceptible component of the audio signal and decoding the audio signal using the extracted side information.
Description
Technical Field
The invention relates to a method for coding and decoding an audio signal.
Background
Recently, various encoding schemes and methods for digital audio signals have been actively researched and developed, and many products based on them have been brought to market.
Also, encoding schemes for changing a mono or stereo audio signal into a multi-channel audio signal using spatial information of the multi-channel audio signal have been developed.
However, some recording media provide no auxiliary data area in which spatial information can be stored when an audio signal is recorded. In that case only a mono or stereo audio signal can be stored and transmitted, so only the mono or stereo signal is reproduced and the resulting sound quality is flat.
Furthermore, in the case of separately storing or transmitting spatial information, there is a compatibility problem with a player of a general mono or stereo audio signal.
Disclosure of Invention
Accordingly, the present invention is directed to an apparatus for encoding and decoding an audio signal and method thereof that substantially obviate one or more problems due to limitations and disadvantages of the related art.
An object of the present invention is to provide an apparatus for encoding and decoding an audio signal and method thereof, by which compatibility with a player of a general mono or stereo audio signal can be provided when the audio signal is encoded.
Another object of the present invention is to provide an apparatus for encoding and decoding an audio signal and method thereof, by which spatial information of a multi-channel audio signal can be stored or transmitted without an auxiliary data area.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
To achieve these and other advantages and in accordance with the purpose of the present invention, a method of decoding an audio signal according to the present invention includes: extracting side information when the side information is embedded in the audio signal in frame units having a frame length defined per frame, or when the side information is attached to the audio signal in frame units; and decoding the audio signal using the extracted side information.
To further achieve these and other advantages and in accordance with the purpose of the present invention, a method of encoding an audio signal according to the present invention includes: generating the audio signal and side information required for decoding the audio signal; and either embedding the side information in the audio signal in frame units having a frame length defined per frame or attaching the side information to the audio signal in frame units.
To further achieve these and other advantages and in accordance with the purpose of the present invention, a data structure according to the present invention includes an audio signal and side information that is either embedded in an imperceptible component of the audio signal in frame units having a frame length defined per frame or attached, in frame units, to a region not used for decoding the audio signal.
To further achieve these and other advantages and in accordance with the purpose of the present invention, an apparatus for encoding an audio signal according to the present invention includes: an audio signal generating unit for generating an audio signal; a side information generating unit for generating side information required to decode the audio signal; and a side information attaching unit for either embedding the side information in the audio signal in frame units having a frame length defined per frame or attaching the side information to the audio signal in frame units.
To further achieve these and other advantages and in accordance with the purpose of the present invention, an apparatus for decoding an audio signal according to the present invention includes: a side information extracting unit for extracting side information when the side information is embedded in the audio signal in frame units having a frame length defined per frame or when the side information is attached to the audio signal in frame units; and a multi-channel generating unit for decoding the audio signal using the extracted side information.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
Brief Description of Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention.
In the drawings:
fig. 1 is a diagram for explaining how a human listener perceives spatial information of an audio signal according to the present invention;
FIG. 2 is a block diagram of a spatial encoder according to the present invention;
FIG. 3 is a detailed block diagram of an embedding unit for configuring the spatial encoder shown in FIG. 2 according to the present invention;
fig. 4 is a diagram of a first method for rearranging a spatial information bitstream according to the present invention;
fig. 5 is a diagram of a second method for rearranging a spatial information bitstream according to the present invention;
fig. 6A is a diagram of a reshaped spatial information bitstream according to the present invention;
fig. 6B is a detailed diagram of a configuration of a spatial information bitstream shown in fig. 6A;
FIG. 7 is a block diagram of a spatial decoder according to the present invention;
fig. 8 is a detailed block diagram of an embedded signal decoder included in a spatial decoder according to the present invention;
fig. 9 is a diagram for explaining a case where a general PCM decoder reproduces an audio signal according to the present invention;
fig. 10 is a flowchart of an encoding method for embedding spatial information in a downmix (downmix) signal according to the present invention;
fig. 11 is a flowchart of a method for decoding spatial information embedded in a downmix signal according to the present invention;
fig. 12 is a diagram of a frame size of a spatial information bitstream embedded in a downmix signal according to the present invention;
fig. 13 is a diagram of a spatial information bitstream embedded in a downmix signal at a fixed size according to the present invention;
fig. 14A is a diagram for explaining a first method for solving a temporal alignment problem of a spatial information bitstream embedded at a fixed size;
fig. 14B is a diagram for explaining a second method for solving a temporal alignment problem of a spatial information bitstream embedded at a fixed size;
fig. 15 is a diagram of a method for attaching a spatial information bitstream to a downmix signal according to the present invention;
fig. 16 is a flowchart of a method for encoding spatial information bitstreams embedded in a downmix signal at different sizes according to the present invention;
fig. 17 is a flowchart of a method for encoding a spatial information bitstream embedded in a downmix signal at a fixed size according to the present invention;
fig. 18 is a diagram of a first method of embedding a spatial information bitstream into an audio signal downmixed on at least one channel according to the present invention;
fig. 19 is a diagram of a second method of embedding a spatial information bitstream into an audio signal downmixed on at least one channel according to the present invention;
fig. 20 is a diagram of a third method of embedding a spatial information bitstream into an audio signal downmixed on at least one channel according to the present invention;
fig. 21 is a diagram of a fourth method of embedding a spatial information bitstream into an audio signal downmixed on at least one channel according to the present invention;
fig. 22 is a diagram of a fifth method of embedding a spatial information bitstream into an audio signal downmixed on at least one channel according to the present invention;
fig. 23 is a diagram of a sixth method of embedding a spatial information bitstream into an audio signal downmixed on at least one channel according to the present invention;
fig. 24 is a diagram of a seventh method of embedding a spatial information bitstream into an audio signal downmixed on at least one channel according to the present invention;
fig. 25 is a flowchart of a method for encoding a spatial information bitstream to be embedded in an audio signal downmixed on at least one channel according to the present invention; and
fig. 26 is a flowchart of a method for decoding a spatial information bitstream to be embedded in an audio signal downmixed on at least one channel according to the present invention.
Detailed Description
Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings.
First of all, the present invention relates to an apparatus for embedding auxiliary information required for decoding an audio signal into the audio signal and a method thereof. For convenience of explanation, the audio signal and the side information are respectively represented by a downmix signal and spatial information in the following description, which do not limit the present invention in any way. In this case, the audio signal includes a PCM signal.
Fig. 1 is a diagram for explaining how a human listener perceives spatial information of an audio signal according to the present invention.
Referring to fig. 1, an encoding scheme for a multi-channel audio signal exploits the fact that, because a human listener perceives sound in three dimensions, the spatial character of an audio signal can be represented by a set of spatial parameters.
Spatial parameters for representing spatial information of a multi-channel audio signal include CLD (channel level difference), ICC (inter-channel coherence), CTD (channel time difference), and the like. CLD represents an energy difference between two channels, ICC represents a correlation between two channels, and CTD represents a time difference between two channels.
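Purely as a hedged illustration of how such parameters relate to a pair of channel signals (the patent does not prescribe this computation, and the per-band analysis and quantization of a real spatial encoder are omitted), a minimal sketch might look like this; the function name is hypothetical:

```python
import numpy as np

def spatial_parameters(left, right, eps=1e-12):
    """Hypothetical per-block estimate of CLD, ICC and CTD for two channels."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)

    # CLD: energy difference between the two channels, in dB.
    cld_db = 10.0 * np.log10((np.sum(left ** 2) + eps) / (np.sum(right ** 2) + eps))

    # ICC: normalized correlation between the two channels.
    icc = np.dot(left, right) / (np.linalg.norm(left) * np.linalg.norm(right) + eps)

    # CTD: lag (in samples) that maximizes the cross-correlation.
    xcorr = np.correlate(left, right, mode="full")
    ctd = int(np.argmax(xcorr)) - (len(right) - 1)

    return cld_db, icc, ctd
```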
How a human listener spatially perceives an audio signal and how the spatial parameters are generated is explained with reference to fig. 1.
A direct sound wave 103 reaches the left ear of the human body from a remote sound source 101, while another direct sound wave 102 is diffracted around the head to reach the right ear 106 of the human body.
The two sound waves 102 and 103 differ from each other in arrival time and energy level. And, CTD and CLD parameters are generated by using the differences.
If the reflected sound waves 104 and 105 reach the two ears, respectively, or if the sound sources are scattered, sound waves having no correlation with each other will reach the two ears, respectively, to generate the ICC parameter.
Using the spatial parameters generated according to the above principle, it is possible to transmit a multi-channel audio signal as a mono or stereo signal and output the signal as a multi-channel signal.
The present invention provides a method of embedding spatial information, i.e., spatial parameters, in a mono or stereo audio signal, transmitting the embedded signal, and reproducing the transmitted signal as a multi-channel audio signal. The invention is not limited to multi-channel audio signals. In the following description of the present invention, a multi-channel audio signal is explained for convenience of explanation.
Fig. 2 is a block diagram of an encoding apparatus according to the present invention.
Referring to fig. 2, the encoding apparatus according to the present invention receives a multi-channel audio signal 201. In this case, 'n' refers to the number of input channels.
The multi-channel audio signal 201 is converted into downmix signals (Lo and Ro)205 by an audio signal generation unit 203. The downmix signal comprises a mono or stereo audio signal and may be a multi-channel audio signal. In the present invention, a stereo audio signal will be used as an example in the following description. However, the present invention is not limited to stereo audio signals.
Spatial information of the multi-channel audio signal, i.e., spatial parameters, is generated from the multi-channel audio signal 201 by the side information generating unit 204. In the present invention, spatial information means the information about the audio channels that is needed when a downmix signal 205, generated by downmixing a multi-channel (e.g., left, right, center, left surround, right surround) signal, is transmitted and later upmixed again into a multi-channel audio signal. Optionally, a downmix signal provided directly from the outside, such as the artistic downmix signal 202, may be used instead of the downmix signal 205.
The spatial information generated by the side information generating unit 204 is encoded into a spatial information bitstream by the side information encoding unit 206 for transmission and storage.
The spatial information bitstream is appropriately shaped by the embedding unit 207 so that it can be inserted directly into the audio signal to be transmitted, i.e., the downmix signal 205. In doing so, a 'digital audio embedding method' may be used.
For example, unlike the case of compression coding such as AAC, when the downmix signal 205 is an original PCM audio signal that is to be stored on a storage medium on which spatial information is difficult to store (e.g., a stereo compact disc) or to be transmitted over SPDIF (Sony/Philips Digital Interface), there is no auxiliary data field in which the spatial information can be stored.
In this case, if the 'digital audio embedding method' is used, the spatial information can be embedded in the original PCM audio signal without perceptible quality degradation. Moreover, from the viewpoint of a general decoder, the audio signal in which the spatial information is embedded does not differ from the original signal. That is, a general PCM decoder may regard the output signal Lo'/Ro' 208, in which spatial information is embedded, as the same signal as the input signal Lo/Ro 205.
As the 'digital audio embedding method', there are 'a bit substitution coding method', 'an echo hiding method', 'a spread spectrum based method', and the like.
The bit substitution coding method is a method of inserting specific information by modifying lower bits of quantized audio samples. In an audio signal, the modification of lower bits has little effect on the quality of the audio signal.
An echo concealment method is a method of inserting an echo in an audio signal that is small enough not to be perceived by the human ear.
And the spread spectrum based method is a method of transforming the audio signal into the frequency domain via a discrete cosine transform, discrete Fourier transform, or the like, spreading the specific binary information with a PN (pseudo noise) sequence, and adding the spread signal to the frequency-domain audio signal.
In the present invention, the bit substitution encoding method will be mainly explained in the following description. However, the present invention is not limited to the bit substitution encoding method.
Fig. 3 is a detailed block diagram of an embedding unit for configuring the spatial encoder shown in fig. 2 according to the present invention.
Referring to fig. 3, when spatial information is embedded in an imperceptible component of the downmix signal by the bit substitution coding method, the insertion bit length used for embedding the spatial information (hereinafter referred to as the 'K value') may be K bits (K > 0) determined according to a predetermined method, rather than only the lowest single bit. The K bits may be, but are not limited to, lower bits of the downmix signal. In this case, the predetermined method is, for example, a method of finding a masking threshold according to a psychoacoustic model and allocating an appropriate number of bits according to that masking threshold.
The downmix signal Lo/Ro 301 as shown in the figure is transferred to the audio signal encoding unit 306 via the buffer 303 within the embedding unit.
The masking threshold calculation unit 304 segments the input audio signal into predetermined sections (e.g., blocks) and then finds masking thresholds for the respective sections.
The masking threshold calculation unit 304 finds an insertion bit length (i.e., K value) of the downmix signal that enables modification according to the masking threshold without occurrence of auditory distortion. That is, the number of bits used in embedding spatial information into the downmix signal is allocated per block.
In the description of the present invention, a block represents a data unit inserted using one insertion bit length (i.e., K value) existing within a frame.
At least one or more blocks may exist within one frame. If the frame length is fixed, the block length may decrease as the number of blocks increases.
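A simplified, hypothetical sketch of the per-block allocation described above: assuming the masking threshold has already been converted into a tolerable noise power per block, K is chosen as the largest number of low-order bits whose replacement noise stays below that threshold. A real psychoacoustic model is far more involved, so this is only an illustration of the idea.

```python
def bits_per_block(masking_noise_power):
    """Largest K whose bit-replacement noise (roughly uniform over [0, 2**K),
    power about 2**(2*K) / 12) stays below the per-block masking threshold,
    expressed here as a tolerable noise power. Purely illustrative."""
    k = 0
    while (2 ** (2 * (k + 1))) / 12.0 <= masking_noise_power:
        k += 1
    return k

# One K value per block, given hypothetical per-block noise thresholds.
thresholds = [30.0, 500.0, 6.0]
k_values = [bits_per_block(t) for t in thresholds]   # -> [4, 6, 3]
```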
Once the K value is determined, the K value can be included in the spatial information bitstream. That is, the bitstream shaping unit 305 can shape the spatial information bitstream in such a manner that the spatial information bitstream can include the K value therein. In this case, a sync word, an error detection code, an error correction code, etc. may be included in the spatial information bitstream.
The reshaped spatial information bitstream may be rearranged into an embeddable form. The rearranged spatial information bitstream is embedded into the downmix signal by the audio signal encoding unit 306 and then output as the audio signal Lo'/Ro' 307 in which the spatial information bitstream is embedded. In this case, the spatial information bitstream may be embedded in K bits of the downmix signal, and the K value may be fixed within one block. In short, the K value is inserted into the spatial information bitstream during the reshaping or rearranging of the spatial information bitstream and then transferred to the decoding apparatus, which can extract the spatial information bitstream using the K value.
As mentioned in the foregoing description, the spatial information bitstream undergoes a process of being embedded in the downmix signal by blocks. The process is performed by one of various methods.
The first method simply replaces the lower K bits of the downmix signal with zeros and adds the shaped spatial information bitstream data. For example, if the K value is 3, the sample data of the downmix signal is 11101101, and the spatial information bitstream data to be embedded is 111, the lower 3 bits of '11101101' are replaced with zeros to give '11101000', and the spatial information bitstream data '111' is then added to '11101000' to give '11101111'.
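A minimal sketch of this first method for a single quantized sample, reproducing the numeric example above (the helper name is hypothetical):

```python
def embed_bits_replace(sample, data_bits, k):
    """Replace the lower k bits of a quantized sample with the payload bits."""
    cleared = sample & ~((1 << k) - 1)   # zero the lower k bits
    return cleared | data_bits           # add the spatial-information bits

sample = 0b11101101   # downmix sample from the example
payload = 0b111       # 3 bits of spatial information, K = 3
assert embed_bits_replace(sample, payload, 3) == 0b11101111
```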
The second method uses a dithering method. First, the rearranged spatial information bitstream data is subtracted from the insertion region of the downmix signal. The downmix signal is then re-quantized based on the K value, and the rearranged spatial information bitstream data is added to the re-quantized downmix signal. For example, if the K value is 3, the sample data of the downmix signal is 11101101, and the spatial information bitstream data to be embedded is 111, '111' is subtracted from '11101101' to give '11100110'; the lower 3 bits are then re-quantized (by rounding) to give '11101000'; and '111' is added to '11101000' to give '11101111'.
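A corresponding sketch of the second method, following the subtract/re-quantize/add steps of the example above; the exact rounding convention is an assumption, since the text only says "by rounding off":

```python
def embed_bits_requantize(sample, data_bits, k):
    """Subtract the payload, re-quantize to a multiple of 2**k by rounding,
    then add the payload back (second embedding method)."""
    step = 1 << k
    reduced = sample - data_bits                      # 11101101 - 111 = 11100110
    requantized = int(round(reduced / step)) * step   # -> 11101000
    return requantized + data_bits                    # -> 11101111

assert embed_bits_requantize(0b11101101, 0b111, 3) == 0b11101111
```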
Since the spatial information bitstream embedded in the downmix signal is an arbitrary bitstream, it may not have white noise characteristics. Because adding a white-noise-like signal is more favorable for the sound quality of the downmix signal, the spatial information bitstream is subjected to a whitening process before being added to the downmix signal. The whitening process is applied to the spatial information bitstream except for the sync word.
In the present invention, 'whitening' means making a signal have an equal, or nearly equal, level in all regions of the frequency domain.
In addition, when embedding the spatial information bitstream in the downmix signal, it is possible to minimize an auditory distortion by applying a noise shaping method to the spatial information bitstream.
In the present invention, 'noise shaping method' denotes either a process of modifying the noise characteristics so that the energy of the quantization noise is moved toward higher frequencies within the audible band, or a process of generating a time-varying filter corresponding to the masking threshold derived from the corresponding audio signal and modifying the characteristics of the quantization noise with the generated filter.
Fig. 4 is a diagram of a first method for rearranging a spatial information bitstream according to the present invention.
Referring to fig. 4, as mentioned in the previous description, the spatial information bitstream can be rearranged into an embeddable form using a K value. In this case, the spatial information bitstream can be embedded into the downmix signal by being rearranged in various ways. Also, fig. 4 illustrates a method of embedding spatial information in a sample plane order.
The first method is a method of rearranging a spatial information bitstream in such a manner that the spatial information bitstream of a corresponding block is spread in units of K bits and the spread spatial information bitstream is sequentially embedded.
If the K value is 4 and if one block 405 is composed of N samples 403, the spatial information bitstream 401 may be rearranged to be sequentially embedded in the lower 4 bits of each sample.
As mentioned in the foregoing description, the present invention is not limited to the case where the spatial information bitstream is embedded in the lower 4 bits of each sample.
Also, in the lower K bits of each sample, the spatial information bitstream may be embedded first in the MSB (most significant bit) or first in the LSB (least significant bit), as shown.
In fig. 4, an arrow 404 indicates an embedding direction and numerals within parentheses indicate a data rearrangement order.
The bit plane indicates a specific bit layer composed of a plurality of bits.
In case that the number of bits of the spatial information bitstream to be embedded is smaller than the number of embeddable bits in the insertion area in which the spatial information bitstream is to be embedded, the remaining bits 406 are padded with zeros, a random signal is inserted in the remaining bits, or the remaining bits may be replaced with the original downmix signal.
For example, if the number of samples (N) configured as one block is 100 and the K value is 4, the number of embeddable bits (W) in the block is W = N × K = 100 × 4 = 400.
If the number of bits (V) of the spatial information bitstream to be embedded is 390 (i.e., V < W), the remaining 10 bits can be padded with zeros, filled with a random signal, replaced with the original downmix signal, padded with a tail sequence indicating the end of the data, or padded with a combination thereof. The tail sequence is a bit sequence indicating the end of the spatial information bitstream in the corresponding block. Although fig. 4 shows the remaining bits being padded per block, the present invention also includes padding the remaining bits per insertion frame in the manner described above.
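A sketch of this sample-plane rearrangement under the stated example (N = 100 samples, K = 4, remaining bits padded with zeros); packing MSB-first within the lower K bits is one of the allowed orderings, and the helper name is hypothetical:

```python
def embed_sample_plane(samples, payload_bits, k):
    """Spread the payload over the lower k bits of each sample, sample by sample.
    Leftover capacity is padded with zero bits (one of the options in the text)."""
    capacity = len(samples) * k
    bits = list(payload_bits) + [0] * (capacity - len(payload_bits))
    out = []
    for i, sample in enumerate(samples):
        chunk = bits[i * k:(i + 1) * k]
        value = 0
        for b in chunk:                      # pack k bits, first bit into the highest of the k positions
            value = (value << 1) | b
        out.append((sample & ~((1 << k) - 1)) | value)
    return out

# 100 samples, K = 4 -> 400 embeddable bits; a 390-bit payload leaves 10 zero bits.
samples = [0b10110111] * 100
payload = [1, 0] * 195                       # 390 payload bits
embedded = embed_sample_plane(samples, payload, 4)
```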
Fig. 5 is a diagram of a second method for rearranging a spatial information bitstream according to the present invention.
Referring to fig. 5, the second method is implemented in a manner of rearranging a spatial information bitstream 501 in order of a bit plane 502. In this case, the spatial information bitstream may be sequentially embedded from lower bits of the downmix signal in blocks, although this does not set any limit to the present invention.
For example, if the number of samples (N) configured as one block is 100 and the K value is 4, the 100 least significant bits constituting bit plane 0 (502) are filled first, and then the 100 bits constituting bit plane 1 may be filled.
In fig. 5, an arrow 505 indicates an embedding direction and numerals in parentheses indicate a data rearrangement order.
This second method is particularly advantageous when extracting sync words at random positions. When searching for a sync word of an inserted spatial information bitstream from a rearranged and encoded signal, only LSBs may be extracted to search for the sync word.
Moreover, the second method can use only the minimum number of lower bit planes required by the number of bits (V) of the spatial information bitstream to be embedded. In this case, if the number of bits (V) to be embedded is less than the number of embeddable bits (W) of the insertion area in which the spatial information bitstream is to be embedded, the remaining bits 506 can be padded with zeros, filled with a random signal, replaced with the original downmix signal, padded with a tail sequence indicating the end of the data, or padded with a combination thereof. In particular, the method of using the original downmix signal is advantageous. Although fig. 5 shows an example of padding the remaining bits per block, the present invention also includes padding the remaining bits per insertion frame in the above-described manner.
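A sketch of the bit-plane rearrangement: the payload first fills the LSB plane of all samples in the block, then the next plane, and so on, so a sync word placed at the start of the payload can be found by reading only LSBs. Zero padding is used here for the leftover bits; the other padding options in the text would work equally well, and the helper names are hypothetical.

```python
def embed_bit_plane(samples, payload_bits, k):
    """Fill bit plane 0 (the LSBs) of all samples first, then plane 1, etc."""
    n = len(samples)
    capacity = n * k
    bits = list(payload_bits) + [0] * (capacity - len(payload_bits))
    out = list(samples)
    for plane in range(k):
        for i in range(n):
            bit = bits[plane * n + i]
            out[i] = (out[i] & ~(1 << plane)) | (bit << plane)
    return out

def read_lsb_plane(samples):
    """Decoder-side helper: extract only the LSBs, e.g. to search for a sync word."""
    return [s & 1 for s in samples]
```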
Fig. 6A illustrates a bit stream structure for embedding a spatial information bit stream in a downmix signal according to the present invention.
Referring to fig. 6A, a spatial information bitstream 607 may be rearranged by a bitstream shaping unit 305 to include a sync word 603 and a K value 604 for the spatial information bitstream.
Also, at least one error detection code or error correction code 606 or 608 (an error detection code will be described below) may be included in the shaped spatial information bitstream in the shaping process. The error detection code can determine whether the spatial information bitstream 607 is distorted during transmission or storage.
The error detection code includes, for example, a CRC (cyclic redundancy check). The error detection code can be included in two parts: error detection code 1 for the header 601 containing the K value and error detection code 2 for the frame data 602 of the spatial information bitstream may be included separately in the spatial information bitstream. Furthermore, remaining information 605 may be separately included in the spatial information bitstream, and information on the rearrangement method of the spatial information bitstream and the like may be included in the remaining information 605.
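A hypothetical sketch of such a shaped frame, with a sync word, the K value, and two CRCs (one over the header, one over the frame data). The field widths and the CRC polynomial are assumptions made only for illustration; the patent does not fix them.

```python
import zlib

SYNC_WORD = 0xA5A5          # assumed 16-bit sync word

def shape_frame(k_value, frame_data: bytes) -> bytes:
    """Build: sync word | K | CRC over header | frame data | CRC over data."""
    header = SYNC_WORD.to_bytes(2, "big") + bytes([k_value])
    header_crc = (zlib.crc32(header) & 0xFF).to_bytes(1, "big")      # error detection code 1
    data_crc = (zlib.crc32(frame_data) & 0xFFFF).to_bytes(2, "big")  # error detection code 2
    return header + header_crc + frame_data + data_crc
```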
Fig. 6B is a detailed diagram of a configuration of a spatial information bitstream shown in fig. 6A. Fig. 6B illustrates an embodiment in which one frame of a spatial information bitstream 601 includes two blocks (the present invention is not limited thereto).
Referring to fig. 6B, the spatial information bitstream shown in fig. 6B includes a sync word 612, K values (K1, K2, K3, K4)613 to 616, remaining information 617, and error detection codes 618 and 623.
The spatial information bitstream 610 includes a pair of blocks. In the case of a stereo signal, block 1 may be composed of blocks 619 and 620 for the left and right channels, respectively. And block 2 may be composed of blocks 621 and 622 for the left and right channels, respectively.
Although a stereo signal is shown in fig. 6B, the present invention is not limited to the stereo signal.
The insertion bit length (K value) of these blocks is included in the header portion.
K1 (613) indicates the insertion bit length of the left channel of block 1, K2 (614) indicates the insertion bit length of the right channel of block 1, K3 (615) indicates the insertion bit length of the left channel of block 2, and K4 (616) indicates the insertion bit length of the right channel of block 2.
Also, the error detection code can be included in two parts. For example, error detection code 1 (618) for the header 609 including the K values and error detection code 2 for the frame data 611 of the spatial information bitstream may be included separately.
Fig. 7 is a block diagram of a decoding apparatus according to the present invention.
Referring to fig. 7, the decoding apparatus according to the present invention receives an audio signal Lo '/Ro' 701 in which a spatial information bitstream is embedded.
The audio signal in which the spatial information bitstream is embedded may be one of a mono, stereo and multi-channel signal. For convenience of explanation, a stereo signal is used as an example of the present invention, but this does not set any limit to the present invention.
The embedded signal decoding unit 702 can extract a spatial information bitstream from the audio signal 701.
The spatial information bitstream extracted by the embedded signal decoding unit 702 is an encoded spatial information bitstream. And, the encoded spatial information bitstream may be an input signal to the spatial information decoding unit 703.
The spatial information decoding unit 703 decodes the encoded spatial information bitstream and then outputs the decoded spatial information bitstream to the multi-channel generating unit 704.
The multi-channel generating unit 704 receives the downmix signal 701 and spatial information resulting from decoding as inputs and then outputs the received inputs as a multi-channel audio signal 705.
Fig. 8 is a detailed block diagram of the embedded signal decoding unit 702 for configuring the decoding apparatus according to the present invention.
Referring to fig. 8, an audio signal Lo '/Ro' in which spatial information is embedded is input to an embedded signal decoding unit 702. Also, the sync word search unit 802 detects a sync word from the audio signal 801. In this case, the sync word may be detected from one channel of the audio signal.
After the sync word has been detected, the header decoding unit 803 decodes the header region. Information of a predetermined length is extracted from the header area, and the data inverse modification unit 804 can apply an inverse whitening scheme to the extracted header information other than the sync word.
The length information of the header area and the like may then be obtained from the header information to which the inverse whitening scheme has been applied.
The data inverse modification unit 804 can also apply the inverse whitening scheme to the remaining spatial information bitstream. Information such as the K value can be obtained by header decoding, and the original spatial information bitstream can be recovered by rearranging the rearranged spatial information bitstream again using information such as the K value. In addition, synchronization position information between the frames of the downmix signal and the spatial information bitstream, i.e., frame arrangement information 806, may be obtained.
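A simplified decoder-side sketch that mirrors these blocks: locate the sync word in the LSB plane, then gather the lower K bits of the following samples once K has been read from the (de-whitened) header. Error detection and the exact frame layout are glossed over, and the helper names are hypothetical.

```python
def find_sync(lsb_bits, sync_pattern):
    """Return the index where the sync pattern starts in the LSB plane, or -1."""
    n, m = len(lsb_bits), len(sync_pattern)
    for i in range(n - m + 1):
        if lsb_bits[i:i + m] == sync_pattern:
            return i
    return -1

def extract_lower_bits(samples, k):
    """Collect the lower k bits of each sample (sample-plane layout, MSB-first)."""
    bits = []
    for s in samples:
        for plane in range(k - 1, -1, -1):
            bits.append((s >> plane) & 1)
    return bits
```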
Fig. 9 is a diagram for explaining a case where a general PCM decoding apparatus reproduces an audio signal according to the present invention.
Referring to fig. 9, an audio signal Lo '/Ro' in which a spatial information bitstream is embedded is used as an input of a general PCM decoding apparatus.
A general PCM decoding apparatus recognizes an audio signal Lo '/Ro' in which a spatial information bitstream is embedded as a general stereo audio signal to reproduce sound. Also, the reproduced sound is not different from the audio signal 902 before embedding the spatial information in terms of sound quality.
Accordingly, the audio signal in which spatial information is embedded according to the present invention is compatible with normal reproduction of a stereo signal in a general PCM decoding apparatus and has an advantage of providing a multi-channel audio signal in a decoding apparatus capable of multi-channel decoding.
Fig. 10 is a flowchart of an encoding method for embedding spatial information in a downmix signal according to the present invention.
Referring to fig. 10, an audio signal is downmixed from a multichannel signal (1001, 1002). In this case, the downmix signal may be one of a mono, stereo and multi-channel signal.
Next, spatial information is extracted from the multi-channel signal (1003). And generates a spatial information bitstream using the spatial information (1004).
The spatial information bitstream is embedded in the downmix signal (1005).
And, the entire bitstream including the downmix signal in which the spatial information bitstream is embedded is transmitted to the decoding apparatus (1006).
In particular, the present invention finds an insertion bit length (i.e., K value) of an insertion area in which a spatial information bitstream is to be inserted using a downmix signal and can embed the spatial information bitstream in the insertion area.
Fig. 11 is a flowchart of a method of decoding spatial information embedded in a downmix signal according to the present invention.
Referring to fig. 11, a decoding apparatus receives an entire bitstream including a downmix signal having a spatial information bitstream embedded therein (1101) and extracts the downmix signal from the bitstream (1102).
The decoding apparatus extracts and decodes a spatial information bitstream from the entire bitstream (1103).
The decoding apparatus extracts spatial information by decoding (1104) and then decodes the downmix signal using the extracted spatial information (1105). In this case, the downmix signal may be decoded into two channels or a plurality of channels.
In particular, the present invention may extract information of a spatial information bitstream embedding method and information of a K value and may decode the spatial information bitstream using the extracted embedding method and the extracted K value.
Fig. 12 is a diagram of a frame length of a spatial information bitstream embedded in a downmix signal according to the present invention.
Referring to fig. 12, a 'frame' represents a unit that has one header and whose predetermined length allows it to be decoded independently. In the description of the present invention, 'frame' refers to the 'insertion frame' introduced below. In the present invention, 'insertion frame' denotes the unit in which a spatial information bitstream is embedded in the downmix signal.
Also, the length of the insertion frame may be defined per frame or a predetermined length may be used.
For example, the insertion frame length may be made equal to the frame length (S) of the unit in which spatial information in a spatial information bitstream is decoded and applied (hereinafter referred to as the 'decoding frame length') (see fig. 12(a)), may be a multiple of S (see fig. 12(b)), or may be such that S is a multiple of the insertion frame length N (see fig. 12(c)).
In the case where N is S, as shown in fig. 12(a), the decoding frame length (S, 1201) coincides with the insertion frame length (N, 1202) to facilitate the decoding process.
In the case where N > S, as shown in fig. 12(b), the number of bits attached due to a header, an error detection code (e.g., CRC), etc. can be reduced by concatenating a plurality of decoded frames (1203) to transmit one inserted frame (N, 1204).
In the case where N < S, one decoded frame (S, 1205) may be configured by concatenating several inserted frames (N, 1206), as shown in fig. 12 (c).
In the insertion frame header, information on the insertion bit length used for embedding the spatial information, information on the insertion frame length (N), information on the number of subframes included in the insertion frame, and the like may be inserted.
Fig. 13 is a diagram of a spatial information bitstream embedded in a downmix signal in an inserted frame unit according to the present invention.
First, in each case shown in fig. 12(a), 12(b), 12(c), the insertion frame and the decoding frame are configured to be multiples of each other.
Referring to fig. 13, for transmission, a fixed-length bit stream, for example, a packet 1303 in a Transport Stream (TS) format, may be configured.
In particular, the spatial information bitstream 1301 may be divided into packet units of a predetermined length regardless of the decoding frame length of the spatial information bitstream. Packets into which information such as the TS header 1302 has been inserted are transmitted to the decoding apparatus. The length of the insertion frame may be defined per frame, or a predetermined length not defined within the frame may be used.
This method is needed in order to vary the data rate of the spatial information bitstream, considering that the masking threshold, and therefore the maximum number of bits (K_max) that can be allocated without audible distortion of the downmix signal, differs from block to block according to the characteristics of the downmix signal.
For example, if K_max is not enough to fully carry the spatial information bitstream required for a given block, data up to K_max is transmitted and the remaining data is then transmitted in another block.
If K_max is more than sufficient, the spatial information bitstream of the next block is preloaded.
In this case, each TS packet has an independent header. The header may include a sync word, TS packet length information, information on the number of subframes included in the TS packet, information on the insertion bit length allocated within the packet, and the like.
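A hypothetical sketch of splitting the spatial information bitstream into fixed-length packets, each with its own small header (sync word, payload length). The field sizes are assumptions; a real TS-style header carries more fields, as listed above.

```python
PACKET_SYNC = b"\x47"        # assumed 1-byte packet sync word
PACKET_SIZE = 64             # assumed fixed packet size in bytes (header included)

def packetize(bitstream: bytes):
    """Cut the spatial information bitstream into fixed-size packets."""
    payload_size = PACKET_SIZE - 3                   # 1 sync byte + 2 length bytes
    packets = []
    for i in range(0, len(bitstream), payload_size):
        chunk = bitstream[i:i + payload_size]
        header = PACKET_SYNC + len(chunk).to_bytes(2, "big")
        packets.append(header + chunk.ljust(payload_size, b"\x00"))
    return packets
```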
Fig. 14A is a diagram for explaining a first method for solving a temporal alignment problem of a spatial information bitstream embedded by inserting a frame unit.
Referring to fig. 14A, the length of the insertion frame is defined per frame or a predetermined length may be used.
The embedding method by the insertion frame unit may cause a problem of time alignment between the insertion frame start position of the embedded spatial information bitstream and the downmix signal frame. Therefore, a solution to the time alignment problem is needed.
In the first method shown in fig. 14A, a header 1402 (hereinafter referred to as 'decoding frame header') of a decoding frame 1403 of spatial information is separately placed.
Discriminating information indicating whether there is position information of an audio signal to which spatial information is to be applied may be included in the decoding frame header 1402.
For example, in the case of the TS packets 1404 and 1405, distinction information 1408 (e.g., flag) indicating whether or not the decoding frame header 1402 exists is contained in the TS packet header 1404.
If the discriminating information 1408 is 1, i.e., if the decoding frame header 1402 exists, discriminating information indicating whether or not there is position information of the downmix signal to which the spatial information bitstream will be applied can be extracted from the decoding frame header.
Then, position information 1409 (e.g., delay information) of the downmix signal to which the spatial information bitstream is to be applied can be extracted from the decoding frame header 1402 according to the extracted discriminating information.
If the distinguishing information 1411 is 0, the location information may not be included in the header of the TS packet.
In general, the spatial information bitstream 1403 preferably appears ahead of the corresponding downmix signal 1401. The position information 1409 may therefore be a delay value expressed in samples.
Meanwhile, to keep the amount of information required to represent this sample value from growing excessively when the delay is large, a sample group unit (e.g., a granularity unit) representing a group of samples is defined, and the position information can be represented in sample group units.
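For instance, under an assumed granularity of 32 samples, a delay of 2,048 samples would be signalled as 64 group units rather than as the raw sample count, which keeps the position field small:

```python
GRANULARITY = 32                              # assumed samples per group unit

delay_samples = 2048
delay_groups = delay_samples // GRANULARITY   # -> 64, the value actually transmitted
```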
As mentioned in the foregoing description, the TS sync word 1406, the insertion bit length 1407, the discriminating information indicating whether a decoding frame header exists, and the remaining information 140 may be included in the TS header.
Fig. 14B is a diagram for explaining a second method for solving a temporal alignment problem of a spatial information bitstream embedded by an insertion frame having a length defined by frames.
Referring to fig. 14B, in case of, for example, a TS packet, the second method is implemented in a manner of matching a start point 1413 of a decoding frame, a start point of the TS packet, and a start point of a corresponding downmix signal 1412.
For the matched portion, discriminating information 1420 or 1422 (e.g., a flag) indicating whether the three kinds of start points are aligned may be included in the header 1415 of the TS packet.
Fig. 14B shows that the three kinds of start points are matched at the nth frame 1412 of the downmix signal. In this case, the distinction information 1422 may have a value of 1.
If the three start points do not match, the distinguishing information 1420 may have a value of 0.
To match these three start points, a specific portion 1417 following the previous TS packet is padded with zeros, filled with a random signal, replaced with the original downmix signal, or padded with a combination thereof.
As mentioned in the foregoing description, the TS sync word 1418, the insertion bit length 1419, and the remaining information 1421 may be contained within the TS packet header 1415.
Fig. 15 is a diagram of a method of appending a spatial information bitstream to a downmix signal according to the present invention.
Referring to fig. 15, the length of a frame (hereinafter, referred to as 'additional frame') to which a spatial information bitstream is attached may be a length unit defined per frame or a predetermined length unit not defined per frame.
For example, as shown, the insertion frame length may be obtained by multiplying or dividing the decoding frame length 1504 of the spatial information by N (where N is a positive integer) or may have a fixed length unit.
If the decoding frame length 1504 differs from the insertion frame length, the spatial information bitstream need not be segmented into units of the decoding frame length 1504; instead, it may, for example, simply be cut to fit the insertion frame.
In this case, the spatial information bitstream is configured to be embedded in the downmix signal or may be configured to be attached to the downmix signal instead of being embedded in the downmix signal.
In a signal converted from an analog signal to a digital signal (hereinafter referred to as 'first audio signal') like a PCM signal, a spatial information bitstream may be configured to be embedded in the first audio signal.
In a further compressed digital signal (hereinafter referred to as 'second audio signal') like the MP3 signal, a spatial information bitstream can be configured to be appended to the second audio signal.
For example in case of using the second audio signal, the downmix signal is represented as a bitstream in a compressed format. Thus, as shown, the downmix signal bitstream 1502 exists in a compressed format and spatial information of a decoding frame length 1504 is appended to the downmix signal bitstream 1502.
Accordingly, the spatial information bitstream can be transmitted in bursts.
The header 1503 may be present in the decoded frame. And, position information of the downmix signal to which the spatial information is applied is included in the header 1503.
Meanwhile, the present invention includes a case where the spatial information bitstream is configured as an additional frame of a compressed format (e.g., the TS bitstream 1506) to be appended to the downmix signal bitstream 1502 of the compressed format.
In this case, there may be a TS header 1505 of the TS bitstream 1506. Also, at least one of additional frame sync information 1507, distinction information 1508 indicating whether a header of a decoded frame exists within the additional frame, information of the number of subframes included in the additional frame, and remaining information 1509 may be included in an additional frame header (e.g., TS header 1505). Also, the distinction information indicating whether the start point of the additional frame and the start point of the decoded frame are matched may also be included in the additional frame.
If the decoded frame header exists within the additional frame, discriminating information indicating whether there is position information of the downmix signal to which the spatial information is applied is extracted from the decoded frame header.
Then, position information of the downmix signal to which the spatial information is applied can be extracted according to the discriminating information.
Fig. 16 is a flowchart of a method of encoding a spatial information bitstream embedded in a downmix signal through insertion frames of various sizes according to the present invention.
Referring to fig. 16, audio signals are downmixed from a multi-channel audio signal (1601, 1602). In this case, the downmix signal may be a mono, stereo or multi-channel audio signal.
And, spatial information is extracted from the multi-channel audio signal (1601, 1603).
The extracted spatial information is then used to generate a spatial information bitstream (1604). The generated spatial information may be embedded in the downmix signal by an insertion frame unit having a length corresponding to an integer multiple of a decoding frame length of each frame.
If the decoding frame length (S) is greater than the insertion frame length (N) (1605), a plurality of insertion frames are concatenated so that together they correspond to one decoding frame (1607).
If the decoding frame length (S) is less than the insertion frame length (N) (1606), a plurality of decoding frames are concatenated so that together they correspond to one insertion frame (1608).
If the decoding frame length (S) is equal to the insertion frame length (N), the insertion frame length (N) is set equal to the decoding frame length (S) (1609).
The spatial information bitstream configured in the above-described manner is embedded into the downmix signal (1610).
Finally, the entire bitstream including the downmix signal in which the spatial information bitstream is embedded is transmitted (1611).
Further, in the present invention, information of an insertion frame length of a spatial information bitstream may be embedded in the entire bitstream.
Fig. 17 is a flowchart of a method of encoding a spatial information bitstream embedded in a downmix signal in a fixed length according to the present invention.
Referring to fig. 17, an audio signal is downmixed from a multi-channel audio signal (1701, 1702). In this case, the downmix signal may be a mono, stereo or multi-channel audio signal.
And, spatial information is extracted from the multi-channel audio signal (1701, 1703).
The extracted spatial information is then used to generate a spatial information bitstream (1704).
After the spatial information bitstream has been divided into fixed-length units (packet units), e.g., Transport Stream (TS) packets (1705), the fixed-length spatial information bitstream is embedded into the downmix signal (1706).
Next, the entire bitstream including the downmix signal in which the spatial information bitstream is embedded is transmitted (1707).
Further, in the present invention, an insertion bit length (i.e., K value) of an insertion area in which a spatial information bitstream is embedded is obtained using a downmix signal, and the spatial information bitstream can be embedded in the insertion area.
Fig. 18 is a diagram of a first method of embedding a spatial information bitstream in an audio signal downmixed onto at least one channel according to the present invention.
In case that the downmix signal is configured with at least one channel, the spatial information is considered as data common to the at least one channel. Therefore, a method of embedding spatial information by spreading the spatial information on the at least one channel is required.
Fig. 18 illustrates a method of embedding spatial information on one channel of a downmix signal having at least one channel.
Referring to fig. 18, spatial information is embedded in K bits of a downmix signal. In particular, spatial information is embedded in only one channel and not in the other channels. Also, the K value may be different for each block or channel.
As mentioned in the foregoing description, these bits corresponding to the K value may correspond to the lower bits of the downmix signal, but the present invention is not limited thereto. In this case, the spatial information bitstream may be inserted into one channel in a bit plane order starting from the LSB or in a sample plane order.
Fig. 19 is a diagram of a second method of embedding a spatial information bitstream in an audio signal downmixed onto at least one channel according to the present invention. Fig. 19 illustrates a downmix signal having two channels for convenience of explanation, but the present invention is not limited thereto.
Referring to fig. 19, the second method is implemented in a manner of sequentially embedding spatial information into a block-n of one channel (e.g., a left channel), a block-n of another channel (e.g., a right channel), a block- (n +1) of a previous channel (left channel), and the like. In this case, the synchronization information may be embedded in only one channel.
Although the spatial information bitstream may be embedded in the downmix signal per block, the spatial information bitstream can also be extracted block-wise or frame-wise in the decoding process.
Since the signal characteristics of the two channels of the downmix signal differ from each other, the K values can be assigned to the two channels by finding their respective masking thresholds separately. In particular, as shown, K1 and K2 are assigned to the two channels, respectively.
In this case, spatial information may be embedded in each channel in a bit plane order or a sample plane order starting from the LSB.
Fig. 20 is a diagram of a third method of embedding a spatial information bitstream in an audio signal downmixed onto at least one channel according to the present invention. Fig. 20 illustrates a downmix signal having two channels, but the present invention is not limited thereto.
Referring to fig. 20, the third method is implemented in a manner of embedding spatial information by spreading the spatial information on two channels. In particular, spatial information is embedded in such a manner that respective embedding orders alternate in sample units for two channels.
Since the signal characteristics of the two channels of the downmix signal differ from each other, the K value can be allocated differently to the two channels by finding the masking thresholds of the two channels individually. Specifically, K1 and K2 as shown are assigned to the two channels, respectively.
The K values of each block may be different from each other. For example, the spatial information is placed, in order, in the lower K1 bits of sample 1 of one channel (e.g., the left channel), the lower K2 bits of sample 1 of the other channel (e.g., the right channel), the lower K1 bits of sample 2 of the former channel (the left channel), the lower K2 bits of sample 2 of the latter channel (the right channel), and so on.
In the drawings, numerals in parentheses indicate the order of filling the spatial information bitstream. Although fig. 20 shows that the spatial information bitstream is filled from the MSB, the spatial information bitstream may be filled from the LSB.
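A sketch of this third method for a two-channel block, assuming per-channel insertion bit lengths K1 and K2 and filling from the MSB side of the lower bits as in the figure; the helper names are hypothetical.

```python
def embed_alternating(left, right, payload_bits, k1, k2):
    """Alternate between channels per sample: lower k1 bits of left[i],
    then lower k2 bits of right[i], then left[i+1], and so on."""
    bits = list(payload_bits)
    pos = 0

    def put(sample, k):
        nonlocal pos
        chunk = bits[pos:pos + k]
        chunk = chunk + [0] * (k - len(chunk))   # pad with zeros if payload runs out
        pos += k
        value = 0
        for b in chunk:
            value = (value << 1) | b
        return (sample & ~((1 << k) - 1)) | value

    out_l, out_r = [], []
    for l, r in zip(left, right):
        out_l.append(put(l, k1))
        out_r.append(put(r, k2))
    return out_l, out_r
```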
Fig. 21 is a diagram of a fourth method of embedding a spatial information bitstream in an audio signal downmixed onto at least one channel according to the present invention. Fig. 21 illustrates a downmix signal having two channels, but the present invention is not limited thereto.
Referring to fig. 21, the fourth method is implemented in a manner of embedding spatial information by scattering the spatial information onto at least one channel. Specifically, the spatial information is embedded in such a manner that the respective embedding order is alternated from LSB to LSB in bit plane units for two channels.
Since the signal characteristics of the two channels of the downmix signal are different from each other, the K values (K1 and K2) can be allocated differently to the two channels by separately finding the respective masking thresholds of the two channels. Specifically, K1 and K2, as shown, may be assigned to the two channels, respectively.
The K values of each block may be different from each other. For example, spatial information is placed one bit at a time, in the order of the least significant bit of sample-1 of one channel (e.g., the left channel), the least significant bit of sample-1 of the other channel (e.g., the right channel), the least significant bit of sample-2 of the former channel (e.g., the left channel), and the least significant bit of sample-2 of the latter channel (e.g., the right channel). In the drawings, numerals in the blocks indicate the order of filling the spatial information.
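For illustration only, the bit-interleaved order of the fourth method might be sketched as follows; the names and the handling of unequal K values are assumptions introduced for this sketch.

```python
# Minimal sketch of the fourth (bit-interleaved) embedding order: one bit into
# the LSB plane of sample-1 (left), one bit into the LSB plane of sample-1
# (right), then sample-2 (left), sample-2 (right), and so on, before moving
# to the next-higher plane up to each channel's K value.

def embed_bit_interleaved(left, right, bits, k1, k2):
    channels = [list(left), list(right)]
    ks = [k1, k2]
    it = iter(bits)
    for plane in range(max(k1, k2)):               # plane 0 = LSB
        for i in range(len(left)):
            for ch in (0, 1):                      # alternate channels per bit
                if plane >= ks[ch]:
                    continue                       # this channel stops at its own K
                b = next(it, None)
                if b is None:
                    return channels
                channels[ch][i] = (channels[ch][i] & ~(1 << plane)) | (b << plane)
    return channels
```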
In the case where an audio signal is stored in a storage medium having no auxiliary data area (e.g., a stereo CD) or transmitted through an SPDIF or the like, L/R channels are interleaved by sample unit. Thus, if the audio signal is stored by the third or fourth method, it is advantageous for the decoder to process the audio signal according to the received order.
And, the fourth method is applicable to a case where the spatial information bitstream is stored by being rearranged by a bit plane unit.
As mentioned in the foregoing description, in the case where a spatial information bitstream is embedded by being spread over two channels, the K values can be allocated differently to the respective channels. In this case, the K value may be transmitted separately for each channel within the bitstream. When multiple K values are transmitted, differential encoding may be applied to the K values.
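For illustration only, that differential encoding might be sketched as follows; the field layout is an assumption introduced for this sketch, and only the basic idea (sending the first K as-is and later K values as differences) is shown.

```python
# Minimal sketch of differentially coding a list of per-channel/per-block K
# values: the first value is kept, later ones are replaced by signed
# differences from the previous value.

def diff_encode_k(k_values):
    deltas = [k_values[0]]
    for prev, cur in zip(k_values, k_values[1:]):
        deltas.append(cur - prev)                  # usually a small signed value
    return deltas

def diff_decode_k(deltas):
    k_values = [deltas[0]]
    for d in deltas[1:]:
        k_values.append(k_values[-1] + d)
    return k_values

# Example: K values (4, 4, 3, 5) are coded as (4, 0, -1, 2).
assert diff_decode_k(diff_encode_k([4, 4, 3, 5])) == [4, 4, 3, 5]
```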
Fig. 22 is a diagram of a fifth method of embedding a spatial information bitstream in an audio signal downmixed onto at least one channel according to the present invention. Fig. 22 illustrates a downmix signal having two channels, but the present invention is not limited thereto.
Referring to fig. 22, the fifth method is implemented in a manner of embedding spatial information by spreading the spatial information on two channels. Specifically, the fifth method is implemented in such a manner that the same value is repeatedly inserted in each of the two channels.
In this case, a value having the same sign may be inserted into each of the at least two channels, or values having different signs may be inserted into the at least two channels, respectively.
For example, a value of 1 is inserted into each of the two channels or values of 1 and-1 are alternately inserted into the two channels, respectively.
The fifth method has the advantage of facilitating the checking of transmission errors by comparing the inserted least significant bits (e.g., the K bits) across the channels.
In particular, when a mono audio signal is carried on a stereo medium such as a CD, the channel-L (left channel) and the channel-R (right channel) of the downmix signal are identical to each other, so robustness and the like can be improved by inserting identical spatial information into both channels. In this case, spatial information is embedded in each channel in a bit plane order starting from the LSB or in a sample plane order.
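For illustration only, the redundant insertion of the fifth method and the corresponding error check might be sketched as follows; the polarity convention (bit inversion standing in for the 1/-1 example) and all names are assumptions introduced for this sketch.

```python
# Minimal sketch of the fifth (redundant) insertion: the same K inserted bits
# are written to both channels, optionally with inverted polarity on the right
# channel, so the decoder can flag a transmission error when the copies differ.

def embed_redundant(left, right, bits, k, invert_right=False):
    out_l, out_r = list(left), list(right)
    it = iter(bits)
    for i in range(len(out_l)):
        for plane in range(k - 1, -1, -1):
            b = next(it, None)
            if b is None:
                return out_l, out_r
            out_l[i] = (out_l[i] & ~(1 << plane)) | (b << plane)
            rb = b ^ 1 if invert_right else b      # optionally flip polarity
            out_r[i] = (out_r[i] & ~(1 << plane)) | (rb << plane)
    return out_l, out_r

def check_redundant(left, right, k, invert_right=False):
    """Return True if the inserted lower-k bits of both channels are consistent."""
    mask = (1 << k) - 1
    for l, r in zip(left, right):
        expected = (~r & mask) if invert_right else (r & mask)
        if (l & mask) != expected:
            return False
    return True
```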
Fig. 23 is a diagram of a sixth method of embedding a spatial information bitstream in an audio signal downmixed onto at least one channel according to the present invention.
The sixth method relates to inserting spatial information into a downmix signal having at least one channel in the case where a frame of each channel includes a plurality of blocks (each of length B).
Referring to fig. 23, the insertion bit length (i.e., the K value) may be different for each channel and block, or may be the same for all channels and blocks.
The insertion bit lengths (e.g., K1, K2, K3, and K4) may be stored in a frame header transmitted once per frame. Also, the frame header may be located in the LSB. In this case, the header may be inserted in bit plane units, and the spatial information data may be alternately inserted in sample units or block units. In fig. 23, the number of blocks in a frame is 2, so the block length (B) is N/2. In this case, the number of bits inserted into the frame is (K1 + K2 + K3 + K4) × B.
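For illustration only, the per-frame bookkeeping of the sixth method might be sketched as follows; N, the header field width, and all names are assumptions introduced for this sketch.

```python
# Minimal sketch of the sixth method's capacity calculation: a frame of N
# samples per channel is split into blocks of length B, and one insertion
# length per (channel, block) is carried once in a frame header.

def frame_capacity(k_values, block_len):
    """Total payload bits in one frame, e.g. (K1 + K2 + K3 + K4) * B."""
    return sum(k_values) * block_len

def build_frame_header(k_values, bits_per_k=4):
    """Pack the insertion lengths into a header bit list (for the LSB plane)."""
    header = []
    for k in k_values:
        header += [(k >> b) & 1 for b in range(bits_per_k)]
    return header

# Example in the spirit of fig. 23: 2 channels x 2 blocks, N = 1024, so B = 512.
ks = [4, 3, 4, 2]                                  # K1..K4, illustrative values
assert frame_capacity(ks, 512) == (4 + 3 + 4 + 2) * 512
```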
Fig. 24 is a diagram of a seventh method of embedding a spatial information bitstream in an audio signal downmixed onto at least one channel according to the present invention. Fig. 24 illustrates a downmix signal having two channels, but the present invention is not limited thereto.
Referring to fig. 24, the seventh method is implemented in a manner of embedding spatial information by spreading the spatial information over the two channels. Specifically, the seventh method combines a method of alternately inserting spatial information into the two channels in bit plane order starting from the LSB or MSB with a method of alternately inserting spatial information into the two channels in sample plane order.
The method may be performed in frame units or may be performed in block units.
The shaded portions 1 to C shown in fig. 24 correspond to the header and may be inserted at the LSB or MSB in bit plane order to facilitate searching for the inserted frame sync word.
The non-shaded portions (C+1 and above) correspond to the part other than the header and may be alternately inserted into the two channels in sample units so that the spatial information data can be extracted. The insertion bit length (e.g., the K value) may be the same or different for each channel and block. Also, all insertion bit lengths may be included in the header.
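For illustration only, the mixed ordering of the seventh method might be sketched as follows; the placement of the header in the LSB plane of the first C samples and every name here are assumptions introduced for this sketch.

```python
# Minimal sketch of the seventh (mixed-order) method: the header occupies the
# LSB bit plane of the first C samples of both channels (bit plane order, so a
# frame sync word is easy to locate), and the remaining payload is written in
# sample plane order, alternating between the two channels.

def embed_mixed_order(left, right, header_bits, payload_bits, k1, k2, c):
    channels = [list(left), list(right)]
    ks = [k1, k2]
    hdr = iter(header_bits)
    # 1) Header: LSB plane of samples 0..C-1, alternating channels per bit.
    for i in range(c):
        for ch in (0, 1):
            b = next(hdr, 0)
            channels[ch][i] = (channels[ch][i] & ~1) | b
    # 2) Payload: sample plane order over the remaining lower-K bits, alternating
    #    left/right samples; the LSB of the first C samples is skipped because it
    #    already holds the header.
    pay = iter(payload_bits)
    for i in range(len(left)):
        for ch in (0, 1):
            lowest = 1 if i < c else 0
            for plane in range(ks[ch] - 1, lowest - 1, -1):
                b = next(pay, None)
                if b is None:
                    return channels
                channels[ch][i] = (channels[ch][i] & ~(1 << plane)) | (b << plane)
    return channels
```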
Fig. 25 is a flowchart of a method of encoding spatial information to be embedded in a downmix signal having at least one channel according to the present invention.
Referring to fig. 25, an audio signal is downmixed from a multi-channel audio signal into one channel (2501, 2502). Also, spatial information is extracted from the multi-channel audio signal (2501, 2503).
The extracted spatial information is then used to generate a spatial information bitstream (2504).
The spatial information bitstream is embedded in a downmix signal having at least one channel (2505). In this case, one of seven methods of embedding a spatial information bitstream in at least one channel may be used.
Next, the entire stream including the downmix signal in which the spatial information bitstream is embedded is transmitted (2506). In this case, the present invention finds K values using a downmix signal and embeds a spatial information bitstream in K bits.
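For illustration only, the relation between the size of the spatial information bitstream and the insertion length K might be sketched as follows; a real encoder would additionally bound K by the masking threshold, which is not modeled here, and the names are assumptions introduced for this sketch.

```python
import math

# Minimal sketch: the smallest uniform K that lets a spatial-information
# bitstream of num_bits fit into one frame of frame_len samples per channel.

def minimum_k(num_bits, frame_len, num_channels=2):
    capacity_per_plane = frame_len * num_channels  # bits gained per extra bit plane
    return math.ceil(num_bits / capacity_per_plane)

# Example: a 3000-bit spatial frame over 1024 stereo samples needs K = 2.
assert minimum_k(3000, 1024, 2) == 2
```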
Fig. 26 is a flowchart of a method of decoding a spatial information bitstream embedded in a downmix signal having at least one channel according to the present invention.
Referring to fig. 26, the spatial decoder receives a bitstream including a downmix signal in which a spatial information bitstream is embedded (2601).
A downmix signal is detected from a received bit stream (2602).
A spatial information bitstream embedded in a downmix signal having at least one channel is extracted and decoded according to the received bitstream (2603).
Next, the downmix signal is converted into a multi-channel signal using spatial information obtained by decoding (2604).
The present invention extracts the difference information of the order in which the spatial information bitstream is embedded and can extract and decode the spatial information bitstream using the difference information.
In addition, the present invention extracts information of a K value from a spatial information bitstream and can decode the spatial information bitstream using the K value.
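For illustration only, the extraction side might be sketched as follows, assuming the sample-interleaved order sketched for the third method and known K values; parsing the recovered bits into actual spatial parameters is not shown.

```python
# Minimal sketch of step 2603: read back the embedded bits from the lower K1/K2
# bits of the received PCM downmix in the same sample-interleaved order used by
# the embed_sample_interleaved sketch above.

def extract_sample_interleaved(left, right, k1, k2):
    ks = (k1, k2)
    bits = []
    for i in range(len(left)):
        for ch, samples in enumerate((left, right)):
            for plane in range(ks[ch] - 1, -1, -1):     # MSB of the lower-K bits first
                bits.append((samples[i] >> plane) & 1)
    return bits

# Round-trip check against the embedder sketched earlier.
payload = [1, 0, 1, 1, 0, 0, 1, 0]
l, r = embed_sample_interleaved([0, 0], [0, 0], payload, 2, 2)
assert extract_sample_interleaved(l, r, 2, 2)[:len(payload)] == payload
```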
Industrial applications
Therefore, the present invention provides the following effects or advantages.
First, in encoding a multi-channel audio signal according to the present invention, spatial information is embedded in a downmix signal. Accordingly, a multi-channel audio signal can be stored in/reproduced from a storage medium (e.g., a stereo CD) having no auxiliary data area or an audio format having no auxiliary data area.
Second, spatial information may be embedded in the downmix signal in various frame lengths or a fixed frame length. And, spatial information may be embedded in a downmix signal having at least one channel. Therefore, the present invention improves encoding and decoding efficiency.
While the invention has been illustrated and described herein in connection with the preferred embodiments thereof, it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention. Accordingly, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
Claims (6)
1. A method of decoding an audio signal, comprising:
receiving a downmix signal and an insertion frame embedded in the downmix signal, the insertion frame comprising a header and frame data, the frame data comprising a spatial information bitstream;
extracting, from the header, an insertion frame length of the insertion frame and information of a plurality of subframes included in the insertion frame;
extracting the frame data using the insertion frame length of the insertion frame and information of the plurality of sub-frames;
extracting spatial information by decoding a spatial information bitstream included in the frame data; and
generating a multi-channel audio signal by applying the spatial information to the downmix signal,
wherein,
the downmix signal is Pulse Code Modulated (PCM) data, comprising n bit PCM samples,
the inserted frame corresponds to the lower k bits of the n bits,
the header is embedded in the least significant bits of the samples, and
the spatial information is sequentially embedded starting from the most significant bit of the lower k bits of each sample,
wherein the audio signal includes at least two channels, the spatial information being alternately embedded in the channels in sample units by distributing the spatial information to the at least two channels.
2. The method of claim 1, wherein the spatial information bitstream includes position information of the downmix signal to which spatial information is applied.
3. An apparatus for decoding an audio signal, comprising:
means for receiving a downmix signal and an insertion frame embedded in the downmix signal, the insertion frame comprising a header and frame data, the frame data comprising a spatial information bitstream;
means for extracting, from the header, an insertion frame length of the insertion frame and information of a plurality of subframes included in the insertion frame;
means for extracting the frame data using the insertion frame length of the insertion frame and information of the plurality of sub-frames;
means for extracting spatial information by decoding a spatial information bitstream included in the frame data; and
means for generating a multi-channel audio signal by applying the spatial information to the downmix signal,
wherein,
the downmix signal is Pulse Code Modulated (PCM) data, comprising n bit PCM samples,
the inserted frame corresponds to the lower k bits of the n bits,
the header is embedded in the least significant bits of the samples, and
the spatial information is sequentially embedded starting from the most significant bit of the lower k bits of each sample,
wherein the audio signal includes at least two channels, and the spatial information is alternately embedded in the channels in sample units by spreading the spatial information to the at least two channels.
4. The apparatus of claim 3, wherein the spatial information bitstream includes position information of the downmix signal to which the spatial information is applied.
5. A method of encoding an audio signal, comprising:
receiving a spatial information bitstream determined when downmixing a multi-channel audio signal;
generating an insertion frame length of an insertion frame and information of a plurality of subframes included in the insertion frame;
generating a downmix signal and an insertion frame embedded in the downmix signal, wherein the insertion frame includes a header including information of the insertion frame length and the plurality of sub-frames of the insertion frame and frame data including a spatial information bitstream;
wherein,
the downmix signal is Pulse Code Modulated (PCM) data, comprising n bit PCM samples,
the inserted frame corresponds to the lower k bits of the n bits,
the header is embedded in the least significant bits of the samples, and
the spatial information is sequentially embedded starting from the most significant bit of the lower k bits of each sample,
wherein the audio signal includes at least two channels, and the spatial information is alternately embedded in the channels in sample units by spreading the spatial information to the at least two channels.
6. An apparatus for encoding an audio signal, comprising:
means for receiving a spatial information bitstream determined when downmixing a multi-channel audio signal;
means for generating an insertion frame length of an insertion frame and information of a plurality of subframes included in the insertion frame;
means for generating a downmix signal and an insertion frame embedded in the downmix signal, wherein the insertion frame includes a header including information of the insertion frame length and the plurality of sub-frames of the insertion frame and frame data including a spatial information bitstream;
wherein,
the downmix signal is Pulse Code Modulated (PCM) data, comprising n bit PCM samples,
the inserted frame corresponds to the lower k bits of the n bits,
the header is embedded in the least significant bits of the samples, and
the spatial information is sequentially embedded starting from the most significant bit of the lower k bits of each sample,
wherein the audio signal includes at least two channels, and the spatial information is alternately embedded in the channels in sample units by spreading the spatial information to the at least two channels.
Applications Claiming Priority (19)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US68457805P | 2005-05-26 | 2005-05-26 | |
US60/684,578 | 2005-05-26 | ||
US75860806P | 2006-01-13 | 2006-01-13 | |
US60/758,608 | 2006-01-13 | ||
US78717206P | 2006-03-30 | 2006-03-30 | |
US60/787,172 | 2006-03-30 | ||
KR1020060030661A KR20060122694A (en) | 2005-05-26 | 2006-04-04 | Method of inserting spatial bitstream in at least two channel down-mix audio signal |
KR1020060030658A KR20060122692A (en) | 2005-05-26 | 2006-04-04 | Method of encoding and decoding down-mix audio signal embeded with spatial bitstream |
KR10-2006-0030660 | 2006-04-04 | ||
KR1020060030661 | 2006-04-04 | ||
KR10-2006-0030658 | 2006-04-04 | ||
KR1020060030660 | 2006-04-04 | ||
KR1020060030660A KR20060122693A (en) | 2005-05-26 | 2006-04-04 | Modulation for insertion length of saptial bitstream into down-mix audio signal |
KR10-2006-0030661 | 2006-04-04 | ||
KR1020060030658 | 2006-04-04 | ||
KR1020060046972 | 2006-05-25 | ||
KR10-2006-0046972 | 2006-05-25 | ||
KR1020060046972A KR20060122734A (en) | 2005-05-26 | 2006-05-25 | Encoding and decoding method of audio signal with selectable transmission method of spatial bitstream |
PCT/KR2006/002021 WO2006126859A2 (en) | 2005-05-26 | 2006-05-26 | Method of encoding and decoding an audio signal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101258538A CN101258538A (en) | 2008-09-03 |
CN101258538B true CN101258538B (en) | 2013-06-12 |
Family
ID=39406062
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2006800263104A Active CN101253550B (en) | 2005-05-26 | 2006-05-26 | Method of encoding and decoding an audio signal |
CN2006800263119A Active CN101258538B (en) | 2005-05-26 | 2006-05-26 | Method of encoding and decoding an audio signal |
CN2006800263123A Active CN101223579B (en) | 2005-05-26 | 2006-05-26 | Method of encoding and decoding an audio signal |
CN200680018078XA Active CN101180674B (en) | 2005-05-26 | 2006-05-26 | Method of encoding and decoding an audio signal |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2006800263104A Active CN101253550B (en) | 2005-05-26 | 2006-05-26 | Method of encoding and decoding an audio signal |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2006800263123A Active CN101223579B (en) | 2005-05-26 | 2006-05-26 | Method of encoding and decoding an audio signal |
CN200680018078XA Active CN101180674B (en) | 2005-05-26 | 2006-05-26 | Method of encoding and decoding an audio signal |
Country Status (1)
Country | Link |
---|---|
CN (4) | CN101253550B (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
ATE538469T1 (en) | 2008-07-01 | 2012-01-15 | Nokia Corp | APPARATUS AND METHOD FOR ADJUSTING SPATIAL INFORMATION IN A MULTI-CHANNEL AUDIO SIGNAL |
WO2010008200A2 (en) | 2008-07-15 | 2010-01-21 | Lg Electronics Inc. | A method and an apparatus for processing an audio signal |
JP5258967B2 (en) | 2008-07-15 | 2013-08-07 | エルジー エレクトロニクス インコーポレイティド | Audio signal processing method and apparatus |
CN101662688B (en) * | 2008-08-13 | 2012-10-03 | 韩国电子通信研究院 | Method and device for encoding and decoding audio signal |
CN101340191B (en) * | 2008-08-19 | 2013-07-31 | 无锡中星微电子有限公司 | Decoder and decoding method |
US9514768B2 (en) | 2010-08-06 | 2016-12-06 | Samsung Electronics Co., Ltd. | Audio reproducing method, audio reproducing apparatus therefor, and information storage medium |
KR20150032649A (en) * | 2012-07-02 | 2015-03-27 | 소니 주식회사 | Decoding device and method, encoding device and method, and program |
TWI517142B (en) | 2012-07-02 | 2016-01-11 | Sony Corp | Audio decoding apparatus and method, audio coding apparatus and method, and program |
AU2013284705B2 (en) | 2012-07-02 | 2018-11-29 | Sony Corporation | Decoding device and method, encoding device and method, and program |
CN104488026A (en) * | 2012-07-12 | 2015-04-01 | 杜比实验室特许公司 | Embedding data in stereo audio using saturation parameter modulation |
US9445197B2 (en) * | 2013-05-07 | 2016-09-13 | Bose Corporation | Signal processing for a headrest-based audio system |
EP3005353B1 (en) | 2013-05-24 | 2017-08-16 | Dolby International AB | Efficient coding of audio scenes comprising audio objects |
GB2515539A (en) | 2013-06-27 | 2014-12-31 | Samsung Electronics Co Ltd | Data structure for physical layer encapsulation |
US10149086B2 (en) * | 2014-03-28 | 2018-12-04 | Samsung Electronics Co., Ltd. | Method and apparatus for rendering acoustic signal, and computer-readable recording medium |
CN106716525B (en) * | 2014-09-25 | 2020-10-23 | 杜比实验室特许公司 | Sound object insertion in a downmix audio signal |
KR20220066996A (en) * | 2014-10-01 | 2022-05-24 | 돌비 인터네셔널 에이비 | Audio encoder and decoder |
WO2016204581A1 (en) * | 2015-06-17 | 2016-12-22 | 삼성전자 주식회사 | Method and device for processing internal channels for low complexity format conversion |
CN107782977A (en) * | 2017-08-31 | 2018-03-09 | 苏州知声声学科技有限公司 | Multiple usb data capture card input signal Time delay measurement devices and measuring method |
JP7463278B2 (en) * | 2018-08-30 | 2024-04-08 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ | Three-dimensional data encoding method, three-dimensional data decoding method, three-dimensional data encoding device, and three-dimensional data decoding device |
CN109785849B (en) * | 2019-01-17 | 2020-11-27 | 福建歌航电子信息科技有限公司 | Method for inserting unidirectional control information into pcm audio stream based on iis transmission |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1297560A (en) * | 1999-03-19 | 2001-05-30 | 索尼公司 | Additional information embedding method and its device, and additional information decoding method and its decoding device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
NL8901032A (en) * | 1988-11-10 | 1990-06-01 | Philips Nv | CODER FOR INCLUDING ADDITIONAL INFORMATION IN A DIGITAL AUDIO SIGNAL WITH A PREFERRED FORMAT, A DECODER FOR DERIVING THIS ADDITIONAL INFORMATION FROM THIS DIGITAL SIGNAL, AN APPARATUS FOR RECORDING A DIGITAL SIGNAL ON A CODE OF RECORD. OBTAINED A RECORD CARRIER WITH THIS DEVICE. |
DD289172A5 (en) * | 1988-11-29 | 1991-04-18 | N. V. Philips' Gloeilampenfabrieken,Nl | ARRANGEMENT FOR THE PROCESSING OF INFORMATION AND RECORDING RECEIVED BY THIS ARRANGEMENT |
NL9000338A (en) * | 1989-06-02 | 1991-01-02 | Koninkl Philips Electronics Nv | DIGITAL TRANSMISSION SYSTEM, TRANSMITTER AND RECEIVER FOR USE IN THE TRANSMISSION SYSTEM AND RECORD CARRIED OUT WITH THE TRANSMITTER IN THE FORM OF A RECORDING DEVICE. |
CA2323561C (en) * | 1999-01-13 | 2013-03-26 | Koninklijke Philips Electronics N.V. | Embedding supplemental data in an encoded signal |
WO2003034627A1 (en) * | 2001-10-17 | 2003-04-24 | Koninklijke Philips Electronics N.V. | System for encoding auxiliary information within a signal |
-
2006
- 2006-05-26 CN CN2006800263104A patent/CN101253550B/en active Active
- 2006-05-26 CN CN2006800263119A patent/CN101258538B/en active Active
- 2006-05-26 CN CN2006800263123A patent/CN101223579B/en active Active
- 2006-05-26 CN CN200680018078XA patent/CN101180674B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1297560A (en) * | 1999-03-19 | 2001-05-30 | 索尼公司 | Additional information embedding method and its device, and additional information decoding method and its decoding device |
Non-Patent Citations (2)
Title |
---|
Jim Chou et.al.AUDIO DATA HIDING WITH APPLICATION TO SURROUND SOUND.《IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS,SPEECH, AND SIGNAL PROCESSING》.2003,第2卷 * |
Shintaro Hosoi et.al.AUDIO CODING USING THE BEST LEVEL WAVELET PACKET TRANSFORM AND AUDITORY MASKING.《INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS》.1998,第2卷全文. * |
Also Published As
Publication number | Publication date |
---|---|
CN101180674B (en) | 2012-01-04 |
CN101180674A (en) | 2008-05-14 |
CN101253550B (en) | 2013-03-27 |
CN101223579A (en) | 2008-07-16 |
CN101223579B (en) | 2013-02-06 |
CN101253550A (en) | 2008-08-27 |
CN101258538A (en) | 2008-09-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101258538B (en) | Method of encoding and decoding an audio signal | |
US8150701B2 (en) | Method and apparatus for embedding spatial information and reproducing embedded signal for an audio signal | |
KR100717598B1 (en) | Frequency-based coding of audio channels in parametric multi-channel coding systems | |
US11200906B2 (en) | Audio encoding method, to which BRIR/RIR parameterization is applied, and method and device for reproducing audio by using parameterized BRIR/RIR information | |
KR101837084B1 (en) | Method for signal processing, encoding apparatus thereof, decoding apparatus thereof, and information storage medium | |
JP2009514008A5 (en) | ||
US20080288263A1 (en) | Method and Apparatus for Encoding/Decoding | |
EP1949369A1 (en) | Method and apparatus for encoding/decoding audio data and extension data | |
KR20060122692A (en) | Method of encoding and decoding down-mix audio signal embeded with spatial bitstream | |
TWI501220B (en) | Embedding and extracting ancillary data | |
AU2168901A (en) | Embedding a first digital information signal into a second digital information signal for transmission via transmission medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |