KR20140128565A - Apparatus and method for audio signal processing - Google Patents
- Publication number
- KR20140128565A (application number KR20130047059A)
- Authority
- KR
- South Korea
- Prior art keywords
- signal
- channel
- speaker
- information
- speakers
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S5/00—Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
- H04S5/005—Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation of the pseudo five- or more-channel type, e.g. virtual surround
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Stereophonic System (AREA)
Abstract
According to an aspect of the present invention, there is provided a method of processing an audio signal, comprising: obtaining position information of each of a plurality of composite speakers; obtaining a crossover frequency of the composite speakers; receiving a bit string including an object signal and object position information; decoding the object signal and the object position information from the received bit string; calculating energy distribution information of the decoded object signal; selecting two or more of the plurality of composite speakers using the decoded object position information; generating output gain values for the selected composite speakers using the crossover frequency and the energy distribution of the object signal; and rendering the decoded object signal into a plurality of channel signals using the generated gain values.
Description
The present invention relates to a method and apparatus for processing an object audio signal, and more particularly, to a method and apparatus for encoding and decoding an object audio signal or rendering the object audio signal in a three-dimensional space.
3D audio refers collectively to the signal processing, transmission, encoding, and reproduction technologies that provide immersive sound in three-dimensional space by adding a height dimension to the horizontal (2D) sound scene of conventional surround audio. In particular, providing 3D audio requires rendering techniques that form a sound image at a virtual position where no speaker exists, whether a larger or smaller number of speakers is available.
3D audio is expected to become the audio solution for future ultra-high-definition TVs (UHDTV) and to be used in a wide range of applications: theater sound, personal 3DTV, tablets, games, and vehicle sound as cars evolve into high-quality infotainment spaces.
3D audio requires transmitting signals of far more channels, up to 22.2, than conventional compression and transmission techniques support. Conventional high-quality codecs such as MP3, AAC, DTS, and AC3 are optimized for transmitting no more than 5.1 channels.
In addition, reproducing a 22.2-channel signal requires a listening space fitted with a 24-speaker system, an infrastructure that will not spread quickly in the market. Therefore the following techniques are needed: reproducing a 22.2-channel signal effectively in a space with fewer speakers; conversely, reproducing a conventional stereo or 5.1-channel sound source over a larger number of speakers, e.g. in a 10.1-channel or 22.2-channel environment; providing the sound scene of the original source in a listening environment and speaker positions different from those intended; and enjoying 3D sound in a headphone listening environment. Herein these techniques are collectively referred to as rendering, and in detail as downmix, upmix, flexible rendering, binaural rendering, and the like.
On the other hand, an object-based signal transmission scheme is needed as an alternative for efficiently transmitting such sound scenes. Depending on the sound source, transmitting on an object basis can be more advantageous than channel-based transmission. In addition, object-based transmission allows the user to control the playback level and position of each object arbitrarily, enabling interactive consumption of the content. Accordingly, an effective transmission method that compresses object signals at a high rate is needed.
In addition, a sound source in which channel-based and object-based signals are mixed may exist, providing a new type of listening experience. Accordingly, a technique is needed for effectively transmitting a channel signal and an object signal together and effectively rendering them.
Finally, depending on the particular channel configuration and the speaker environment at the playback end, exception channels that are difficult to reproduce in the conventional manner may occur. In this case, a technique is needed for effectively reproducing such exception channels based on the speaker environment at the playback end.
According to an aspect of the present invention, there is provided a method of reproducing an audio signal including an object signal using composite speakers, the method comprising: receiving a crossover frequency of the composite speakers; receiving position information of the composite speakers; receiving a bit string including an object signal and object position information; decoding the object signal and the object position information from the received bit string; calculating an energy distribution of the decoded object signal; selecting two or more composite speakers using the decoded object position information; generating output gain values for the selected composite speakers using the crossover frequency and the energy distribution of the object signal; and rendering the decoded object signal as channel signals using the generated gain values.
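The speaker-selection and gain-generation steps above can be sketched as follows. This is an illustrative reading, not the patent's reference implementation: the speaker layout, the constant-power panning law, and the band-split definition of "energy distribution" are assumptions.

```python
import math

def select_and_pan(obj_az, speakers):
    """Select the two composite speakers nearest the object azimuth and
    return constant-power panning gains between them. `speakers` is a
    list of (name, azimuth_deg) pairs; names and layout are illustrative."""
    (n1, a1), (n2, a2) = sorted(speakers, key=lambda s: abs(s[1] - obj_az))[:2]
    frac = (obj_az - a1) / ((a2 - a1) or 1.0)
    frac = min(max(frac, 0.0), 1.0)
    return {n1: math.cos(frac * math.pi / 2), n2: math.sin(frac * math.pi / 2)}

def band_energy_split(freqs, energies, crossover_hz):
    """Share of object-signal energy below and above the crossover
    frequency of the composite speaker -- one plausible form of the
    'energy distribution' used when generating the output gain values."""
    low = sum(e for f, e in zip(freqs, energies) if f < crossover_hz)
    high = sum(e for f, e in zip(freqs, energies) if f >= crossover_hz)
    total = low + high
    return (low / total, high / total) if total else (0.5, 0.5)
```

An object panned halfway between two speakers receives equal gains whose squares sum to one, preserving total power.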
According to the present invention, a virtual speaker is generated using two different composite speakers. Such a virtual speaker can localize a sound image more precisely than existing sound-image localization methods. The effects of the present invention are not limited to those mentioned above; effects not mentioned will be clearly understood by those skilled in the art from this specification and the accompanying drawings.
FIG. 1 illustrates viewing angles according to image size at the same viewing distance.
FIG. 2 shows a speaker arrangement of 22.2 channels.
FIG. 3 is a conceptual diagram showing the position of each sound object in the listening space in which the listener listens to 3D audio.
FIG. 4 is an exemplary configuration diagram of object signal groups formed from the objects shown in FIG. 3 using the grouping method according to the present invention.
FIG. 5 is a block diagram of an embodiment of an object audio signal encoder according to the present invention.
FIG. 6 is an exemplary configuration diagram of a decoding apparatus according to an embodiment of the present invention.
FIG. 7 illustrates an embodiment of a bit string generated by the encoding method according to the present invention.
FIG. 8 is a block diagram of an object and channel signal decoding system according to an embodiment of the present invention.
FIG. 9 is a block diagram of another object and channel signal decoding system according to the present invention.
FIG. 10 is a block diagram of an embodiment of a decoding system according to the present invention.
FIG. 11 illustrates masking thresholds for a plurality of object signals according to the present invention.
FIG. 12 is a block diagram of an embodiment of an encoder that calculates a masking threshold for a plurality of object signals according to the present invention.
FIG. 13 illustrates the arrangement according to the ITU-R recommendation for a 5.1-channel setup and a case in which the speakers are arranged at arbitrary positions.
FIG. 14 is a block diagram of an embodiment in which a decoder for an object bit string and a flexible rendering system using the decoder are connected.
FIG. 15 is a block diagram of another embodiment implementing decoding and rendering of an object bit string according to the present invention.
FIG. 16 shows a structure for determining and transmitting a transmission plan between a decoder and a renderer.
FIG. 17 is a conceptual diagram of reproducing, using surrounding channels, the front speakers that are absent because of the display in a 22.2-channel system.
FIG. 18 is a flowchart of a method of processing a sound source according to an embodiment of the present invention.
FIG. 19 illustrates an example of mapping a signal generated in each band to speakers disposed around the TV.
FIG. 20 shows an embodiment of the frequency response of a composite speaker.
FIG. 21 illustrates an example of generating a virtual speaker.
FIG. 22 shows a structure for generating a virtual speaker.
FIG. 23 shows the relationship among products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented.
Both the foregoing general description and the following detailed description are exemplary and explanatory; they illustrate the present invention and do not limit its scope, which should be interpreted to include modifications and variations that do not depart from the spirit of the invention. The terms and accompanying drawings used herein serve to facilitate understanding of the invention, and shapes shown in the drawings may be exaggerated for clarity where necessary; the invention is not limited by these terms and drawings. In the following description, detailed descriptions of known functions and configurations are omitted where they would obscure the subject matter of the invention. In the present invention, the following terms may be interpreted according to the criteria below, and terms not described may be construed accordingly: coding may be interpreted as encoding or decoding as the occasion demands, and information is a term encompassing values, parameters, coefficients, elements, and the like, though the present invention is not limited thereto.
According to an aspect of the present invention, there is provided a method of processing an audio signal, comprising: receiving a bit string including an object signal and object position information; decoding the object signal and the object position information from the received bit string; retrieving past object position information from a storage medium; generating an object movement path using the retrieved past object position information and the decoded object position information; generating a time-varying gain value using the generated movement path; generating a modified variable gain value using the generated variable gain value and a weighting function; and generating a channel signal from the decoded object signal using the modified variable gain value.
In addition, the weighting function may be changed based on a physiological characteristic of the user.
In addition, the physiological characteristic may be extracted from a still image or a video.
In addition, the physiological characteristic may include at least one of information on the shape of the user's head, the user's body, and the shape of the outer ear.
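The movement-path and weighting steps of the claim above can be sketched as follows. The straight-line path and the one-pole smoother standing in for the listener-dependent weighting function are both assumptions; the patent only states that a path is generated and a weighting function applied.

```python
def movement_path(past_pos, new_pos, n_frames):
    """Linear movement path (n_frames >= 2) from the stored past position
    to the newly decoded position; positions are (x, y, z) tuples."""
    return [tuple(p + (n - p) * t / (n_frames - 1) for p, n in zip(past_pos, new_pos))
            for t in range(n_frames)]

def apply_weighting(variable_gains, alpha):
    """Modify the per-frame variable gain with a weighting function; a
    one-pole smoother with coefficient alpha in (0, 1] stands in for the
    physiologically derived weighting described in the text."""
    out, prev = [], variable_gains[0]
    for g in variable_gains:
        prev = alpha * g + (1.0 - alpha) * prev
        out.append(prev)
    return out
```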
Hereinafter, a method and apparatus for processing an object audio signal according to an embodiment of the present invention will be described.
FIG. 1 illustrates viewing angles according to image size (e.g., UHDTV and HDTV) at the same viewing distance. As display technology has developed, image sizes have grown in accordance with consumer demand. As shown in FIG. 1, a UHDTV (7680*4320 pixel image, 110) is about 16 times larger than an HDTV (1920*1080 pixel image, 120). If an HDTV is installed on the living-room wall and the viewer sits on the sofa at a certain viewing distance, the viewing angle may be about 30 degrees. When a UHDTV is installed at the same viewing distance, however, the viewing angle reaches about 100 degrees. When such a large, high-resolution screen is installed, it is desirable to provide sound with a sense of presence and impact to match the content. Since one or two surround-channel speakers may not suffice to give the viewer nearly the same experience as being at the scene, a multi-channel audio environment with a larger number of speakers and channels may be required.
In addition to the home theater environment, multi-channel audio may also be applied to personal 3DTVs, smartphone TVs, 22.2-channel audio programs, vehicles, 3D video, telepresence rooms, cloud-based gaming, and the like.
FIG. 2 shows a speaker arrangement of 22.2 channels as an example of a multi-channel environment. 22.2 channels is one example of a multi-channel environment for enhancing the sound field; the present invention is not limited to a specific number of channels or a specific speaker arrangement. Referring to FIG. 2, a total of nine channels may be provided in the top layer.
Transmitting and reproducing multi-channel signals of up to several tens of channels in this manner may require a large amount of computation, and a high compression ratio may be required in view of the communication environment. Moreover, most households do not have a multi-channel (e.g., 22.2-channel) speaker environment; many listeners have a 2-channel or 5.1-channel setup. If the multi-channel signal must be converted to 2-channel and 5.1-channel versions and transmitted separately, communication becomes inefficient, and storing the 22.2-channel PCM signal makes memory management inefficient as well.
FIG. 3 is a conceptual diagram showing the position of each sound object in the listening space in which the listener listens to 3D audio.
FIG. 4 shows an example of object signal groups formed from the objects shown in FIG. 3 using the grouping method according to the present invention.
A first method of forming a group is to group nearby objects in consideration of the position of each object on the sound scene.
FIG. 5 is a block diagram of an embodiment of an object audio signal encoder according to the present invention, including an object grouping unit (550).
FIG. 6 is a block diagram illustrating an embodiment of decoding a coded and transmitted signal. In the decoding process, a plurality of downmix signals are decoded and the object signals of each group are restored from them.
If the transmitted bit string includes a global gain and an object group gain, these can be applied to restore the original level of each object signal. These gain values can also be controlled in the rendering or transcoding process: the level of the entire signal is controlled through the global gain, and the level of each group through its object group gain. For example, when object grouping is performed on the basis of the playback speakers, the object group gain can simply be adjusted to implement the flexible rendering described later.
At this time, although a plurality of parameter encoders or decoders are illustrated as being processed in parallel for convenience of explanation, it is also possible to sequentially perform encoding or decoding of a plurality of object groups through one system.
Another method of forming an object group is to group objects having low mutual correlation into one group. This reflects a characteristic of parametric encoding: objects with high correlation are difficult to separate from the downmix. At this time, parameters such as the downmix gain may be adjusted during downmixing so that the grouped objects are further separated from one another, and the parameters used are preferably transmitted so that they can be used for signal restoration at decoding.
Another method of forming an object group is to group objects having a high degree of correlation into one group. Although objects with high correlation are difficult to separate using parameters, grouping them increases compression efficiency in applications where separate rendering of those objects is not needed. Since a core codec spends many bits on complex signals with diverse spectra, encoding highly correlated objects with a single core codec is efficient.
Another method of forming an object group is to judge whether objects mask one another before encoding. For example, when object A masks object B and the two signals are placed in one downmix encoded by the core codec, object B may effectively be omitted in the encoding process; obtaining object B at the decoding end using the parameters then yields large distortion. Objects A and B in such a relationship are therefore preferably placed in separate downmixes. Conversely, when objects A and B are in a masking relationship but need not be rendered separately, or when the masked object at least needs no separate processing, it is efficient to place them in the same downmix. The selection method may thus differ by application. For example, if a particular object is masked, or at least negligible, in the desired sound scene during encoding, it may be excluded from the object list, or the two objects may be combined and represented as a single object.
Another method of forming an object group is to separate objects that are not point sources, such as plane-wave sources or ambient sources, and group them separately. Because their characteristics differ from those of point sources, such sources need different compression methods and parameters, so it is preferable to group them apart.
The object information decoded for each group is restored to the original objects through object degrouping, by referring to the transmitted grouping information.
FIG. 7 shows an example of a bit string generated by the encoding method according to the present invention. Referring to FIG. 7, the main bit string (700), in which the encoded channel or object data is transmitted, is arranged in channel-group or object-group order.
(A method of allocating bits for each object group)
When generating a downmix for each of a plurality of groups and performing independent parametric object coding per group, the number of bits used by each group may differ. Criteria for allocating bits per group may include the number of objects in the group, the number of effective objects considering the masking effect among them, a weight according to position considering human spatial resolution, the sound-pressure levels of the objects, and the importance of the objects in the sound scene. For example, given three spatial object groups A, B, and C, bits may be allocated in proportion to the number of effective object signals each group contains.
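The proportional allocation just described can be sketched as follows. The weight definition is left abstract on purpose: the patent lists several candidate criteria (effective object count, positional weight, sound-pressure level, importance) without fixing one, so the weights here are illustrative inputs.

```python
def allocate_group_bits(total_bits, group_weights):
    """Divide the frame bit budget across object groups in proportion to a
    per-group weight (e.g. number of effective objects after masking,
    positional weight, sound-pressure level, or importance)."""
    total_w = sum(group_weights.values())
    alloc = {g: int(total_bits * w / total_w) for g, w in group_weights.items()}
    # hand rounding leftovers to the most heavily weighted group
    heaviest = max(group_weights, key=group_weights.get)
    alloc[heaviest] += total_bits - sum(alloc.values())
    return alloc
```

With groups A, B, C weighted 3:2:1, a 600-bit budget splits as 300/200/100, and the budget is always spent exactly.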
(Main object and sub object position information encoding in object group)
On the other hand, for object information it is desirable to have a means of transmitting, as metadata, the position and level of each object, either as recommended by the producer's intent or as mix information proposed by another user. In the present invention this is referred to as preset information for convenience. When position information is carried by a preset, particularly for a dynamic object whose position changes over time, the amount of information to transmit is considerable. For example, transmitting per-frame position information for 1000 objects amounts to a very large quantity of data. The object position information should therefore be transmitted efficiently, and the present invention uses an efficient position-encoding method based on the definitions of main object and sub-object.
The main object is an object whose position is expressed as an absolute coordinate value in three-dimensional space. A sub-object is an object whose position in three-dimensional space is expressed relative to the main object, so a sub-object must know which main object it corresponds to. When grouping is performed based on position in space in particular, one object in each group can serve as the main object and the rest as its sub-objects. If there is no grouping for encoding, or if it is not advantageous to encode sub-object position information within the encoding groups, a separate set may be formed for position-information encoding. The objects belonging to such a group or set are preferably located within a limited spatial range, so that expressing sub-object positions relative to the main object is more advantageous than absolute values.
Another position-information encoding method according to the present invention expresses the object position relative to fixed speaker positions rather than relative to the main object. For example, the relative position information of an object is expressed based on the designated position values of the 22-channel speaker layout. The number of reference speakers and their position values can be set based on the values used in the current content.
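The main/sub-object scheme above amounts to delta-coding positions within a group, which can be sketched as follows. Treating the first object of the group as the main object is an assumption for illustration.

```python
def encode_group_positions(positions):
    """Main/sub-object position coding: the first object of the group is
    taken as the main object with absolute (x, y, z) coordinates; the
    remaining sub-objects carry only their offsets from it."""
    main = positions[0]
    offsets = [tuple(p - m for p, m in zip(pos, main)) for pos in positions[1:]]
    return main, offsets

def decode_group_positions(main, offsets):
    """Inverse operation: restore absolute positions from the main object
    position and the sub-object offsets."""
    return [main] + [tuple(m + d for m, d in zip(main, off)) for off in offsets]
```

Because grouped objects lie in a limited spatial range, the offsets are small numbers, which quantize to fewer bits than full absolute coordinates.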
In another embodiment of the present invention, quantization is performed after the position information is expressed as an absolute or relative value, and the quantization step is varied based on the absolute position. For example, since a listener's ability to discriminate position is known to be significantly higher toward the front than toward the sides or rear, it is desirable to set the quantization step so that the frontal resolution is higher than the lateral resolution. Likewise, since human resolution for azimuth is higher than for elevation, it is preferable to quantize the azimuth angle more finely than the elevation angle.
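A position-dependent quantizer of this kind can be sketched as below. The 2/5/10-degree steps and the 30/90-degree region boundaries are illustrative values, not taken from the patent.

```python
def quantize_azimuth(azimuth_deg):
    """Quantize azimuth with a position-dependent step: fine in front,
    where listeners discriminate direction best, coarser toward the
    sides and rear. Step sizes and region boundaries are illustrative."""
    a = abs(azimuth_deg)
    step = 2.0 if a <= 30.0 else 5.0 if a <= 90.0 else 10.0
    return round(azimuth_deg / step) * step
```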
According to another embodiment of the present invention, for a dynamic object whose position varies over time, the position may be expressed relative to the object's previous position value instead of relative to the main object or another reference point. Flag information distinguishing whether the dynamic object's position is referenced temporally (to its preceding position) or spatially (to a neighboring reference point) is therefore preferably transmitted together.
(Decoder overall architecture)
FIG. 8 is a block diagram of an object and channel signal decoding system according to the present invention. The system may receive an object signal, a channel signal, or a combination of the two.
FIG. 9 is a block diagram of another object and channel signal decoding system according to the present invention. Similarly, this system may receive an object signal, a channel signal, or a combination of the two.
(Individual for channel, parameter combination, residual)
FIG. 10 illustrates a configuration of an encoder and a decoder according to another embodiment of the present invention.
FIG. 10 shows a structure for scalable coding when the speaker setup of the decoder is different.
The encoder includes a downmixing unit.
Here, the upmix parameter UP is a parameter required to upmix one or more channels into two or more channels, and may include a spatial parameter and an inter-channel phase difference (IPD).
The upmix residual signal UR corresponds to the residual, i.e., the difference between the original signal CH_N and the recovered signal, where the recovered signal is obtained by applying the upmix parameter UP to the downmix DMX.
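The residual relationship can be sketched numerically as below. Modelling the upmix parameter as a single per-channel gain is a simplification; in practice UP carries spatial parameters and IPD, but the residual identity (upmix plus UR restores the original) holds the same way.

```python
def upmix_residual(original, downmix, upmix_gain):
    """Upmix residual UR: per-sample difference between the original
    channel signal CH_N and the signal recovered by applying the upmix
    parameter (modelled here as a single gain) to the downmix DMX."""
    return [o - upmix_gain * d for o, d in zip(original, downmix)]

def recover_original(downmix, upmix_gain, residual):
    """Decoder side: applying the upmix and adding the residual restores
    the original channel signal exactly."""
    return [upmix_gain * d + r for d, r in zip(downmix, residual)]
```

This is why a decoder that parses UR on top of UP can scale up from parametric quality to a lossless-style reconstruction of CH_N.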
However, according to another embodiment of the present invention, this disadvantage can be overcome by selectively providing one (or more) of the first to third decoders according to the setup environment of each device.
Each device selectively includes one or more of the first to third decoders and selectively parses the upmix parameter UP and the upmix residual UR from the bit string, generating a signal suited to its own speaker setup directly; this reduces complexity and the amount of computation.
(Object Waveform Coding Considering Masking)
Waveform encoding of an object according to the present invention (hereinafter, waveform coding; also called discrete encoding/decoding, as a concept opposed to parametric encoding/decoding) refers to encoding a channel or object audio signal so that each channel or object can be decoded independently, with bits allocated in consideration of the object's position on the sound scene. This is based on the psychoacoustic binaural masking level difference (BMLD) phenomenon and on the features of object signal coding.
To explain the BMLD phenomenon, consider the MS (mid/side) stereo coding used in conventional audio coding. Psychoacoustic masking is possible only when the masker and the maskee are spatially in the same direction. When the two channels of a stereo signal are highly correlated and equal in level, the sound image forms at the center between the two speakers; with no correlation, independent images form at each speaker. If such a maximally correlated input is coded dual-mono, each channel's quantization noise is uncorrelated with the other channel's, so the audio image sits at the center while the quantization noise localizes separately at each speaker. The noise that should be masked then fails to be masked because of this spatial mismatch, and is heard by the listener as distortion. Sum-difference coding solves this by forming a mid signal (the sum of the two channels) and a side signal (their difference), running the psychoacoustic model on these signals, and quantizing them, so that the quantization noise is positioned at the same place as the sound image.
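The sum-difference transform itself is simple and exactly invertible, which is why the psychoacoustic model and quantizer can operate entirely in the mid/side domain:

```python
def ms_encode(left, right):
    """Mid/side (sum/difference) coding: the psychoacoustic model and
    quantization are applied to the mid (sum) and side (difference)
    signals, keeping quantization noise spatially aligned with the image."""
    mid = [(l + r) * 0.5 for l, r in zip(left, right)]
    side = [(l - r) * 0.5 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse transform: reconstruct the left/right channels."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

For a fully correlated (identical) pair of channels, the side signal is zero, so all noise shaped onto the mid signal localizes at the center, exactly where the sound image is.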
In conventional channel coding, each channel is mapped to a speaker, and since the speaker positions are fixed and separated from one another, masking between channels cannot be exploited. However, when each object is encoded independently, whether an object is masked may change according to its location on the sound scene. It is therefore desirable to determine whether the object currently being encoded is masked by another object, and to allocate bits accordingly.
FIG. 11 shows masking thresholds (1130) for a plurality of object signals.
Another masking-threshold calculation method according to the present invention applies when the positions of two object signals do not coincide within the audible-angle criterion: instead of simply adding the masking thresholds of the two objects, the contribution of the other object is attenuated according to the spatial separation between them. With M1(f) the masking threshold of object 1 and M2(f) that of object 2, the final joint masking thresholds M1'(f) and M2'(f) used to encode each object are generated using an attenuation factor A(f).
Here, A(f) is an attenuation factor determined by the spatial positions of and distance between the two objects and by their properties, with 0.0 ≤ A(f) ≤ 1.0.
Human directional resolution deteriorates from the front toward the sides and is worst toward the rear; therefore the absolute position of an object can serve as another factor determining A(f).
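One plausible reading of the combination rule is sketched below: each object's joint threshold is its own threshold plus the other object's threshold scaled by A(f). The additive form is an assumption; the patent specifies only that the other object's contribution is attenuated by A(f) in [0, 1], shrinking as the objects separate in space.

```python
def joint_masking_thresholds(m1, m2, attenuation):
    """Joint masking thresholds M1'(f), M2'(f) per frequency bin: each
    object's own threshold plus the other object's threshold attenuated
    by A(f) in [0, 1]. The additive combination is an assumption."""
    assert all(0.0 <= a <= 1.0 for a in attenuation)
    m1p = [s + a * o for s, o, a in zip(m1, m2, attenuation)]
    m2p = [s + a * o for s, o, a in zip(m2, m1, attenuation)]
    return m1p, m2p
```

With A(f) = 1 this reduces to plain threshold addition (coincident objects); with A(f) = 0 each object keeps only its own threshold (fully separated objects).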
According to another embodiment of the present invention, one of the two objects uses only its own masking threshold, while the threshold of the other object is obtained taking the first object into account. The former is called an independent object and the latter a dependent object. Since the independent object is encoded at high quality regardless of the other object, its sound quality can be preserved even when it is spatially separated from that object.
Whether each object is an independent or a dependent object is preferably transmitted as per-object additional information and delivered to the renderer.
In another embodiment according to the present invention, when two objects are spatially close to each other, the signals themselves may be combined into one object instead of merely generating joint masking thresholds.
In another embodiment according to the present invention, particularly when parametric coding is performed, it is preferable to combine the two signals into one object in consideration of both their correlation and their spatial positions.
(Transcoding feature)
In another embodiment according to the present invention, when transcoding a bit string containing coupled objects, particularly to a lower bit rate, and the number of objects is reduced to shrink the data size, it is preferable to downmix the coupled objects and represent them as a single object.
In the description of encoding through inter-object coupling, only the coupling of two objects has been described for convenience, but coupling of more than two objects can be implemented in a similar manner.
(Flexible rendering required)
Among the technologies required for 3D audio, flexible rendering is one of the key problems to be solved to maximize 3D audio quality. It is well known that the positions of 5.1-channel speakers vary greatly with the structure of the living room and the arrangement of furniture. Even with speakers at such irregular positions, a content producer should still be able to deliver the intended sound scene. To do so, the renderer must know the speaker environment of each user's reproduction setup, and a rendering technique is needed to compensate for the differences from the standard layout. In other words, the codec's role does not end with decoding the transmitted bit string according to the decoding method; a series of techniques is required to optimize and transform the result for the user's reproduction environment.
FIG. 13 shows a 5.1-channel layout arranged according to the ITU-R recommendation (gray, 1310) and a case in which the speakers are placed at arbitrary positions (red, 1320). In an actual living-room environment, both the azimuth angles and the distances may differ from the ITU-R recommendation (and, although not shown, the speakers' heights may differ as well). It is difficult to provide the ideal 3D sound scene when the original channel signals are reproduced from such displaced speaker positions.
(Flexible rendering)
Flexible rendering can be implemented relatively easily for object signals transmitted on an object-by-object basis, using amplitude panning, which positions a sound source between two speakers by controlling the relative signal magnitudes, or Vector-Based Amplitude Panning (VBAP), which is widely used to position a sound source in three-dimensional space using three speakers. This is one of the advantages of transmitting object signals instead of channel signals.
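The amplitude-panning and VBAP techniques referred to above can be sketched as follows. This is a minimal illustrative implementation, not the method of any particular embodiment; the function names and the stereo half-angle default are assumptions.

```python
import numpy as np

def stereo_pan_gains(azimuth_deg, speaker_half_angle_deg=30.0):
    """Tangent-law amplitude panning between two speakers at +/- the half angle.
    Valid for |azimuth_deg| <= speaker_half_angle_deg."""
    phi = np.radians(azimuth_deg)
    phi0 = np.radians(speaker_half_angle_deg)
    ratio = np.tan(phi) / np.tan(phi0)      # (gL - gR) / (gL + gR)
    gl = (1.0 + ratio) / 2.0
    gr = (1.0 - ratio) / 2.0
    norm = np.hypot(gl, gr)                 # constant-power normalization
    return gl / norm, gr / norm

def vbap_gains_2d(source_dir, spk_dir_a, spk_dir_b):
    """2-D VBAP: express the source direction as a nonnegative combination
    of two speaker direction vectors, then normalize for constant power."""
    L = np.column_stack([spk_dir_a, spk_dir_b]).astype(float)
    g = np.linalg.solve(L, np.asarray(source_dir, dtype=float))
    g = np.clip(g, 0.0, None)               # negative gain => source outside the pair
    return g / np.linalg.norm(g)
```

A centered source yields equal constant-power gains on both speakers; the 3-D case extends this to a 3x3 system over speaker triplets.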
(Object Decoding and Rendering Structure)
FIG. 14 shows the structures (1400 and 1401) of two embodiments according to the present invention in which a decoder for an object bit stream and a flexible rendering system using the decoded objects are connected to each other. As described above, an object has the advantage that it can be positioned as a sound source at any location in accordance with the desired sound scene.
FIG. 15 shows the structure of still another embodiment implementing decoding and rendering of an object bit stream according to the present invention. Compared with the case of FIG. 14, the rendering can be implemented more directly.
(Flexible rendering with channel attached)
On the other hand, when a channel signal is transmitted as an input and the position of the speaker corresponding to that channel is changed to an arbitrary position, the panning method used for objects is difficult to apply, and a separate channel-mapping process is required. The problem is that the processes and solutions required for rendering object signals and channel signals differ from each other; therefore, when object signals and channel signals are transmitted simultaneously and a sound scene mixing the two is desired, spatial distortion caused by the mismatch is likely to occur. To solve this problem, according to another embodiment of the present invention, flexible rendering is not performed separately on the object; instead, the object is first mixed into the channel signals, and flexible rendering is then performed on the resulting channel signals. Rendering using HRTFs is preferably implemented in the same manner.
(Decoded downmix: parameter transmission or automatic generation)
In the case of downmix rendering, that is, reproducing multi-channel content through a smaller number of output channels, it has so far been common to implement it with an M-by-N downmix matrix (where M is the number of input channels and N is the number of output channels). That is, when 5.1-channel content is reproduced in stereo, the downmix is performed by a given equation. However, such a downmix implementation has a computational problem: all bit strings corresponding to the 22.2 transmitted channels must be decoded even though the user's playback environment is only 5.1 channels. If all 22.2 channel signals must be decoded merely to generate a stereo signal for reproduction on a portable device, the computational burden is very high, and a huge amount of memory is wasted (on storing the 22.2-channel decoded audio signals).
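The M-by-N downmix described above can be sketched as follows. The coefficients shown are the common ITU-style 1/√2 values and are illustrative only; all names are hypothetical and not taken from any embodiment.

```python
import numpy as np

# Illustrative 5.1 -> stereo downmix matrix. Rows: Lo, Ro outputs;
# columns: FL, FR, FC, LFE, SL, SR inputs. The 1/sqrt(2) center/surround
# coefficients follow the common ITU-style downmix equation; real systems
# may use content- or codec-specific coefficients.
c = 1.0 / np.sqrt(2.0)
DOWNMIX_5_1_TO_STEREO = np.array([
    [1.0, 0.0, c, 0.0, c,   0.0],   # Lo = FL + 0.707*FC + 0.707*SL
    [0.0, 1.0, c, 0.0, 0.0, c  ],   # Ro = FR + 0.707*FC + 0.707*SR
])

def downmix(channels, matrix):
    """Apply an N x M downmix matrix to an M x T block of channel samples."""
    channels = np.asarray(channels, dtype=float)
    return matrix @ channels            # result: N x T output channels
```

Note that this straightforward approach still requires all M input channels to be fully decoded first, which is exactly the computational problem the passage above describes.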
(Transcoding as an alternative to downmix)
As an alternative, a method of converting the huge 22.2-channel original bit stream into a bit stream suitable for the target device or target playback space through effective transcoding can be considered. For example, if the 22.2-channel content is stored on a cloud server, a scenario can be implemented in which reproduction-environment information is received from the client terminal and the content is converted accordingly before transmission.
(Decoding order or downmix order; order control unit)
On the other hand, in a scenario in which the decoder and the renderer are separated, there may be a case where, for example, 50 object signals must be decoded together with a 22.2-channel audio signal and transmitted to the renderer. Since the decoded audio is a high-data-rate signal, a very large bandwidth is required between the decoder and the renderer. It is therefore undesirable to transmit such a large amount of data all at once, and an effective transmission plan should be established; it is preferable that the decoder determine the decoding order and transmit the signals accordingly. FIG. 16 is a block diagram showing a structure for determining a transmission plan between the decoder and the renderer and transmitting data accordingly.
(Voice highway)
On the other hand, an object corresponding to voice for bidirectional communication may be included in the bit stream. Since bidirectional communication, unlike other content, is very sensitive to time delay, the corresponding object or channel signal should be transmitted to the renderer first when received. The corresponding object or channel signal can be indicated by a separate flag or the like. Such a priority-transmitted object is, unlike the other objects/channels, handled independently of the other object/channel signals contained in the same frame and of their presentation time.
(AV Matching and Phantom Center)
One of the new problems arising when considering ultra-high-definition TV (UHDTV) is the so-called near-field situation. That is, considering the viewing distance of a typical user environment (a living room), the distance from each reproduction speaker to the listener is shorter than the distance between the speakers, so that each speaker operates as a point source. Moreover, in the absence of a speaker at the center of a large screen, the spatial resolution of the sound objects synchronized to the video must be very high to enable a high-quality 3D audio service.
With a conventional audiovisual angle of about 30 degrees, the stereo speakers arranged on the left and right are not in a near-field situation and are sufficient to provide a sound scene matching the movement of an object on the screen (for example, a vehicle moving from left to right). In the UHDTV environment, however, where the audiovisual angle reaches 100 degrees, not only left-right resolution but also additional resolution over the top and bottom of the screen is required. For example, if there are two characters on the screen, on a current HDTV it would not seriously harm the sense of realism even if both voices appeared to come from the same position; at UHDTV size, however, such a mismatch will be perceived as a new form of distortion.
One solution to this problem is the 22.2-channel speaker configuration. FIG. 2 is an example of a 22.2-channel arrangement. According to FIG. 2, a total of eleven loudspeakers are arranged at the front to increase the front's horizontal and vertical spatial resolution: five speakers are placed in the middle layer of the front, and three speakers each were added in the upper and lower layers so that sound elevation can be sufficiently accommodated. With such an arrangement, the spatial resolution of the front is higher than in the past, which is advantageous for matching with the video signal. However, in current TVs using display devices such as LCDs and OLEDs, the display occupies the very positions where speakers should be present. That is, unless the display itself emits sound or is made of a sound-transmitting material, the speakers must be placed outside the display area while still providing sound matched to each object position on the screen. In FIG. 2, the speakers corresponding to FLc, FC, and FRc are disposed at positions overlapping the display.
FIG. 17 is a conceptual diagram illustrating how, in the 22.2-channel system, the front speakers that are absent because of the display are reproduced using neighboring channels. To cover the FLc, FC, and FRc components, additional speakers may also be placed around the upper and lower edges of the display, as indicated by the dotted circles. According to FIG. 17, there may be seven neighboring channels that can be used to generate FLc. Using these seven speakers, a sound corresponding to the position of the absent speaker can be reproduced on the principle of generating a virtual source.
Techniques and properties such as VBAP or the Haas effect (precedence effect) can be used to create virtual sources with the neighboring speakers. Alternatively, different panning techniques may be applied depending on the frequency band. Further, changing the azimuth and adjusting the perceived height using HRTFs can be considered. For example, when replacing FC with BtFC, this can be implemented by applying an HRTF with a rising-elevation characteristic to the FC channel signal and adding it to BtFC. The observation from HRTFs is that, to control the perceived height of a sound, the position of a specific null in the high-frequency band (which varies from person to person) must be controlled. To generalize over this person-to-person variation of the null, however, height adjustment can instead be implemented by broadly boosting or attenuating the high-frequency band. A disadvantage of this method is that the filter distorts the signal.
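The broadband high-frequency boost/cut just described for height adjustment can be sketched, under simplifying assumptions, as a crude FFT-domain shelf. A real implementation would use a proper shelving filter; the function name, shelf frequency, and gain defaults are illustrative only.

```python
import numpy as np

def adjust_height_cue(signal, sample_rate, shelf_hz=4000.0, gain_db=3.0):
    """Broadly boost (or cut, with negative gain_db) everything above shelf_hz
    to nudge the perceived elevation of a channel signal. Brickwall FFT sketch;
    this is exactly the kind of wide-band manipulation that introduces the
    filter distortion mentioned in the text."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    gain = np.ones_like(freqs)
    gain[freqs >= shelf_hz] = 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spectrum * gain, n=len(signal))
```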
FIG. 18 shows the processing method according to the present invention for positioning a sound source at the location of such an absent (phantom) speaker. Referring to FIG. 18, the channel signal corresponding to the phantom speaker position is taken as the input signal, and the input signal is divided into three bands by passing through a subband filter unit.
The second band is to be reproduced through speakers around the phantom speaker position (speakers disposed in the bezel of the TV display and its surroundings); it is divided between at least two such speakers and reproduced, and a time delay filter may be applied so that a panning or precedence (Haas) effect can be exploited.
The third band may be used to generate a signal to be reproduced through a speaker array, if one is present, and a speaker array control unit may process the signal for the array.
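The three-band division described in connection with FIG. 18 can be sketched as follows. Brickwall FFT masks stand in here for the subband filter unit, and the crossover frequencies shown are illustrative assumptions, not values from any embodiment.

```python
import numpy as np

def split_three_bands(signal, sample_rate, f_low=200.0, f_high=4000.0):
    """Split a phantom-channel signal into three bands with brickwall FFT masks.
    Band 1 (< f_low) would go to a woofer/LFE path, band 2 (f_low..f_high) is
    panned to the speakers around the phantom position, and band 3 (> f_high)
    may feed a speaker array. A real system would use proper crossover filters."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    masks = [freqs < f_low,
             (freqs >= f_low) & (freqs < f_high),
             freqs >= f_high]
    # The masks partition the spectrum, so the three bands sum back to the input.
    return [np.fft.irfft(spectrum * m, n=len(signal)) for m in masks]
```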
FIG. 19 shows an embodiment in which the signal generated in each band is mapped to speakers disposed around the TV. According to FIG. 19, the speakers corresponding to the second and third bands must be located at relatively accurately defined positions, and their number and position information is preferably provided to the processing system of FIG. 18.
(Composite speaker)
A composite speaker divides the audio frequency range into at least two bands and has at least two kinds of driver units, each suited to its band. A composite speaker divided into N bands is also called an N-way speaker. The boundary frequency, or cutoff frequency, between bands is referred to as the crossover frequency. FIG. 20 shows an example of the frequency responses of the two drivers constituting a two-way speaker.
FIG. 21 is a schematic view showing at least two composite speakers having different azimuth angles and different heights. For convenience, two-way speakers are used in the illustration, but the present invention is applicable to any N-way speaker.
The detent effect refers to a phenomenon in which, when a sound image is positioned across a plurality of speakers, the image is not localized at the desired position but is pulled toward the nearest speaker. This causes the sound image to jump from one speaker to another instead of moving continuously, as if moving in discrete steps. When the sound image is positioned using the conventional amplitude panning method, it is therefore not accurately localized, owing to the detent effect. However, if a virtual speaker is created and used for playback, the sound image can be correctly positioned between the speakers.
FIG. 22 is a diagram of a method for creating such a virtual speaker. Using the crossover frequency of the composite loudspeaker and the input signal, the energy ratio of the two (or more) bands on either side of the crossover frequency is obtained, and the gains of the output channels can be calculated from this energy ratio. Since the position of the virtual speaker may vary with the physiological characteristics of the user, when the user's physiological characteristics are given as an input, the gain calculation unit may take them into account in computing the gain values.
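The band-energy-ratio computation described above can be sketched as follows. The constant-power mapping from energy ratio to output gain is an assumption chosen for illustration, not necessarily the mapping used by the gain calculation unit, and all names are hypothetical.

```python
import numpy as np

def band_energy_ratio(signal, sample_rate, crossover_hz):
    """Fraction of the input signal's energy below and above the crossover."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    e_low = spectrum[freqs < crossover_hz].sum()
    e_high = spectrum[freqs >= crossover_hz].sum()
    total = e_low + e_high
    return e_low / total, e_high / total

def composite_pair_gains(signal, sample_rate, crossover_hz):
    """Map the band-energy split to output gains for the two drivers that
    bracket the virtual-speaker region (constant-power: gains' squares sum to 1)."""
    r_low, r_high = band_energy_ratio(signal, sample_rate, crossover_hz)
    return np.sqrt(r_low), np.sqrt(r_high)
```

For a signal whose energy lies mostly below the crossover, nearly all of the output gain is assigned to the low-band driver, which is the behavior the virtual-speaker construction relies on.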
FIG. 23 is a diagram illustrating the relationship among products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented. Referring to FIG. 23, the wired/wireless communication unit 310 receives a bit stream through a wired or wireless communication scheme, and may include at least one of a wired communication unit 310A, an infrared communication unit 310B, a Bluetooth unit 310C, and a wireless LAN communication unit 310D.
The user authentication unit 320 receives user information and performs user authentication. It may include at least one of a fingerprint recognition unit 320A, an iris recognition unit 320B, a face recognition unit 320C, and a voice recognition unit 320D; these receive fingerprint, iris, facial-contour, and voice information, respectively, convert it into user information, and perform user authentication by determining whether the user information matches previously registered user data.
The audio signal processing method according to the present invention may be implemented as a program to be executed by a computer and stored in a computer-readable recording medium, and multimedia data having the data structure according to the present invention may likewise be stored in a computer-readable recording medium. The computer-readable recording medium includes all kinds of storage devices in which data readable by a computer system is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and the medium may also be implemented in the form of a carrier wave (for example, transmission via the Internet). In addition, the bit stream generated by the encoding method may be stored in a computer-readable recording medium or transmitted over a wired/wireless communication network.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. It will be understood that various modifications and changes may be made without departing from the scope of the appended claims.
210: Upper layer
220: middle layer
230: bottom layer
240: LFE channel
310: wired / wireless communication device
310A: Wired communication device
310B: Infrared device
310C: Bluetooth device
310D: Wireless LAN device
320: User authentication device
320A: Fingerprint Recognition
320B: iris recognition device
320C: Facial recognition system
320D: Speech Recognition Device
330: input device
330A: Keypad
330B: Touchpad
330C: Remote control unit
340: Signal encoding device
345: audio signal processing device
350: Control device
360: Output device
360A: Speaker
360B: Display
410: first object signal group
420: second object signal group
430: Listener
520: Downmixer &
550: Object grouping unit
560: Waveform coder
620: Waveform Decoder
630: Upmixer &
670: Object degrouping unit
860: 3DA decoding unit
870:
1110: Masking threshold curve of
1120: masking threshold curve of
1130: masking threshold curve of
1230: Psychoacoustic model
1310: 5.1 channel loudspeaker placed according to ITU-R Recommendation
1320: 5.1-channel loudspeakers placed at arbitrary positions
1610: 3D audio decoder
1620: 3D Audio Renderer
1630:
1810: Subband filter section
1820: Time delay filter unit
1830: Panning Algorithm
1840: speaker array control unit
2010: Crossover frequency
2020: Frequency response of woofer speaker
2030: Frequency response of tweeter speaker
2010: Woofer speaker of lower speaker
2020: Tweeter speaker of lower speaker
2030: Lower speaker
2040: Upper speaker
2050: Upper speaker's woofer speaker
2060: presence area of virtual speaker
2070: Virtual Speaker
2080: gain calculator
Claims (5)
Obtaining position information of each of the plurality of composite speakers;
Obtaining a crossover frequency of the composite speaker;
Receiving a bit string including an object signal and object position information;
Decoding the object signal and the object location information from the received bit stream;
Calculating energy distribution information of the decoded object signal;
Selecting two or more composite loudspeakers among the plurality of composite loudspeakers using the decoded object position information;
Generating an output gain value of the selected two or more composite speakers using the crossover frequency and the energy distribution of the object signal;
And rendering the decoded object signal into a plurality of channel signals using the generated gain value.
Wherein the generating of the gain value comprises calculating the gain value using a physiological characteristic of the user.
Wherein the physiological characteristic is extracted using a still image or a video.
Wherein the physiological characteristic includes at least one of information on the shape of the user's head, the user's body, and the shape of the user's external auditory canal.
And the energy distribution information of the object signal includes an energy value of the object signal.
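Taken together, the claimed steps can be sketched end to end as follows. This is a hypothetical illustration only: the speaker-selection rule (the two speakers nearest the object direction), the constant-power gain mapping, and all names are assumptions not specified by the claims.

```python
import numpy as np

def render_object(obj_signal, obj_pos, speakers, sample_rate):
    """Sketch of the claimed method. `speakers` is a list of dicts with
    'pos' (direction vector) and 'crossover_hz'; steps 1-4 of the claim
    (position/crossover acquisition and bit stream decoding) are assumed done."""
    # Step 6: select the two composite speakers closest to the object direction.
    dirs = np.array([s["pos"] for s in speakers], dtype=float)
    order = np.argsort(dirs @ np.asarray(obj_pos, dtype=float))[::-1]
    sel = order[:2]
    # Steps 5 and 7: energy distribution of the object signal around the
    # crossover frequency, mapped to constant-power output gains.
    fc = float(np.mean([speakers[i]["crossover_hz"] for i in sel]))
    spec = np.abs(np.fft.rfft(obj_signal)) ** 2
    freqs = np.fft.rfftfreq(len(obj_signal), d=1.0 / sample_rate)
    e_low = spec[freqs < fc].sum()
    e_high = spec[freqs >= fc].sum()
    gains = np.sqrt(np.array([e_low, e_high]) / (e_low + e_high))
    # Step 8: render the object into per-speaker channel signals.
    out = {i: np.zeros(len(obj_signal)) for i in range(len(speakers))}
    for gain, i in zip(gains, sel):
        out[i] += gain * np.asarray(obj_signal, dtype=float)
    return out
```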
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20130047059A KR20140128565A (en) | 2013-04-27 | 2013-04-27 | Apparatus and method for audio signal processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20130047059A KR20140128565A (en) | 2013-04-27 | 2013-04-27 | Apparatus and method for audio signal processing |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20140128565A true KR20140128565A (en) | 2014-11-06 |
Family
ID=52454412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR20130047059A KR20140128565A (en) | 2013-04-27 | 2013-04-27 | Apparatus and method for audio signal processing |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20140128565A (en) |
-
2013
- 2013-04-27 KR KR20130047059A patent/KR20140128565A/en not_active Application Discontinuation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9646620B1 (en) | Method and device for processing audio signal | |
KR102131748B1 (en) | Method and apparatus for encoding and decoding successive frames of an ambisonics representation of a 2- or 3-dimensional sound field | |
TWI744341B (en) | Distance panning using near / far-field rendering | |
US11488610B2 (en) | Audio decoder, audio encoder, method for providing at least four audio channel signals on the basis of an encoded representation, method for providing an encoded representation on the basis of at least four audio channel signals and computer program using a bandwidth extension | |
KR20140128564A (en) | Audio system and method for sound localization | |
CA2645912C (en) | Methods and apparatuses for encoding and decoding object-based audio signals | |
CN105075293A (en) | Audio apparatus and audio providing method thereof | |
CN104054126A (en) | Spatial audio rendering and encoding | |
JP6374980B2 (en) | Apparatus and method for surround audio signal processing | |
KR102148217B1 (en) | Audio signal processing method | |
KR101949756B1 (en) | Apparatus and method for audio signal processing | |
KR20140016780A (en) | A method for processing an audio signal and an apparatus for processing an audio signal | |
KR102059846B1 (en) | Apparatus and method for audio signal processing | |
KR101950455B1 (en) | Apparatus and method for audio signal processing | |
KR101949755B1 (en) | Apparatus and method for audio signal processing | |
KR20140128565A (en) | Apparatus and method for audio signal processing | |
JP6652990B2 (en) | Apparatus and method for surround audio signal processing | |
WO2020075286A1 (en) | Audio device and audio signal output method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |