KR102059846B1 - Apparatus and method for audio signal processing - Google Patents

Apparatus and method for audio signal processing

Info

Publication number
KR102059846B1
Authority
KR
South Korea
Prior art keywords
object
signal
information
channel
signal group
Prior art date
Application number
KR1020120084231A
Other languages
Korean (ko)
Other versions
KR20140017344A (en)
Inventor
오현오
송정욱
송명석
전세운
이태규
Original Assignee
인텔렉추얼디스커버리 주식회사 (Intellectual Discovery Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 인텔렉추얼디스커버리 주식회사 (Intellectual Discovery Co., Ltd.)
Priority to KR1020120084231A
Priority claimed from EP13825888.4A (EP2863657B1)
Publication of KR20140017344A
Application granted
Publication of KR102059846B1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network, synchronizing decoder's clock; Client middleware
    • H04N21/439: Processing of audio elementary streams
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding, i.e. using interchannel correlation to reduce redundancies, e.g. joint-stereo, intensity-coding, matrixing

Abstract

The present invention relates to a method and apparatus for processing an object audio signal and, more particularly, to a method and apparatus for encoding and decoding an object audio signal or rendering it in three-dimensional space.
According to an aspect of the invention, an audio signal processing apparatus may be provided that includes a receiving unit for receiving rendering information for an audio signal of two or more channels or objects, a control unit for determining decoding execution command information for the audio signal using the rendering information, and a command transfer unit for transferring the decoding execution command information to an audio signal decoder.

Description

Method and apparatus for processing audio signal {APPARATUS AND METHOD FOR AUDIO SIGNAL PROCESSING}

The present invention relates to a method and apparatus for processing an object audio signal and, more particularly, to a method and apparatus for encoding and decoding an object audio signal or rendering it in three-dimensional space.

3D audio collectively refers to a series of signal processing, transmission, encoding, and reproduction techniques for providing realistic sound in three-dimensional space by adding a height dimension to the horizontal (2D) sound scene provided by conventional surround audio. In particular, providing 3D audio broadly requires rendering technology that can form sound images at virtual positions where no speaker exists, whether a larger or a smaller number of speakers is used.

3D audio is expected to become the audio solution for upcoming ultra-high-definition televisions (UHDTV) and to be applied to a variety of fields, including theater sound, personal 3DTV, tablets, smartphones, cloud gaming, and in-vehicle sound as cars evolve into high-quality infotainment spaces.

3D audio first needs to transmit signals of more channels than before, up to 22.2 channels, which requires a suitable compression and transmission technique. Conventional high-quality codecs such as MP3, AAC, DTS, and AC3 have been optimized mainly for transmitting no more than 5.1 channels.

In addition, reproducing a 22.2-channel signal requires a listening-space infrastructure with a 24-speaker system, which will not spread through the market quickly. Technology is therefore required for effectively reproducing a 22.2-channel signal in a space with fewer speakers. Conversely, technology is also required for reproducing existing stereo or 5.1-channel sound over a larger number of speakers, such as a 10.1-channel or 22.2-channel environment; for providing the sound scene intended by the original source outside the prescribed speaker positions and listening-room environment; and for enjoying 3D sound in a headphone listening environment. Such techniques are collectively referred to herein as rendering, and specifically as downmix, upmix, flexible rendering, binaural rendering, and the like.

Meanwhile, an object-based signal transmission scheme is needed as an alternative for effectively transmitting such sound scenes. Depending on the sound source, transmitting on an object basis may be more advantageous than transmitting on a channel basis, and object-based transmission allows the user to arbitrarily control the playback level and position of each object. Accordingly, an effective transmission method is needed that can compress object signals at a high compression rate.

In addition, there may be sound sources in which channel-based and object-based signals are mixed, providing a new type of listening experience. Accordingly, a technique is needed for effectively transmitting channel signals and object signals together and rendering them effectively.

According to an aspect of the invention, an audio signal processing apparatus may be provided that includes a receiving unit for receiving rendering information for an audio signal of two or more channels or objects, a control unit for determining decoding execution command information for the audio signal using the rendering information, and a command transfer unit for transferring the decoding execution command information to an audio signal decoder.

According to another aspect of the invention, an audio signal processing method may be provided that includes receiving a bit string for generating an audio signal of two or more channels or objects, receiving decoding execution command information, generating decoding order information for the audio signal using the decoding execution command information, reading channel or object audio frame data from the bit string according to the decoding order information, and decoding the audio frame data.

According to another aspect of the present invention, an audio signal processing apparatus may be provided that includes a receiving unit for receiving at least one object signal, rendering metadata for the object signal, and user speaker environment information; a speaker mapping unit for mapping the object signal to the speaker on which it is to be reproduced, with reference to the rendering metadata and the user speaker environment information; and an information transfer unit for grouping the objects mapped to one speaker according to the mapping result of the speaker mapping unit and transferring the grouped information to a decoder.

According to the present invention, an audio signal can be effectively represented, encoded, transmitted, and stored, and a high-quality audio signal can be reproduced through a variety of reproduction environments and devices.

The effects of the present invention are not limited to those described above, and effects not mentioned here will be clearly understood by those skilled in the art from this specification and the accompanying drawings.

FIG. 1 is a diagram for explaining the viewing angle according to image size at the same viewing distance;
FIG. 2 is a layout diagram of the 22.2ch speaker arrangement as an example of a multi-channel configuration;
FIG. 3 is a conceptual diagram showing the position of each sound object in a listening space in which a listener listens to 3D audio;
FIG. 4 is an exemplary configuration diagram of object signal groups formed by the grouping method according to the present invention for the objects shown in FIG. 3;
FIG. 5 is a block diagram of an embodiment of an encoder of an object audio signal according to the present invention;
FIG. 6 is an exemplary configuration diagram of a decoding apparatus according to an embodiment of the present invention;
FIG. 7 illustrates an embodiment of a bit string generated by the encoding method according to the present invention;
FIG. 8 is a block diagram illustrating an object and channel signal decoding system according to the present invention;
FIG. 9 is a block diagram of another object and channel signal decoding system according to the present invention;
FIG. 10 illustrates an embodiment of a decoding system according to the present invention;
FIG. 11 is a diagram illustrating masking thresholds for a plurality of object signals according to the present invention;
FIG. 12 is an embodiment of an encoder for calculating a masking threshold for a plurality of object signals according to the present invention;
FIG. 13 is a diagram for explaining a 5.1-channel setup arranged according to the ITU-R Recommendation versus arranged at arbitrary positions;
FIG. 14 illustrates the structure of an embodiment in which a decoder for an object bit string is connected with a flexible rendering system that uses it, according to the present invention;
FIG. 15 illustrates the structure of another embodiment implementing decoding and rendering of an object bit string according to the present invention;
FIG. 16 is a diagram illustrating a structure for determining and transmitting a transmission plan between a decoder and a renderer;
FIG. 17 is a conceptual diagram for explaining the reproduction, using surrounding channels, of front speakers that are absent because of the display in a 22.2-channel system;
FIG. 18 is an embodiment of a processing method for placing a sound source at an absent speaker position according to the present invention;
FIG. 19 is a diagram for mapping signals generated per band to speakers disposed around a TV;
FIG. 20 is a diagram illustrating the relationship between products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented.

According to an aspect of the invention, an audio signal processing apparatus may be provided that includes a receiving unit for receiving rendering information for an audio signal of two or more channels or objects, a control unit for determining decoding execution command information for the audio signal using the rendering information, and a command transfer unit for transferring the decoding execution command information to an audio signal decoder.

Here, the control unit may calculate position information on the sound scene for the channels and objects of the audio signal from the rendering information.

Here, the control unit may generate grouped information by grouping the channel and object signals of the audio signal based on the rendering information.

According to another aspect of the invention, an audio signal processing method may be provided that includes receiving a bit string for generating an audio signal of two or more channels or objects, receiving decoding execution command information, generating decoding order information for the audio signal using the decoding execution command information, reading channel or object audio frame data from the bit string according to the decoding order information, and decoding the audio frame data.

Here, the decoding execution command information may be received from an external audio post-processing device or from a control device connected to the audio post-processing device.

Here, the decoding order information may be correlated with the processing order of the channel or object audio signals in the post-processing device.

According to another aspect of the present invention, an audio signal processing apparatus may be provided that includes a receiving unit for receiving at least one object signal, rendering metadata for the object signal, and user speaker environment information; a speaker mapping unit for mapping the object signal to the speaker on which it is to be reproduced, with reference to the rendering metadata and the user speaker environment information; and an information transfer unit for grouping the objects mapped to one speaker according to the mapping result of the speaker mapping unit and transferring the grouped information to a decoder.

Here, the information transfer unit may transfer the grouped information to an audio decoder that decodes the received object signals.

Since the embodiments described herein are intended to clearly explain the spirit of the present invention to those skilled in the art to which the present invention pertains, the present invention is not limited to the embodiments described herein, and its scope should be construed to include modifications or variations that do not depart from the spirit of the invention.

The terms used in this specification and the shapes shown in the accompanying drawings are chosen to explain the present invention easily, and the drawings are exaggerated where necessary to aid understanding; therefore, the present invention is not limited by the terms used in this specification or by the accompanying drawings.

In this specification, where a detailed description of a known configuration or function related to the present invention would obscure the gist of the present invention, that detailed description is omitted as necessary.

In the present invention, the following terminology may be interpreted according to the criteria below, and terms not described may be interpreted in the same spirit. "Coding" may be interpreted as encoding or decoding depending on context, and "information" encompasses values, parameters, coefficients, elements, and the like; meanings may differ in some cases, but the present invention is not limited thereto.

Hereinafter, a method and apparatus for processing an object audio signal according to an embodiment of the present invention will be described.

FIG. 1 is a diagram for explaining the viewing angle according to image size (e.g., UHDTV vs. HDTV) at the same viewing distance. As display manufacturing technology develops, image sizes are increasing in accordance with consumer demand. As shown in FIG. 1, a UHDTV (7680 * 4320 pixel image, 110) is about 16 times larger than an HDTV (1920 * 1080 pixel image, 120). If an HDTV is installed on the living-room wall and the viewer sits on the living-room couch at a certain viewing distance, the viewing angle may be about 30 degrees; when a UHDTV is installed at the same viewing distance, however, the viewing angle reaches about 100 degrees. When such a large, high-definition, high-resolution screen is installed, it is desirable to provide sound with a correspondingly high sense of immersion and presence. One or two surround-channel speakers may not be enough to give the viewer nearly the same experience as being at the scene, so a multichannel audio environment with more speakers and channels may be required.
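
As a numerical sanity check on these viewing angles, the following sketch computes the horizontal angle subtended by a screen at a given distance. The screen widths used are illustrative assumptions chosen to reproduce the approximate 30- and 100-degree figures above, not values from the specification.

```python
import math

def viewing_angle_deg(screen_width_m: float, distance_m: float) -> float:
    """Horizontal viewing angle subtended by a screen at a given distance."""
    return math.degrees(2 * math.atan((screen_width_m / 2) / distance_m))

# At a 2 m viewing distance, an ~1.07 m-wide HDTV subtends ~30 degrees,
# while a ~4.77 m-wide screen subtends ~100 degrees.
print(round(viewing_angle_deg(1.07, 2.0)))   # ~30
print(round(viewing_angle_deg(4.77, 2.0)))   # ~100
```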

As described above, candidate applications requiring such multichannel audio include, in addition to the home theater environment, personal 3DTV, smartphone TV, 22.2-channel audio programs, automobiles, 3D video, telepresence rooms, and cloud-based gaming.

FIG. 2 is a diagram illustrating the speaker arrangement of 22.2ch as an example of a multi-channel configuration. 22.2ch is one example of a multi-channel environment for enhancing the sound field, and the present invention is not limited to a specific number of channels or a specific speaker arrangement. Referring to FIG. 2, a total of nine channels may be provided in the top layer 1010: three speakers in front, three in the middle position, and three in the surround position. In the middle layer 1020, five speakers may be placed in front, two in the middle position, and three in the surround position; of the five front speakers, the three at center positions may be included within the TV screen. The bottom layer 1030 may hold a total of three front channels plus two LFE channels 1040.

As described above, transmitting and reproducing a multi-channel signal of up to several dozen channels may require a large amount of computation, and a high compression ratio may be required in consideration of the communication environment. Moreover, few ordinary homes have a multichannel (e.g., 22.2ch) speaker environment; many listeners have 2ch or 5.1ch setups. If the signal commonly transmitted to all users is one in which every channel is individually encoded, then the multichannel signal must be converted back to 2ch or 5.1ch before reproduction, which is not only communicationally inefficient but also requires storing a 22.2ch PCM signal, resulting in inefficient memory management.

FIG. 3 is a conceptual diagram illustrating the positions of the sound objects 120 that constitute a three-dimensional sound scene in the listening space 130 where the listener 110 listens to 3D audio. Referring to FIG. 3, although each object 120 is shown as a point source for convenience of illustration, the sound scene may also contain, besides point sources, plane-wave sources and ambient sources (sound spread over the whole space of the scene in all recognizable directions).

FIG. 4 shows object signal groups 410 and 420 formed by applying the grouping method according to the present invention to the objects shown in FIG. 3. According to the present invention, in encoding or processing object signals, object signal groups are formed and the grouped objects are encoded or processed in group units. Encoding here includes both discrete coding of an object as an individual signal and parametric coding of object signals. In particular, according to the present invention, the downmix signal for parametric encoding of object signals and the parameter information of the objects corresponding to that downmix are generated in group units. In the conventional SAOC coding technique, all objects constituting the sound scene are represented by one downmix signal (the downmix may be mono (one channel) or stereo (two channels), but is referred to as one for convenience) and the corresponding object parameter information. However, in the scenario considered in the present invention, with more than 20 objects and possibly as many as 200 or 500, representing everything with one downmix and corresponding object parameters makes upmixing and rendering at the desired sound quality practically impossible. Accordingly, the present invention groups the objects to be encoded and generates a downmix per group. A downmix gain may be applied to each object during the per-group downmix, and the applied per-object downmix gain is included as additional information in the bit string of each group. Meanwhile, for efficient coding or effective control of the overall gain, a global gain applied commonly to all groups and object group gains applied only to the objects of each group may be used; these are encoded and transmitted as part of the bit string.
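
The following minimal sketch illustrates the per-group downmix with per-object downmix gains, object group gains, and a global gain described above. The function name and data layout are assumptions for illustration, not the patent's normative process.

```python
import numpy as np

def group_downmix(objects, groups, downmix_gains, group_gains, global_gain):
    """Form one mono downmix per object group (simplified sketch).

    objects:       dict object_id -> 1-D sample array (all equal length)
    groups:        list of lists of object_ids, one list per group
    downmix_gains: dict object_id -> per-object downmix gain (side info)
    group_gains:   list of per-group 'object group gain' values
    global_gain:   gain applied commonly to all groups
    """
    downmixes = []
    for gi, group in enumerate(groups):
        # Weighted sum of the group's objects; gains go into the bit string.
        dmx = sum(downmix_gains[oid] * objects[oid] for oid in group)
        downmixes.append(global_gain * group_gains[gi] * dmx)
    return downmixes  # each downmix is then core-coded (e.g., AAC) separately

objs = {i: np.random.randn(1024) for i in range(4)}
dmx = group_downmix(objs, [[0, 1], [2, 3]], {i: 1.0 for i in objs}, [1.0, 0.8], 0.9)
```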

The first method of forming a group is to form groups of nearby objects in consideration of each object's position in the sound scene. The object groups 410 and 420 of FIG. 4 are examples formed in this manner. This is a way of keeping inaudible the crosstalk distortion between objects caused by the imperfection of parametric encoding, or the distortion arising when rendering moves or resizes an object to a third position: distortions on objects at the same location are relatively well hidden from the listener by masking. For the same reason, even when performing individual encoding, grouping spatially similar objects can yield the benefit of sharing additional information between them.

FIG. 5 is a block diagram of an embodiment of an encoder of an object audio signal including object grouping 550 and downmixing 520, 540 according to the present invention. Downmixing is performed per group, and in this process the parameters needed to restore the downmixed objects are generated (520, 540). The downmix signals generated per group are additionally encoded by a waveform encoder 560, commonly called a core codec, which encodes channel-wise waveforms such as AAC or MP3. Encoding may also be performed through coupling between the downmix signals. The signals generated by each encoder are formed into one bit string and transmitted through the multiplexer 570. Accordingly, the bit string generated by the downmix & parameter encoders 520, 540 and the waveform encoder 560 can be regarded as encoding the component objects that form one sound scene. In addition, the object signals belonging to different object groups in the generated bit string are encoded in the same time frame, and thus are reproduced in the same time zone. The grouping information generated by the object grouping unit may be encoded and transmitted to the receiving end.

FIG. 6 is a block diagram illustrating an embodiment of decoding a signal encoded and transmitted as described above. The decoding process is the inverse of encoding: the plurality of waveform-decoded (620) downmix signals are input to the upmixer & parameter decoder together with their corresponding parameters. Since there are multiple downmixes, multiple parameter decodings are required.

If the transmitted bit string includes the global gain and object group gains, they may be applied to restore the original object signals. These gain values can also be controlled in the rendering or transcoding process: adjusting the global gain controls the level of the entire signal, and adjusting an object group gain controls the level of that group. For example, when object grouping is performed on a playback-speaker basis, gain adjustment for the flexible rendering described later can easily be implemented through the object group gains.

Here, although the plurality of parameter encoders or decoders are illustrated as operating in parallel for convenience of description, it is also possible to encode or decode the plurality of object groups sequentially through one system.

Another method of forming an object group is to group objects with low mutual correlation into one group. This takes into account the characteristic of parametric coding that highly correlated objects are difficult to separate from the downmix. In this case, encoding may also adjust parameters such as the downmix gains at downmix time so that the grouped objects become easier to separate from one another, and the parameters used are preferably transmitted so that they can be used for signal restoration.

Another method of forming an object group is, conversely, to group objects highly correlated with each other into one group. This makes it difficult to separate the objects using parameters, but increases compression efficiency in applications that do not exploit such separation. Since a complex signal with a diverse spectrum requires many bits in a core codec, tying highly correlated objects together into one core codec yields high encoding efficiency.

Another method of forming an object group is to determine whether objects mask one another and to encode accordingly. For example, when object A masks object B, if both signals are included in one downmix and encoded by the core codec, object B may be omitted in the encoding process; distortion is then large when object B is recovered using parameters at the decoding stage. Objects A and B in such a relationship are therefore preferably placed in separate downmixes. On the other hand, if objects A and B are in a masking relationship but there is no need to render the two objects separately, or at least no need to handle the masked object separately, then it is preferable to include objects A and B in one downmix. The selection method may thus differ depending on the application. For example, if a specific object is masked, absent, or at least weak in the sound scene desired at the encoding stage, that object may be excluded from the object list and absorbed into the masking object, or the two objects may be expressed as one object.

Another way to form object groups is to separate non-point-source objects, such as plane-wave sources or ambient sources, and group them separately. Because their characteristics differ from those of point sources, such sources require different types of compression methods or parameters and are therefore preferably processed separately.

The decoded per-group object information is restored to the original objects through object degrouping, with reference to the transmitted grouping information.

FIG. 7 illustrates an embodiment of a bit string generated by the encoding method according to the present invention. Referring to FIG. 7, the main bit string 700 carrying the encoded channel or object data is arranged in the order of the channel groups 720, 730, 740 or the object groups 750, 760, 770. Since the header contains channel group position information CHG_POS_INFO 711 and object group position information OBJ_POS_INFO 712, the positions of each group within the bit string, the data of a desired group can be decoded first, without decoding the bit string sequentially. The decoder therefore generally decodes group data in order of arrival, but may change the decoding order arbitrarily according to other policies or reasons. FIG. 7 also illustrates a sub-bit string 701 that contains metadata 703, 704 for each channel or object, together with main decoding-related information, separately from the main bit string 700. The sub-bit string may be transmitted intermittently during transmission of the main bit string or through a separate transport channel.
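
As an illustration of why the header position fields enable this out-of-order access, the sketch below jumps directly to one object group's data. The byte-offset interpretation of OBJ_POS_INFO and the field layout are assumptions, since the exact bit-string format is not specified here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MainBitstreamHeader:
    chg_pos_info: List[int]  # assumed: byte offset of each channel group
    obj_pos_info: List[int]  # assumed: byte offset of each object group

def extract_object_group(bitstream: bytes, header: MainBitstreamHeader,
                         group_index: int) -> bytes:
    """Slice out one object group's data using the header offsets,
    instead of decoding the bit string sequentially."""
    start = header.obj_pos_info[group_index]
    end = (header.obj_pos_info[group_index + 1]
           if group_index + 1 < len(header.obj_pos_info)
           else len(bitstream))
    return bitstream[start:end]  # hand this slice to the group decoder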

(How to allocate bits per object group)

In generating a downmix for a plurality of groups and performing independent parametric object encoding for each group, the number of bits used by each group may differ. The criteria for allocating bits per group may take into account the number of objects in the group, the number of effective objects considering masking between the objects in the group, weights according to position considering human spatial resolution, the sound pressure of the objects, the correlation between objects, and the importance of the objects in the sound scene. For example, for three spatial object groups A, B, and C containing 3, 2, and 1 object signals respectively, the allocated bits may be 3·a1·(n − x), 2·a2·(n − y), and a3·n, where n is the base number of bits per object, x and y express how many fewer bits can be allocated due to masking effects between and within the objects of each group, and a1, a2, and a3 may be determined per group by the various factors mentioned above.
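
A sketch of this per-group allocation pattern, using the group sizes from the example; the importance weights a_k, the masking discounts x and y, and the base budget n are illustrative assumptions.

```python
def allocate_group_bits(num_objects, importance, masking_discount, n_base):
    """Per-group bit budget following the pattern k * a_k * (n - x_k)."""
    return [k * a * (n_base - x)
            for k, a, x in zip(num_objects, importance, masking_discount)]

# Groups A, B, C with 3, 2, 1 objects; n = 1024 bits per object (assumed):
print(allocate_group_bits([3, 2, 1], [1.0, 1.0, 1.0], [64, 32, 0], 1024))
# -> [2880, 1984, 1024]  (illustrative numbers only)
```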

(Encoding of position information of the main object and sub-objects in an object group)

On the other hand, for object information, it is desirable to have a means for delivering through metadata the mix information (the position and level of each object) recommended by the producer or suggested by other users. In the present invention this is called preset information for convenience. With preset position information, especially for a dynamic object whose position varies with time, the amount of information to transmit is not small: transmitting position information that changes every frame for 1000 objects, for example, produces a very large amount of data. It is therefore desirable to transmit object position information efficiently, and the present invention uses an efficient encoding method for position information based on the definitions of main object and sub-object.

A main object is an object whose position is expressed in absolute coordinates in three-dimensional space. A sub-object is an object whose position in three-dimensional space is expressed as a value relative to the main object; a sub-object therefore needs to know which main object it refers to. When grouping is performed, especially grouping based on position in space, the method can be implemented by expressing one object in the group as the main object and the remaining objects as sub-objects. If there is no grouping for encoding, or if grouping is not advantageous for sub-object position encoding, a separate set for position encoding may be formed. For the relative representation of sub-object positions to be more efficient than an absolute representation, the objects belonging to a group or set are preferably located within a limited range in space.
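
A minimal sketch of this main/sub-object position coding: the first object's coordinates are kept absolute and the others become small offsets from it. The choice of the first object as the main object and the function names are assumptions for illustration.

```python
import numpy as np

def encode_group_positions(group_positions):
    """Encode the first object's position absolutely (main object) and the
    rest as offsets from it (sub-objects)."""
    main = np.asarray(group_positions[0], dtype=float)
    subs = [np.asarray(p, dtype=float) - main for p in group_positions[1:]]
    return main, subs  # small offsets need fewer bits after quantization

main, subs = encode_group_positions(
    [(1.0, 2.0, 0.5), (1.1, 2.2, 0.5), (0.9, 1.8, 0.6)])
# subs are tiny vectors like [0.1, 0.2, 0.0] relative to the main object
```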

Another method of encoding position information according to the present invention is to express the position relative to a fixed speaker position instead of relative to the main object. For example, the relative position of an object is expressed with reference to the specified position values of a 22-channel speaker layout. The number of speakers and position values used as the reference may be based on values set in the current content.

In another embodiment according to the present invention, position information is expressed as an absolute or relative value and then quantized, with a quantization step that varies according to the absolute position. For example, since listeners are known to discriminate position much better in front of them than to the side or behind, it is preferable to set the quantization step so that the resolution for the frontal region is higher than for the sides. Likewise, since human resolution for azimuth is higher than for elevation, it is desirable to quantize the azimuth angle more finely than the elevation.
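
A toy illustration of such position-dependent quantization: the azimuth step sizes below are assumptions chosen so that the front is finer than the sides and back, not values prescribed by the patent.

```python
def quantize_azimuth(azimuth_deg: float) -> float:
    """Quantize azimuth with a finer step in front than at the sides/back."""
    a = abs(azimuth_deg)
    step = 2.0 if a <= 30 else 5.0 if a <= 90 else 10.0  # assumed steps
    return round(azimuth_deg / step) * step

print(quantize_azimuth(13.7))   # -> 14.0 (2-degree step in the frontal zone)
print(quantize_azimuth(123.0))  # -> 120.0 (10-degree step behind the listener)
```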

In another embodiment according to the present invention, for a dynamic object whose position is time-varying, the position may be expressed relative to the object's own previous position value instead of relative to the main object or another reference point. The position information for a dynamic object is therefore preferably transmitted together with flag information distinguishing whether the reference point used is the temporally previous one or a spatially neighboring one.

(Overall decoder architecture)

FIG. 8 is a block diagram illustrating an object and channel signal decoding system according to the present invention. The system may receive an object signal 801, a channel signal 802, or a combination of the two, and each object or channel signal may be waveform-coded (801, 802) or parametric-coded (803, 804). The decoding system can be broadly divided into a 3DA decoder 860 and a 3DA renderer 870, and any external system or solution may be used as the 3DA renderer 870. The 3DA decoder 860 and 3DA renderer 870 therefore preferably provide a standardized interface that is easily compatible with external components.

FIG. 9 is a block diagram of another object and channel signal decoding system according to the present invention. Similarly, this system may receive an object signal 901, a channel signal 902, or a combination of the two, and each object or channel signal may be waveform-coded (901, 902) or parametric-coded (903, 904). Compared with the system of FIG. 8, the differences are that the separate individual object decoder 810 and individual channel decoder 820 are integrated into one individual decoder 910, the parametric channel decoder 840 and parametric object decoder 830 are integrated into one parametric decoder 920, and a renderer interface unit 930 has been added for a convenient and standardized interface to the 3DA renderer 940. The renderer interface unit 930 receives user environment information, the renderer version, and the like from the 3DA renderer 940, which may be internal or external, and delivers the channel or object signals in a compatible form together with the associated metadata. The 3DA renderer interface 930 may include an order controller 1630, described later.

The parametric decoder 920 needs a downmix signal to generate an object or channel signal; the required downmix signal is decoded by the individual decoder 910 and input to it. Encoders corresponding to these object and channel signal decoding systems may be of various types: any encoder that can generate at least one of the bit strings 801, 802, 803, 804, 901, 902, 903, 904 shown in FIGS. 8 and 9 can be regarded as a compatible encoder. Furthermore, according to the present invention, the decoding systems of FIGS. 8 and 9 are designed to ensure compatibility with past systems and bit strings. For example, when an AAC-coded individual channel bit string is input, it is decoded by the individual (channel) decoder and sent to the 3DA renderer. In the case of an MPS (MPEG Surround) bit string, the downmix signal transmitted with it, which is AAC-encoded, is first decoded by the individual (channel) decoder and then passed to the parametric channel decoder, which operates like an MPEG Surround decoder. The same applies to bit strings encoded with SAOC (Spatial Audio Object Coding). In the system of FIG. 8, SAOC operates as a transcoder and is rendered to channels through MPEG Surround; for this purpose, the SAOC transcoder receives the reproduction channel environment information and generates and transmits a channel signal optimized for it. A conventional SAOC bit string can thus be received and decoded while performing rendering specialized for the user or reproduction environment. In the system of FIG. 9, when an SAOC bit string is input, instead of the transcoding operation that converts it into an MPS bit string, it is converted directly into individual objects suitable for the channels or for rendering; the computational load is therefore lower than with the transcoding structure, which is also advantageous for sound quality. In FIG. 9 the output of the object decoder is shown only as channels, but it may also be delivered to the renderer interface as individual object signals. In addition, although shown only in FIG. 9, when a residual signal is included in the parametric bit string (including the case of FIG. 8), it is decoded through the individual decoder.

(Individual, parameter combination, residual to channel)

FIG. 10 is a diagram illustrating a configuration of an encoder and a decoder according to another embodiment of the present invention.

Specifically, FIG. 10 illustrates a structure for scalable coding for the case where decoders have different speaker setups.

The encoder includes a downmixing unit 210; the decoder includes a demultiplexing unit 220 and one or more of a first decoding unit 230 through a third decoding unit 250.

The downmixing unit 210 downmixes the multichannel input signal CH_N to generate the downmix signal DMX. In this process, one or more of the upmix parameter UP and the upmix residual UR are generated. Then, by multiplexing the downmix signal DMX and the upmix parameter UP (and the upmix residual UR), one or more bit strings are generated and transmitted to the decoder.

Here, the upmix parameter UP is a parameter required for upmixing one or more channels into two or more channels, and may include spatial parameters and an inter-channel phase difference (IPD).

The upmix residual UR corresponds to the residual signal, i.e., the difference between the original input signal CH_N and the restored signal. Here, the restored signal may be the signal obtained by applying the upmix parameter UP to the downmix DMX, or a signal in which the channels not downmixed by the downmixing unit 210 are encoded in a discrete manner.

The demultiplexing unit 220 of the decoder may extract the downmix signal DMX and the upmix parameter UP from one or more bit strings, and may further extract the upmix residual UR. Here, the residual signal may be encoded by a method similar to the individual encoding of the downmix signal; its decoding is therefore performed through the individual (channel) decoder in the system shown in FIG. 8 or FIG. 9.

One (or more) of the first decoding unit 230 through the third decoding unit 250 may be selectively included according to the decoder's speaker setup environment. The loudspeaker setup varies with the type of device (smartphone, stereo TV, 5.1ch home theater, 22.2ch home theater, etc.). If, despite these varied environments, the bit strings and decoders for generating a multichannel signal such as 22.2ch were not selective, the full 22.2ch signal would have to be restored and then downmixed again to fit the speaker reproduction environment; the computation required for restoration and downmix would be very high, and delay could occur.

However, according to another embodiment of the present invention, since each device can selectively include one or more of the first through third decoding units according to its setup environment, the drawback described above can be eliminated.

The first decoding unit 230 decodes only the downmix signal DMX and does not involve an increase in the number of channels: it outputs a mono signal when the downmix is mono, and a stereo signal when it is stereo. It may be suitable for headphone-equipped devices with one or two speaker channels, smartphones, TVs, and the like.

Meanwhile, the second decoding unit 240 receives the downmix signal DMX and the upmix parameter UP, and generates a parametric M-channel signal PM from them. The number of channels increases compared to the first decoding unit, but when the upmix parameter UP covers upmixing only up to a total of M channels, a number of channels M smaller than the original channel count N is reproduced. For example, the original signal input to the encoder may be a 22.2ch signal, while the M channels may be 5.1ch, 7.1ch, and so on.

The third decoding unit 250 receives not only the downmix signal DMX and the upmix parameter UP but also the upmix residual UR. Whereas the second decoding unit generates a parametric M-channel signal, the third decoding unit can additionally apply the upmix residual signal UR to produce a restored N-channel signal.

Each device selectively includes one or more of the first through third decoding units and selectively parses the upmix parameter UP and the upmix residual UR from the bit string, thereby immediately generating a signal suited to its own speaker setup environment and reducing complexity and computation.
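
A sketch of this selective structure: the device picks the lightest decoding path its speaker setup needs. The parametric upmix here is a toy stand-in (per-channel gains) for real spatial synthesis, and all names are assumptions.

```python
import numpy as np

def first_decode(dmx):
    """Output the downmix as-is (devices with 1-2 speaker channels)."""
    return dmx

def second_decode(dmx, up):
    """Parametric upmix of a mono downmix to M channels via per-channel
    gains in `up` -- a toy stand-in for real spatial synthesis."""
    return np.outer(up, dmx)

def third_decode(dmx, up, ur):
    """Add the upmix residual to approach the original N-channel signal."""
    return second_decode(dmx, up) + ur

def decode_for_device(dmx, up=None, ur=None, num_speakers=2):
    if num_speakers <= 2 or up is None:
        return first_decode(dmx)        # downmix only
    if ur is None:
        return second_decode(dmx, up)   # parametric M channels (e.g., 5.1)
    return third_decode(dmx, up, ur)    # full N-channel restoration
```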

(Object Waveform Coding Considering Masking)

A waveform encoder according to the present invention (hereinafter, "waveform encoding/decoding" refers to encoding a channel or object audio signal so that each channel or object can be independently decoded; it is the concept corresponding to parametric encoding/decoding and is also referred to as discrete encoding/decoding) allocates bits taking into account each object's position in the sound scene. This exploits the psychoacoustic BMLD (Binaural Masking Level Difference) phenomenon and the characteristics of object signal coding.

To explain the BMLD phenomenon, consider the MS (Mid-Side) stereo coding used in conventional audio coding. Psychoacoustic masking is possible only when the masker that generates the masking and the maskee to be masked are in the same spatial direction. If the two channels of a stereo audio signal are highly correlated and equal in level, the sound image forms at the center between the two speakers; if there is no correlation, independent sounds emanate from each speaker and their images form on the respective speakers. If each channel of a maximally correlated input is encoded independently (dual mono), the quantization noise of the two channels is mutually uncorrelated: the audio signal images at the center while the quantization noise images separately at each speaker. The quantization noise that should have been masked thus fails to be masked because of this spatial mismatch, and is heard by a person as distortion. Sum-difference coding solves this by generating the sum (Mid) signal and difference (Side) signal of the two channels, running the psychoacoustic model on them, and quantizing accordingly, so that the generated quantization noise is imaged at the same position as the sound image.

In conventional channel coding, each channel is mapped to the speaker that reproduces it, and since speaker positions are fixed and separated from one another, masking between channels cannot be considered. However, when each object is encoded independently, whether masking occurs depends on the positions of the objects in the sound scene. It is therefore preferable to determine whether the object currently being encoded is masked by another object and to allocate bits accordingly.

FIG. 11 shows the signals of object 1 (1110) and object 2 (1120), the masking thresholds obtainable from these signals, and the masking threshold 1130 for the combined signal of objects 1 and 2. If object 1 and object 2 can be regarded as being at the same position relative to the listener, or at least within a range causing no BMLD problem, the region masked for the listener by these signals is like 1130, and the S2 component contained in object 1 will be completely masked and inaudible. When encoding object 1, it is therefore preferable to take the masking threshold for object 2 into account. Since masking thresholds combine additively, the joint threshold can be obtained by adding the respective thresholds for objects 1 and 2. Alternatively, since the computation of a masking threshold is itself expensive, it is also preferable to compute a single masking threshold from the signal formed by summing objects 1 and 2 in advance and to use it for encoding both objects. FIG. 12 shows an embodiment of an encoder that calculates a masking threshold for a plurality of object signals in this way according to the present invention.

Another masking threshold calculation method according to the present invention applies when the positions of the two object signals do not match exactly in terms of acoustic angle: instead of simply adding the two objects' masking thresholds, the contribution of each object's threshold to the other is attenuated according to how far apart the two objects are in space. That is, when the masking threshold for object 1 is M1(f) and the masking threshold for object 2 is M2(f), the final joint masking thresholds M1′(f) and M2′(f) used to encode each object are generated to satisfy the following relationship.

[Equation 1]

M1′(f) = M1(f) + A(f)·M2(f)

M2′(f) = A(f)·M1(f) + M2(f)

Here, A(f) is an attenuation factor determined by the positions and distance of the two objects in space and by the properties of the two objects, and has the range 0.0 ≤ A(f) ≤ 1.0.

Human directional resolution worsens toward the sides relative to the front, and further toward the rear; therefore the absolute positions of the objects may serve as another factor in determining A(f).

In another embodiment according to the present invention, one of the two objects may use only its own masking threshold, while only the other object uses a threshold that accounts for its counterpart. The former is called an independent object and the latter a dependent object. Since an independent object is encoded at high quality regardless of its counterpart, sound quality is preserved even under rendering that spatially separates it from that object. When object 1 is an independent object and object 2 is a dependent object, the masking thresholds can be expressed as follows.

[Equation 2]

M1′(f) = M1(f)

M2′(f) = A(f)·M1(f) + M2(f)

Whether each object is independent or dependent is preferably transmitted as per-object additional information to the decoder and renderer.
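
A minimal sketch of Equations 1 and 2 as one routine; the array shapes, clipping of A(f), and flag name are assumptions for illustration.

```python
import numpy as np

def joint_masking_thresholds(m1, m2, attenuation, obj1_independent=False):
    """Per-frequency joint masking thresholds per Equations 1 and 2.

    m1, m2:      masking thresholds M1(f), M2(f) as arrays over frequency
    attenuation: A(f), clipped to 0.0 <= A(f) <= 1.0, from the objects'
                 spatial positions/distance
    obj1_independent: if True, object 1 keeps its own threshold (Equation 2)
    """
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    a = np.clip(np.asarray(attenuation, dtype=float), 0.0, 1.0)
    m1_joint = m1 if obj1_independent else m1 + a * m2
    m2_joint = a * m1 + m2
    return m1_joint, m2_joint
```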

In another embodiment according to the present invention, when two objects are reasonably close in space, it is possible not only to add their masking thresholds but also to process the signals themselves as one object.

In another embodiment according to the present invention, particularly in the case of parametric coding, it is preferable to combine two signals into one object in consideration of the correlation between the two signals and their positions in space.

(Transcoding features)

In another embodiment according to the invention, when transcoding a bit string containing coupled objects, and in particular when transcoding to a lower bit rate, the number of objects may need to be reduced to shrink the data size; when downmixing multiple objects into one and expressing them as a single object, it is preferable to express the coupled objects as that one object.

In the encoding through object coupling described above, only the coupling of two objects was described as an example for convenience; coupling of more than two objects can be implemented in a similar manner.

(Need for flexible rendering)

Among the technologies required for 3D audio, flexible rendering is one of the major challenges that must be solved to bring the quality of 3D audio to its maximum. It is well known that the positions of 5.1-channel speakers vary greatly with living-room structure and furniture layout. The sound scene intended by the content creator should be provided even with speakers in such non-standard positions. To do this, the renderer must know the speaker environment of each user's reproduction setup and how it differs from the standard layout, and a rendering technique is needed to correct for the difference. In other words, the codec's role does not end with decoding the transmitted bit string: a series of techniques is required to optimize and transform the decoded signal for the user's reproduction environment.

FIG. 13 shows, for a 5.1-channel setup, the arrangement according to the ITU-R Recommendation (grey, 1310) and an arbitrary arrangement (red, 1320). In an actual living-room environment, both the azimuth angles and the distances of the speakers may differ from the Recommendation (and although not shown, the speaker heights may differ as well). If the original channel signals are reproduced at such changed speaker positions, it is difficult to provide the ideal 3D sound scene.

(Flexible rendering)

Using amplitude panning, which determines the direction of a sound source between two speakers from the relative signal levels, or VBAP (Vector-Based Amplitude Panning), which is widely used to position a sound source in three-dimensional space using three speakers, flexible rendering can be implemented relatively conveniently for object signals transmitted per object. This is one of the advantages of transmitting object signals instead of channel signals.
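
A minimal 2-D amplitude-panning (tangent-law / VBAP-style) sketch for one source between two speakers; the speaker angles and the power normalization choice are assumptions.

```python
import numpy as np

def pan_gains_2d(source_deg, left_deg, right_deg):
    """2-D VBAP-style gains: solve s = gL*L + gR*R for unit direction
    vectors L, R of the two speakers and s of the source."""
    to_vec = lambda d: np.array([np.cos(np.radians(d)), np.sin(np.radians(d))])
    L, R, s = to_vec(left_deg), to_vec(right_deg), to_vec(source_deg)
    g = np.linalg.solve(np.column_stack([L, R]), s)
    return g / np.linalg.norm(g)  # power-normalize the gain pair

print(pan_gains_2d(10, 30, -30))  # source at 10 deg leans toward the left speaker
```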

(Object Decoding and Rendering Structure)

FIG. 14 illustrates the structures 1400 and 1401 of two embodiments in which a decoder for an object bit string is connected with a flexible rendering system that uses it, according to the present invention. As described above, an object has the advantage that it can easily be positioned as a sound source according to the desired sound scene. Here the mixer (Mix, 1420) receives position information expressed as a mixing matrix and first converts the objects into channel signals; that is, the positional information of the sound scene is expressed relative to the speakers corresponding to the output channels. If the actual number and positions of the speakers do not match the prescribed positions, a re-rendering step using the actual speaker position information (Speaker Config) is then required. As described below, re-rendering a channel signal into another channel configuration is more difficult than rendering objects directly to the final channels.

FIG. 15 illustrates the structure of another embodiment implementing decoding and rendering of an object bit string according to the present invention. Compared with FIG. 14, flexible rendering 1510 suited to the final speaker environment is implemented directly together with decoding from the bit string. That is, instead of the two-step process of mixing into a fixed channel layout based on the mixing matrix and then rendering from that layout to the flexible speakers, rendering parameters are generated from the mixing matrix and the speaker position information 1520 and used to render the object signals directly to the target speakers.

(Flexible rendering by mixing into channels)

On the other hand, when a channel signal is transmitted as input and the position of the speaker corresponding to that channel is changed to an arbitrary position, the panning techniques used for objects are difficult to apply directly, and a separate channel-to-channel mapping process is required. The problem is that the processes and solutions required for rendering object signals and channel signals differ; when object and channel signals are transmitted together to produce a sound scene in which the two are mixed, spatial mismatch between them easily causes distortion. To solve this problem, another embodiment according to the present invention does not perform flexible rendering on the objects separately, but first mixes the objects into the channel signal and then performs flexible rendering on the channel signal. Rendering using HRTFs is preferably implemented in the same manner.

(Decoding stage downmix: parameter transfer or auto-generation)

In downmix rendering, when multichannel content is played through fewer output channels, it has conventionally been implemented with an M-by-N downmix matrix (M: number of input channels, N: number of output channels). That is, when 5.1-channel content is reproduced in stereo, the downmix is performed by a given formula. However, with this downmix implementation, even though the user's playback environment is only 5.1 channels, all bit strings corresponding to the transmitted 22.2 channels must be decoded, which creates a computation problem. If the entire 22.2-channel signal must be decoded just to generate a stereo signal for playback on a portable device, the computational burden is very high and an enormous amount of memory is wasted storing the 22.2 channels of decoded audio.
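
A sketch of applying such an M-input/N-output downmix matrix; the 5.1-to-stereo coefficients shown are one common convention, not values prescribed by the patent.

```python
import numpy as np

def downmix(channels: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Apply an N x M downmix matrix to (M, samples) multichannel audio."""
    return matrix @ channels

# Illustrative 5.1 -> stereo coefficients; channel order: L R C LFE Ls Rs
m = np.array([[1.0, 0.0, 0.707, 0.0, 0.707, 0.0],
              [0.0, 1.0, 0.707, 0.0, 0.0,   0.707]])
stereo = downmix(np.random.randn(6, 480), m)  # -> shape (2, 480)
```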

(Transcoding as an alternative to downmix)

As an alternative, one can consider converting the huge 22.2-channel original bit string into an effective number of bit strings appropriate for the target device or target playback space. For example, for 22.2-channel content stored on a cloud server, a scenario may be implemented in which playback environment information is received from the client terminal and the content is converted accordingly before transmission.

(Decoding order and downmix order; order controller)

On the other hand, in a scenario in which the decoder and the renderer are separated, it may be necessary, for example, to decode 50 object signals along with a 22.2-channel audio signal and transmit them to the renderer. Since the decoded signal has a high data rate, a very large bandwidth is required between the decoder and the renderer. It is therefore undesirable to transmit such a large amount of data at once; an effective transmission plan is needed, and it is preferable that the decoder determine the decoding order and transmit accordingly. FIG. 16 is a block diagram illustrating a structure for determining and transmitting such a transmission plan between the decoder and the renderer.

The order controller 1630 receives the additional information and metadata obtained by decoding the bit string, together with the reproduction environment and rendering information, from the renderer 1620; determines the decoding order and the unit in which decoded signals are transmitted to the renderer 1620; and delivers the determined control information back to the decoder 1610 and the renderer 1620. For example, if the renderer 1620 instructs that a specific object be completely removed, that object need not be transmitted to the renderer 1620, nor even decoded. Alternatively, when specific objects are rendered only to a specific channel, the transmission bandwidth can be reduced by downmixing those objects to that channel in advance instead of transmitting them separately. In another embodiment, when sound scenes are grouped spatially and transmitted together with the signals needed to render each group, the amount of signal waiting unnecessarily in the renderer's internal buffer can be minimized. The amount of data that can be accepted at one time may also vary by renderer 1620; this information can likewise be reported to the order controller 1630, so that the decoder 1610 can determine the decoding timing and transmission amount accordingly.

Moreover, the decoding control by the order controller 1630 may be extended to the encoding stage, making it possible to control the encoding process itself: unnecessary signals can be excluded at encoding time, or the grouping of objects and channels can be determined.

(Voice Highway)

Meanwhile, the bit string may contain an object corresponding to the voice of a bidirectional communication. Since bidirectional communication, unlike other content, is very sensitive to time delay, such an object or channel signal must be delivered to the renderer first when received. The corresponding object or channel signal may be indicated by a separate flag. Unlike other objects/channels, this priority-transport object has a presentation time independent of the other object/channel signals contained in the same frame.

(AV Matching and Phantom Center)

One of the new problems that arises when considering UHDTV (ultra-high-definition TV) is what is often called the near field. Considering the viewing distance of a typical user environment (a living room), the distance from the listener to each reproducing speaker can be shorter than the distance between the speakers themselves, so each speaker behaves as a point source; moreover, the screen is very large and there is no speaker at its center. As a result, a high-quality 3D audio service is possible only when the spatial resolution of the sound objects synchronized with the video is very high.

In a conventional system with an audio-visual angle of about 30 degrees, the stereo speakers placed at the left and right are not in a near-field situation and are sufficient to render a sound scene for an object moving across the screen (for example, a car moving from left to right). In a UHDTV environment with an audio-visual angle of 100 degrees, however, not only left-right resolution but also additional resolution covering the top and bottom of the screen is required. For example, if there are two characters on the screen, having both voices come from the middle did not seem like a serious problem on current HDTVs, but at UHDTV size the mismatch between the screen and the corresponding sound will be perceived as a new form of distortion.

One solution to this problem is a speaker configuration in the form of 22.2 channels. FIG. 2 shows an example of a 22.2-channel arrangement. According to FIG. 2, a total of eleven speakers are disposed in the front to greatly increase the front spatial resolution both left-right and top-bottom: five speakers are placed in the middle layer, which was previously handled by three, and three speakers each were added in the upper and lower layers to cope adequately with the height of sounds. Using this arrangement, the spatial resolution of the front is higher than before, which is advantageous for matching the video signal. However, in current TVs using display elements such as LCDs and OLEDs, the display occupies positions where speakers should exist. That is, unless the display itself produces sound or transmits sound through it, the problem arises of providing sound matched to each object position on the screen using only speakers located outside the display area. In FIG. 2, at minimum the speakers corresponding to FLc, FC, and FRc are disposed at positions that overlap the display.

FIG. 17 is a conceptual diagram illustrating how the front speakers that are absent because of the display in a 22.2-channel system are reproduced using their neighboring channels. Placing additional speakers around the upper and lower periphery of the display, such as the circles indicated by dashed lines, may also be considered to substitute for the absent FLc, FC, and FRc speakers. According to FIG. 17, there may be seven peripheral channels that can be used to generate FLc; using these seven speakers, a virtual source can be created to reproduce sound corresponding to the absent speaker position.

Techniques and properties such as VBAP or the Haas effect (the precedence effect) can be used to create a virtual source with the surrounding speakers. Alternatively, different panning techniques may be applied to different frequency bands. Furthermore, azimuth change and height adjustment using HRTFs can be considered. For example, when replacing FC using BtFC, an HRTF that imparts a sense of elevation may be applied so that the FC channel signal can be added to BtFC. An observed property of HRTFs is that controlling the perceived height of a sound requires controlling the position of a particular null in the high-frequency band (a position that varies from person to person). To generalize over these person-dependent nulls, height adjustment may instead be implemented by increasing or decreasing the high-frequency band as a whole. This method has the disadvantage of introducing distortion into the signal owing to the influence of the filter.
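
A hedged sketch of that generalized height adjustment follows: boost or cut everything above an assumed corner frequency instead of modeling per-listener HRTF notches. The 7 kHz corner and the filter order are illustrative; the text specifies neither.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def adjust_height(x: np.ndarray, fs: int, gain_db: float, fc: float = 7000.0) -> np.ndarray:
    """Boost or attenuate the band above fc to raise or lower the perceived
    height; a crude, person-independent stand-in for HRTF notch control."""
    sos = butter(4, fc, btype="high", fs=fs, output="sos")
    high = sosfilt(sos, x)
    g = 10.0 ** (gain_db / 20.0) - 1.0   # extra gain applied to the high band only
    return x + g * high
```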

FIG. 18 shows a processing method according to the present invention for placing a sound source at an absent speaker position. According to FIG. 18, the channel signal corresponding to the phantom speaker position is used as the input signal, and this input passes through a subband filter unit 1810 that divides it into three bands. The method may also be implemented without a speaker array; in that case, the signal may be divided into two bands instead of three, or still divided into three bands with the latter two bands processed in a different way. The first band is a signal that can be reproduced through a woofer or subwoofer, because the low-frequency band is relatively insensitive to source position and is preferably played through a large speaker. A time delay 1820 is added to the first-band signal in order to exploit the precedence effect; this delay is not meant to compensate for the filter delays incurred in processing the other bands, but is an additional delay that makes this band play later than the other band signals.
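
A minimal sketch of the three-band split and the precedence-effect delay is given below, assuming Butterworth crossovers at 200 Hz and 4 kHz and a 15 ms extra delay; none of these values are given in the text.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 48000  # sample rate assumed; the patent does not fix one

def split_three_bands(x: np.ndarray, f1: float = 200.0, f2: float = 4000.0):
    """Divide the phantom-center channel into the three bands of FIG. 18."""
    low  = sosfilt(butter(4, f1, btype="low",  fs=FS, output="sos"), x)
    mid  = sosfilt(butter(4, [f1, f2], btype="band", fs=FS, output="sos"), x)
    high = sosfilt(butter(4, f2, btype="high", fs=FS, output="sos"), x)
    return low, mid, high

def precedence_delay(x: np.ndarray, extra_ms: float = 15.0) -> np.ndarray:
    """Extra delay (beyond any filter-delay compensation) so the woofer band
    arrives later than the panned/array bands, exploiting the Haas effect."""
    n = int(FS * extra_ms / 1000.0)
    return np.concatenate([np.zeros(n), x])[: len(x)]
```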

The second band is the signal that will be reproduced through the speakers around the phantom speaker (speakers disposed around the bezel of the TV display); it is distributed over at least two speakers, and panning coefficients are generated and applied using a panning algorithm 1830 such as VBAP. The panning effect therefore improves when the number and positions of the speakers reproducing the second-band output (relative to the phantom speaker) are provided accurately. Besides VBAP panning, a filter based on an HRTF may be applied, or different phase filters or time-delay filters may be applied to obtain a time-panning effect. A further advantage of applying the HRTF band by band is that the range of signal distortion caused by the HRTF can be confined to that band.
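
For the second band, a two-speaker amplitude-panning (2-D VBAP style) gain computation might look like the sketch below; the geometry, angles, and power normalization are illustrative assumptions rather than the patent's method.

```python
import numpy as np

def vbap_2d(source_deg: float, spk_a_deg: float, spk_b_deg: float) -> np.ndarray:
    """Two-speaker amplitude-panning gains (2-D VBAP): solve L g = p, where
    the columns of L are unit vectors toward the two real speakers and p is
    the unit vector toward the phantom position; angles in degrees."""
    def unit(deg: float) -> np.ndarray:
        r = np.radians(deg)
        return np.array([np.cos(r), np.sin(r)])
    L = np.column_stack([unit(spk_a_deg), unit(spk_b_deg)])
    g = np.linalg.solve(L, unit(source_deg))
    g = np.clip(g, 0.0, None)            # negative gains are not used
    return g / np.linalg.norm(g)         # power-normalize the pair

# e.g. phantom FC at 0 degrees rendered by speakers 10 degrees above and below it
print(vbap_2d(0.0, 10.0, -10.0))         # roughly [0.707, 0.707]
```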

The third band is for generating signals to be reproduced through a speaker array, if one is present; an array signal processing technique 1840 for virtualizing the sound source through at least three speakers can be applied. Alternatively, coefficients generated through Wave Field Synthesis (WFS) may be applied. In practice, the third band and the second band may be the same band.
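
For the third band, the sketch below derives per-speaker delays that steer a small array toward a virtual source position, as a crude stand-in for the array processing or WFS driving functions mentioned above; the array geometry and all names are assumptions.

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def delay_and_sum_delays(spk_xy: np.ndarray, target_xy: np.ndarray, fs: int) -> np.ndarray:
    """Integer sample delays focusing a speaker array on a virtual source:
    speakers farther from the focus fire earlier, so wavefronts converge."""
    d = np.linalg.norm(spk_xy - target_xy, axis=1)   # speaker-to-source distances
    tau = (d.max() - d) / C                          # relative firing delays
    return np.round(tau * fs).astype(int)

# e.g. five array speakers on a line below the display, focus at the FC position
spk = np.stack([np.linspace(-0.4, 0.4, 5), np.zeros(5)], axis=1)
print(delay_and_sum_delays(spk, np.array([0.0, 0.3]), 48000))
```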

FIG. 19 illustrates an embodiment of mapping the signals generated in each band to speakers arranged around a TV. According to FIG. 19, the number and positions of the speakers corresponding to the second and third bands must be defined relatively precisely, and this position information is preferably provided to the processing system of FIG. 18.

FIG. 20 is a diagram illustrating the relationships among products in which an audio signal processing device according to an embodiment of the present invention is implemented. Referring to FIG. 20, the wired/wireless communication unit 310 receives a bitstream through a wired or wireless communication scheme. Specifically, the wired/wireless communication unit 310 may include at least one of a wired communication unit 310A, an infrared communication unit 310B, a Bluetooth unit 310C, and a wireless LAN communication unit 310D.

The user authentication unit 320 receives user information and performs user authentication, and may include one or more of a fingerprint recognition unit 320A, an iris recognition unit 320B, a face recognition unit 320C, and a voice recognition unit 320D. Fingerprint, iris, facial contour, and voice information may be input and converted into user information, and user authentication may be performed by determining whether the user information matches previously registered user data.

The input unit 330 is an input device through which the user enters various kinds of commands, and may include one or more of a keypad unit 330A, a touch pad unit 330B, and a remote controller unit 330C, but is not limited thereto.

The signal coding unit 340 encodes or decodes an audio signal and/or a video signal received through the wired/wireless communication unit 310 and outputs an audio signal in the time domain. It includes an audio signal processing apparatus 345, which corresponds to an embodiment of the present invention (that is, the decoder 600 according to one embodiment, or the encoder and decoder 1400 according to another embodiment). The audio signal processing apparatus 345 and the signal coding unit including it may be implemented by one or more processors.

The control unit 350 receives input signals from the input devices and controls all processes of the signal coding unit 340 and the output unit 360. The output unit 360 is the component through which the output signal generated by the signal coding unit 340 is output, and may include a speaker unit 360A and a display unit 360B. When the output signal is an audio signal, it is output through the speaker, and when it is a video signal, it is output through the display.

The audio signal processing method according to the present invention may be produced as a program to be executed on a computer and stored in a computer-readable recording medium, and multimedia data having a data structure according to the present invention may likewise be stored in a computer-readable recording medium. Computer-readable recording media include all kinds of storage devices that store data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of carrier waves (for example, transmission over the Internet). In addition, the bitstream generated by the encoding method may be stored in a computer-readable recording medium or transmitted over a wired/wireless communication network.

As described above, although the present invention has been described with reference to limited embodiments and drawings, the present invention is not limited thereto, and it will be understood by those skilled in the art to which the present invention pertains that various modifications and variations are possible within the scope of equivalents of the claims set forth below.

210: upper layer
220: middle layer
230: floor layer
240: LFE channel
410: first object signal group
420: second object signal group
430: listener
520: Downmixer & Parameter Encoder 1
550: object grouping unit
560: waveform encoder
620: waveform decoder
630: Upmixer & Parameter Decoder 1
670: object degrouping unit
860: 3DA decoder
870: 3DA rendering unit
1110: masking threshold curve of object 1
1120: masking threshold curve of object 2
1130: masking threshold curve of object 3
1230: psychoacoustic model
1310: 5.1-channel loudspeaker layout according to ITU-R Recommendations
1320: 5.1-channel loudspeakers placed at arbitrary positions
1610: 3D audio decoder
1620: 3D Audio Renderer
1630: sequence control unit
1810: subband filter unit
1820: time delay filter unit
1830: panning algorithm
1840: speaker array control unit

Claims (8)

  1. (Deleted)
  2. (Deleted)
  3. (Deleted)
  4. A method of processing an audio signal, the method comprising:
    receiving a first signal for a first object signal group including a plurality of object signals and a second signal for a second object signal group including a plurality of object signals;
    receiving first metadata for the first object signal group and second metadata for the second object signal group;
    reproducing the object signals belonging to the first object signal group using the first signal and the first metadata;
    reproducing the object signals belonging to the second object signal group using the second signal and the second metadata; and
    receiving global gain information,
    wherein each of the first and second metadata includes position information of an object corresponding to an object signal included in the corresponding object signal group,
    wherein, when the object is a dynamic object whose position varies with time, the position information is a position value relative to the position value of that object at a preceding point in time,
    wherein the first object signal group and the second object signal group include signals mixed to form one sound scene,
    wherein the first metadata and the second metadata are received in one bit string, and
    wherein the global gain information is a gain value applied to both the first object signal group and the second object signal group.
  5. The method of claim 4, further comprising:
    obtaining downmix gain information on at least one object signal belonging to the first object signal group from the first metadata; and
    reproducing the at least one object signal using the downmix gain information.
  6. (Deleted)
  7. The method of claim 4, wherein
    the position information of the dynamic object indicates a position value relative to the position value of that object at a preceding point in time.
  8. A method of processing an audio signal, the method comprising:
    generating a first signal for a first object signal group including a plurality of object signals and a second signal for a second object signal group including a plurality of object signals;
    generating first metadata for the first object signal group and second metadata for the second object signal group; and
    generating global gain information,
    wherein the object signals belonging to the first object signal group are reproduced using the first signal and the first metadata, and the object signals belonging to the second object signal group are reproduced using the second signal and the second metadata,
    wherein each of the first and second metadata includes position information of an object corresponding to an object signal included in the corresponding object signal group,
    wherein, when the object is a dynamic object whose position varies with time, the position information is a position value relative to the position value of that object at a preceding point in time,
    wherein the first object signal group and the second object signal group include signals mixed to form one sound scene,
    wherein the first metadata and the second metadata are generated in one bit string, and
    wherein the global gain information is a gain value applied to both the first object signal group and the second object signal group.
KR1020120084231A 2012-07-31 2012-07-31 Apparatus and method for audio signal processing KR102059846B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020120084231A KR102059846B1 (en) 2012-07-31 2012-07-31 Apparatus and method for audio signal processing

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
KR1020120084231A KR102059846B1 (en) 2012-07-31 2012-07-31 Apparatus and method for audio signal processing
EP13825888.4A EP2863657B1 (en) 2012-07-31 2013-07-26 Method and device for processing audio signal
US14/414,910 US9564138B2 (en) 2012-07-31 2013-07-26 Method and device for processing audio signal
CN201380039768.3A CN104541524B (en) 2012-07-31 2013-07-26 A kind of method and apparatus for processing audio signal
PCT/KR2013/006732 WO2014021588A1 (en) 2012-07-31 2013-07-26 Method and device for processing audio signal
JP2015523022A JP6045696B2 (en) 2012-07-31 2013-07-26 Audio signal processing method and apparatus
US15/383,293 US9646620B1 (en) 2012-07-31 2016-12-19 Method and device for processing audio signal

Publications (2)

Publication Number Publication Date
KR20140017344A KR20140017344A (en) 2014-02-11
KR102059846B1 (en) 2020-02-11

Family

ID=50266006

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020120084231A KR102059846B1 (en) 2012-07-31 2012-07-31 Apparatus and method for audio signal processing

Country Status (1)

Country Link
KR (1) KR102059846B1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100062784A (en) * 2008-12-02 2010-06-10 한국전자통신연구원 Apparatus for generating and playing object based audio contents
KR101227932B1 (en) * 2011-01-14 2013-01-30 전자부품연구원 System for multi channel multi track audio and audio processing method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jonas Engdegard, et al. Spatial audio object coding (SAOC) - The upcoming MPEG standard on parametric object based audio coding. Audio Engineering Society Convention 124. 2008.05.20.*

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
AMND Amendment
E601 Decision to refuse application
X091 Application refused [patent]
AMND Amendment
X701 Decision to grant (after re-examination)