KR20140128565A - Apparatus and method for audio signal processing - Google Patents
- Publication number
- KR20140128565A (application number KR20130047059A)
- Authority
- KR
- South Korea
- Prior art keywords
- signal
- channel
- speaker
- information
- speakers
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S5/00—Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
- H04S5/005—Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation of the pseudo five- or more-channel type, e.g. virtual surround
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Stereophonic System (AREA)
Abstract
According to an aspect of the present invention, there is provided a method of processing an audio signal, comprising: obtaining position information of each of a plurality of composite speakers; obtaining a crossover frequency of the composite speakers; receiving a bit string including an object signal and object position information; decoding the object signal and the object position information from the received bit string; calculating energy distribution information of the decoded object signal; selecting two or more of the plurality of composite speakers using the decoded object position information; generating output gain values for the selected composite speakers using the crossover frequency and the energy distribution of the object signal; and rendering the decoded object signal into a plurality of channel signals using the generated gain values.
Description
The present invention relates to a method and apparatus for processing an object audio signal, and more particularly, to a method and apparatus for encoding and decoding an object audio signal or rendering the object audio signal in a three-dimensional space.
3D audio refers collectively to the signal processing, transmission, encoding, and reproduction technologies that provide immersive sound in three-dimensional space by adding a height dimension to the horizontal (2D) sound scene of conventional surround audio. In particular, providing 3D audio requires rendering techniques that form a sound image at a virtual position where no speaker exists, whether a larger or smaller number of speakers is available.
3D audio is expected to become the audio solution for future ultra-high-definition TVs (UHDTV) and to be used in a wide range of applications: theater sound, personal 3DTV, tablets, games, and vehicle sound as cars evolve into high-quality infotainment spaces.
3D audio requires transmitting signals of far more channels, up to 22.2, than conventional compression and transmission techniques support. Conventional high-quality codecs such as MP3, AAC, DTS, and AC3 are optimized for transmitting no more than 5.1 channels.
In addition, reproducing a 22.2-channel signal requires a listening space fitted with a 24-speaker system, an infrastructure that will not spread quickly in the market. Therefore the following techniques are needed: reproducing a 22.2-channel signal effectively in a space with fewer speakers; conversely, reproducing a conventional stereo or 5.1-channel sound source over a larger number of speakers, e.g. in a 10.1-channel or 22.2-channel environment; providing the sound scene of the original source in a listening environment and speaker positions different from those intended; and enjoying 3D sound in a headphone listening environment. Herein these techniques are collectively referred to as rendering, and in detail as downmix, upmix, flexible rendering, binaural rendering, and the like.
On the other hand, an object-based signal transmission scheme is needed as an alternative for efficiently transmitting such sound scenes. Depending on the sound source, transmitting on an object basis can be more advantageous than channel-based transmission. In addition, object-based transmission allows the user to control the playback level and position of each object arbitrarily, enabling interactive consumption of the content. Accordingly, an effective transmission method that compresses object signals at a high rate is needed.
In addition, a sound source in which channel-based and object-based signals are mixed may exist, providing a new type of listening experience. Accordingly, a technique is needed for effectively transmitting a channel signal and an object signal together and effectively rendering them.
Finally, depending on the particular channel configuration and the speaker environment at the playback end, exception channels that are difficult to reproduce in the conventional manner may occur. In this case, a technique is needed for effectively reproducing such exception channels based on the speaker environment at the playback end.
According to an aspect of the present invention, there is provided a method of reproducing an audio signal including an object signal using composite speakers, the method comprising: receiving a crossover frequency of the composite speakers; receiving position information of the composite speakers; receiving a bit string including an object signal and object position information; decoding the object signal and the object position information from the received bit string; calculating an energy distribution of the decoded object signal; selecting two or more composite speakers using the decoded object position information; generating output gain values for the selected composite speakers using the crossover frequency and the energy distribution of the object signal; and rendering the decoded object signal as channel signals using the generated gain values.
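The speaker-selection and gain-generation steps above can be sketched as follows. This is an illustrative reading, not the patent's reference implementation: the speaker layout, the constant-power panning law, and the band-split definition of "energy distribution" are assumptions.

```python
import math

def select_and_pan(obj_az, speakers):
    """Select the two composite speakers nearest the object azimuth and
    return constant-power panning gains between them. `speakers` is a
    list of (name, azimuth_deg) pairs; names and layout are illustrative."""
    (n1, a1), (n2, a2) = sorted(speakers, key=lambda s: abs(s[1] - obj_az))[:2]
    frac = (obj_az - a1) / ((a2 - a1) or 1.0)
    frac = min(max(frac, 0.0), 1.0)
    return {n1: math.cos(frac * math.pi / 2), n2: math.sin(frac * math.pi / 2)}

def band_energy_split(freqs, energies, crossover_hz):
    """Share of object-signal energy below and above the crossover
    frequency of the composite speaker -- one plausible form of the
    'energy distribution' used when generating the output gain values."""
    low = sum(e for f, e in zip(freqs, energies) if f < crossover_hz)
    high = sum(e for f, e in zip(freqs, energies) if f >= crossover_hz)
    total = low + high
    return (low / total, high / total) if total else (0.5, 0.5)
```

An object panned halfway between two speakers receives equal gains whose squares sum to one, preserving total power.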
According to the present invention, a virtual speaker is generated using two different composite speakers. Such a virtual speaker can localize a sound image more precisely than existing sound-image localization methods. The effects of the present invention are not limited to those mentioned above; effects not mentioned will be clearly understood by those skilled in the art from this specification and the accompanying drawings.
FIG. 1 illustrates viewing angles according to image size at the same viewing distance.
FIG. 2 shows a speaker arrangement of 22.2 channels.
FIG. 3 is a conceptual diagram showing the position of each sound object in the listening space in which the listener listens to 3D audio.
FIG. 4 is an exemplary configuration diagram of object signal groups formed from the objects shown in FIG. 3 using the grouping method according to the present invention.
FIG. 5 is a block diagram of an embodiment of an object audio signal encoder according to the present invention.
FIG. 6 is an exemplary configuration diagram of a decoding apparatus according to an embodiment of the present invention.
FIG. 7 illustrates an embodiment of a bit string generated by the encoding method according to the present invention.
FIG. 8 is a block diagram of an object and channel signal decoding system according to an embodiment of the present invention.
FIG. 9 is a block diagram of another object and channel signal decoding system according to the present invention.
FIG. 10 is a block diagram of an embodiment of a decoding system according to the present invention.
FIG. 11 illustrates masking thresholds for a plurality of object signals according to the present invention.
FIG. 12 is a block diagram of an embodiment of an encoder that calculates a masking threshold for a plurality of object signals according to the present invention.
FIG. 13 illustrates the arrangement according to the ITU-R recommendation for a 5.1-channel setup and a case in which the speakers are arranged at arbitrary positions.
FIG. 14 is a block diagram of an embodiment in which a decoder for an object bit string and a flexible rendering system using the decoder are connected.
FIG. 15 is a block diagram of another embodiment implementing decoding and rendering of an object bit string according to the present invention.
FIG. 16 shows a structure for determining and transmitting a transmission plan between a decoder and a renderer.
FIG. 17 is a conceptual diagram of reproducing, using surrounding channels, the front speakers that are absent because of the display in a 22.2-channel system.
FIG. 18 is a flowchart of a method of processing a sound source according to an embodiment of the present invention.
FIG. 19 illustrates an example of mapping a signal generated in each band to speakers disposed around the TV.
FIG. 20 shows an embodiment of the frequency response of a composite speaker.
FIG. 21 illustrates an example of generating a virtual speaker.
FIG. 22 shows a structure for generating a virtual speaker.
FIG. 23 shows the relationship among products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented.
Both the foregoing general description and the following detailed description are exemplary and explanatory; they illustrate the present invention and do not limit its scope, which should be interpreted to include modifications and variations that do not depart from the spirit of the invention. The terms and accompanying drawings used herein serve to facilitate understanding of the invention, and shapes shown in the drawings may be exaggerated for clarity where necessary; the invention is not limited by these terms and drawings. In the following description, detailed descriptions of known functions and configurations are omitted where they would obscure the subject matter of the invention. In the present invention, the following terms may be interpreted according to the criteria below, and terms not described may be construed accordingly: coding may be interpreted as encoding or decoding as the occasion demands, and information is a term encompassing values, parameters, coefficients, elements, and the like, though the present invention is not limited thereto.
According to an aspect of the present invention, there is provided a method of processing an audio signal, comprising: receiving a bit string including an object signal and object position information; decoding the object signal and the object position information from the received bit string; retrieving past object position information from a storage medium; generating an object movement path using the retrieved past object position information and the decoded object position information; generating a time-varying gain value using the generated movement path; generating a modified variable gain value using the generated variable gain value and a weighting function; and generating a channel signal from the decoded object signal using the modified variable gain value.
In addition, the weighting function may be changed based on a physiological characteristic of the user.
In addition, the physiological characteristic may be extracted from a still image or a video.
In addition, the physiological characteristic may include at least one of information on the shape of the user's head, the user's body, and the shape of the outer ear.
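The movement-path and weighting steps of the claim above can be sketched as follows. The straight-line path and the one-pole smoother standing in for the listener-dependent weighting function are both assumptions; the patent only states that a path is generated and a weighting function applied.

```python
def movement_path(past_pos, new_pos, n_frames):
    """Linear movement path (n_frames >= 2) from the stored past position
    to the newly decoded position; positions are (x, y, z) tuples."""
    return [tuple(p + (n - p) * t / (n_frames - 1) for p, n in zip(past_pos, new_pos))
            for t in range(n_frames)]

def apply_weighting(variable_gains, alpha):
    """Modify the per-frame variable gain with a weighting function; a
    one-pole smoother with coefficient alpha in (0, 1] stands in for the
    physiologically derived weighting described in the text."""
    out, prev = [], variable_gains[0]
    for g in variable_gains:
        prev = alpha * g + (1.0 - alpha) * prev
        out.append(prev)
    return out
```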
Hereinafter, a method and apparatus for processing an object audio signal according to an embodiment of the present invention will be described.
FIG. 1 illustrates viewing angles according to image size (e.g., UHDTV and HDTV) at the same viewing distance. As display technology has developed, image sizes have grown in accordance with consumer demand. As shown in FIG. 1, a UHDTV (7680*4320 pixel image, 110) is about 16 times larger than an HDTV (1920*1080 pixel image, 120). If an HDTV is installed on the living-room wall and the viewer sits on the sofa at a certain viewing distance, the viewing angle may be about 30 degrees. When a UHDTV is installed at the same viewing distance, however, the viewing angle reaches about 100 degrees. When such a large, high-resolution screen is installed, it is desirable to provide sound with a sense of presence and impact to match the content. Since one or two surround-channel speakers may not suffice to give the viewer nearly the same experience as being at the scene, a multi-channel audio environment with a larger number of speakers and channels may be required.
In addition to the home theater environment, multi-channel audio may also be applied to personal 3DTVs, smartphone TVs, 22.2-channel audio programs, vehicles, 3D video, telepresence rooms, cloud-based gaming, and the like.
FIG. 2 shows a speaker arrangement of 22.2 channels as an example of a multi-channel environment. 22.2 channels is one example of a multi-channel environment for enhancing the sound field; the present invention is not limited to a specific number of channels or a specific speaker arrangement. Referring to FIG. 2, a total of nine channels may be provided in the top layer.
Transmitting and reproducing multi-channel signals of up to several tens of channels in this manner may require a large amount of computation, and a high compression ratio may be required in view of the communication environment. Moreover, most households do not have a multi-channel (e.g., 22.2-channel) speaker environment; many listeners have a 2-channel or 5.1-channel setup. If the multi-channel signal must be converted to 2-channel and 5.1-channel versions and transmitted separately, communication becomes inefficient, and storing the 22.2-channel PCM signal makes memory management inefficient as well.
FIG. 3 is a conceptual diagram showing the position of each sound object in the listening space in which the listener listens to 3D audio.
FIG. 4 shows an example of object signal groups formed from the objects shown in FIG. 3 using the grouping method according to the present invention.
A first method of forming a group is to group nearby objects in consideration of the position of each object on the sound scene.
FIG. 5 is a block diagram of an embodiment of an object audio signal encoder according to the present invention, including an object grouping unit (550).
FIG. 6 is a block diagram illustrating an embodiment of decoding a coded and transmitted signal. In the decoding process, a plurality of downmix signals are decoded and the object signals of each group are restored from them.
If the transmitted bit string includes a global gain and an object group gain, these can be applied to restore the original level of each object signal. These gain values can also be controlled in the rendering or transcoding process: the level of the entire signal is controlled through the global gain, and the level of each group through its object group gain. For example, when object grouping is performed on the basis of the playback speakers, the object group gain can simply be adjusted to implement the flexible rendering described later.
At this time, although a plurality of parameter encoders or decoders are illustrated as being processed in parallel for convenience of explanation, it is also possible to sequentially perform encoding or decoding of a plurality of object groups through one system.
Another method of forming an object group is to group objects having low mutual correlation into one group. This reflects a characteristic of parametric encoding: objects with high correlation are difficult to separate from the downmix. At this time, parameters such as the downmix gain may be adjusted during downmixing so that the grouped objects are further separated from one another, and the parameters used are preferably transmitted so that they can be used for signal restoration at decoding.
Another method of forming an object group is to group objects having a high degree of correlation into one group. Although objects with high correlation are difficult to separate using parameters, grouping them increases compression efficiency in applications where separate rendering of those objects is not needed. Since a core codec spends many bits on complex signals with diverse spectra, encoding highly correlated objects with a single core codec is efficient.
Another method of forming an object group is to judge whether objects mask one another before encoding. For example, when object A masks object B and the two signals are placed in one downmix encoded by the core codec, object B may effectively be omitted in the encoding process; obtaining object B at the decoding end using the parameters then yields large distortion. Objects A and B in such a relationship are therefore preferably placed in separate downmixes. Conversely, when objects A and B are in a masking relationship but need not be rendered separately, or when the masked object at least needs no separate processing, it is efficient to place them in the same downmix. The selection method may thus differ by application. For example, if a particular object is masked, or at least negligible, in the desired sound scene during encoding, it may be excluded from the object list, or the two objects may be combined and represented as a single object.
Another method of forming an object group is to separate objects that are not point sources, such as plane-wave sources or ambient sources, and group them separately. Because their characteristics differ from those of point sources, such sources need different compression methods and parameters, so it is preferable to group them apart.
The object information decoded for each group is restored to the original objects through object degrouping, by referring to the transmitted grouping information.
FIG. 7 shows an example of a bit string generated by the encoding method according to the present invention. Referring to FIG. 7, the main bit string (700), in which the encoded channel or object data is transmitted, is arranged in channel-group or object-group order.
(A method of allocating bits for each object group)
When generating a downmix for each of a plurality of groups and performing independent parametric object coding per group, the number of bits used by each group may differ. Criteria for allocating bits per group may include the number of objects in the group, the number of effective objects considering the masking effect among them, a weight according to position considering human spatial resolution, the sound-pressure levels of the objects, and the importance of the objects in the sound scene. For example, given three spatial object groups A, B, and C, bits may be allocated in proportion to the number of effective object signals each group contains.
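The proportional allocation just described can be sketched as follows. The weight definition is left abstract on purpose: the patent lists several candidate criteria (effective object count, positional weight, sound-pressure level, importance) without fixing one, so the weights here are illustrative inputs.

```python
def allocate_group_bits(total_bits, group_weights):
    """Divide the frame bit budget across object groups in proportion to a
    per-group weight (e.g. number of effective objects after masking,
    positional weight, sound-pressure level, or importance)."""
    total_w = sum(group_weights.values())
    alloc = {g: int(total_bits * w / total_w) for g, w in group_weights.items()}
    # hand rounding leftovers to the most heavily weighted group
    heaviest = max(group_weights, key=group_weights.get)
    alloc[heaviest] += total_bits - sum(alloc.values())
    return alloc
```

With groups A, B, C weighted 3:2:1, a 600-bit budget splits as 300/200/100, and the budget is always spent exactly.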
(Main object and sub object position information encoding in object group)
On the other hand, for object information it is desirable to have a means of transmitting, as metadata, the position and level of each object, either as recommended by the producer's intent or as mix information proposed by another user. In the present invention this is referred to as preset information for convenience. When position information is carried by a preset, particularly for a dynamic object whose position changes over time, the amount of information to transmit is considerable. For example, transmitting per-frame position information for 1000 objects amounts to a very large quantity of data. The object position information should therefore be transmitted efficiently, and the present invention uses an efficient position-encoding method based on the definitions of main object and sub-object.
The main object is an object whose position is expressed as an absolute coordinate value in three-dimensional space. A sub-object is an object whose position in three-dimensional space is expressed relative to the main object, so a sub-object must know which main object it corresponds to. When grouping is performed based on position in space in particular, one object in each group can serve as the main object and the rest as its sub-objects. If there is no grouping for encoding, or if it is not advantageous to encode sub-object position information within the encoding groups, a separate set may be formed for position-information encoding. The objects belonging to such a group or set are preferably located within a limited spatial range, so that expressing sub-object positions relative to the main object is more advantageous than absolute values.
Another position-information encoding method according to the present invention expresses the object position relative to fixed speaker positions rather than relative to the main object. For example, the relative position information of an object is expressed based on the designated position values of the 22-channel speaker layout. The number of reference speakers and their position values can be set based on the values used in the current content.
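The main/sub-object scheme above amounts to delta-coding positions within a group, which can be sketched as follows. Treating the first object of the group as the main object is an assumption for illustration.

```python
def encode_group_positions(positions):
    """Main/sub-object position coding: the first object of the group is
    taken as the main object with absolute (x, y, z) coordinates; the
    remaining sub-objects carry only their offsets from it."""
    main = positions[0]
    offsets = [tuple(p - m for p, m in zip(pos, main)) for pos in positions[1:]]
    return main, offsets

def decode_group_positions(main, offsets):
    """Inverse operation: restore absolute positions from the main object
    position and the sub-object offsets."""
    return [main] + [tuple(m + d for m, d in zip(main, off)) for off in offsets]
```

Because grouped objects lie in a limited spatial range, the offsets are small numbers, which quantize to fewer bits than full absolute coordinates.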
In another embodiment of the present invention, quantization is performed after the position information is expressed as an absolute or relative value, and the quantization step is varied based on the absolute position. For example, since a listener's ability to discriminate position is known to be significantly higher toward the front than toward the sides or rear, it is desirable to set the quantization step so that the frontal resolution is higher than the lateral resolution. Likewise, since human resolution for azimuth is higher than for elevation, it is preferable to quantize the azimuth angle more finely than the elevation angle.
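A position-dependent quantizer of this kind can be sketched as below. The 2/5/10-degree steps and the 30/90-degree region boundaries are illustrative values, not taken from the patent.

```python
def quantize_azimuth(azimuth_deg):
    """Quantize azimuth with a position-dependent step: fine in front,
    where listeners discriminate direction best, coarser toward the
    sides and rear. Step sizes and region boundaries are illustrative."""
    a = abs(azimuth_deg)
    step = 2.0 if a <= 30.0 else 5.0 if a <= 90.0 else 10.0
    return round(azimuth_deg / step) * step
```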
According to another embodiment of the present invention, for a dynamic object whose position varies over time, the position may be expressed relative to the object's previous position value instead of relative to the main object or another reference point. Flag information distinguishing whether the dynamic object's position is referenced temporally (to its preceding position) or spatially (to a neighboring reference point) is therefore preferably transmitted together.
(Decoder overall architecture)
FIG. 8 is a block diagram of an object and channel signal decoding system according to the present invention. The system may receive an object signal, a channel signal, or a combination of the two.
FIG. 9 is a block diagram of another object and channel signal decoding system according to the present invention. Similarly, this system may receive an object signal, a channel signal, or a combination of the two.
(Individual for channel, parameter combination, residual)
FIG. 10 illustrates a configuration of an encoder and a decoder according to another embodiment of the present invention.
FIG. 10 shows a structure for scalable coding when the speaker setup of the decoder is different.
The encoder includes a downmixing unit.
Here, the upmix parameter UP is a parameter required to upmix one or more channels into two or more channels, and may include a spatial parameter and an inter-channel phase difference (IPD).
The upmix residual signal UR corresponds to the residual, i.e., the difference between the original signal CH_N and the recovered signal, where the recovered signal is obtained by applying the upmix parameter UP to the downmix DMX.
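The residual relationship can be sketched numerically as below. Modelling the upmix parameter as a single per-channel gain is a simplification; in practice UP carries spatial parameters and IPD, but the residual identity (upmix plus UR restores the original) holds the same way.

```python
def upmix_residual(original, downmix, upmix_gain):
    """Upmix residual UR: per-sample difference between the original
    channel signal CH_N and the signal recovered by applying the upmix
    parameter (modelled here as a single gain) to the downmix DMX."""
    return [o - upmix_gain * d for o, d in zip(original, downmix)]

def recover_original(downmix, upmix_gain, residual):
    """Decoder side: applying the upmix and adding the residual restores
    the original channel signal exactly."""
    return [upmix_gain * d + r for d, r in zip(downmix, residual)]
```

This is why a decoder that parses UR on top of UP can scale up from parametric quality to a lossless-style reconstruction of CH_N.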
However, according to another embodiment of the present invention, this disadvantage can be overcome by selectively providing one (or more) of the first to third decoders according to the setup environment of each device.
Each device selectively includes one or more of the first to third decoders and selectively parses the upmix parameter UP and the upmix residual UR from the bit string, generating a signal suited to its own speaker setup directly; this reduces complexity and the amount of computation.
(Object Waveform Coding Considering Masking)
Waveform encoding of an object according to the present invention (hereinafter, waveform coding; also called discrete encoding/decoding, as a concept opposed to parametric encoding/decoding) refers to encoding a channel or object audio signal so that each channel or object can be decoded independently, with bits allocated in consideration of the object's position on the sound scene. This is based on the psychoacoustic binaural masking level difference (BMLD) phenomenon and on the features of object signal coding.
To explain the BMLD phenomenon, consider the MS (mid/side) stereo coding used in conventional audio coding. Psychoacoustic masking is possible only when the masker and the maskee are spatially in the same direction. When the two channels of a stereo signal are highly correlated and equal in level, the sound image forms at the center between the two speakers; with no correlation, independent images form at each speaker. If such a maximally correlated input is coded dual-mono, each channel's quantization noise is uncorrelated with the other channel's, so the audio image sits at the center while the quantization noise localizes separately at each speaker. The noise that should be masked then fails to be masked because of this spatial mismatch, and is heard by the listener as distortion. Sum-difference coding solves this by forming a mid signal (the sum of the two channels) and a side signal (their difference), running the psychoacoustic model on these signals, and quantizing them, so that the quantization noise is positioned at the same place as the sound image.
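The sum-difference transform itself is simple and exactly invertible, which is why the psychoacoustic model and quantizer can operate entirely in the mid/side domain:

```python
def ms_encode(left, right):
    """Mid/side (sum/difference) coding: the psychoacoustic model and
    quantization are applied to the mid (sum) and side (difference)
    signals, keeping quantization noise spatially aligned with the image."""
    mid = [(l + r) * 0.5 for l, r in zip(left, right)]
    side = [(l - r) * 0.5 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse transform: reconstruct the left/right channels."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

For a fully correlated (identical) pair of channels, the side signal is zero, so all noise shaped onto the mid signal localizes at the center, exactly where the sound image is.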
In conventional channel coding, each channel is mapped to a speaker, and since the speaker positions are fixed and separated from one another, masking between channels cannot be exploited. However, when each object is encoded independently, whether an object is masked may change according to its location on the sound scene. It is therefore desirable to determine whether the object currently being encoded is masked by another object, and to allocate bits accordingly.
FIG. 11 shows masking thresholds (1130) for a plurality of object signals.
Another masking-threshold calculation method according to the present invention applies when the positions of two object signals do not coincide within the audible-angle criterion: instead of simply adding the masking thresholds of the two objects, the contribution of the other object is attenuated according to the spatial separation between them. With M1(f) the masking threshold of object 1 and M2(f) that of object 2, the final joint masking thresholds M1'(f) and M2'(f) used to encode each object are generated using an attenuation factor A(f).
Here, A(f) is an attenuation factor determined by the spatial positions of and distance between the two objects and by their properties, with 0.0 ≤ A(f) ≤ 1.0.
Human directional resolution deteriorates from the front toward the sides and is worst toward the rear; therefore the absolute position of an object can serve as another factor determining A(f).
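One plausible reading of the combination rule is sketched below: each object's joint threshold is its own threshold plus the other object's threshold scaled by A(f). The additive form is an assumption; the patent specifies only that the other object's contribution is attenuated by A(f) in [0, 1], shrinking as the objects separate in space.

```python
def joint_masking_thresholds(m1, m2, attenuation):
    """Joint masking thresholds M1'(f), M2'(f) per frequency bin: each
    object's own threshold plus the other object's threshold attenuated
    by A(f) in [0, 1]. The additive combination is an assumption."""
    assert all(0.0 <= a <= 1.0 for a in attenuation)
    m1p = [s + a * o for s, o, a in zip(m1, m2, attenuation)]
    m2p = [s + a * o for s, o, a in zip(m2, m1, attenuation)]
    return m1p, m2p
```

With A(f) = 1 this reduces to plain threshold addition (coincident objects); with A(f) = 0 each object keeps only its own threshold (fully separated objects).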
According to another embodiment of the present invention, one of the two objects uses only its own masking threshold, while the threshold of the other object is obtained taking the first object into account. The former is called an independent object and the latter a dependent object. Since the independent object is encoded at high quality regardless of the other object, its sound quality can be preserved even when it is spatially separated from that object.
Whether each object is an independent or a dependent object is preferably transmitted as per-object additional information and delivered to the renderer.
In another embodiment according to the present invention, when two objects are spatially close to each other, the signals themselves may be combined into one object instead of merely generating joint masking thresholds.
In another embodiment according to the present invention, particularly when parametric coding is performed, it is preferable to combine the two signals into one object in consideration of both their correlation and their spatial positions.
(Transcoding feature)
In another embodiment according to the present invention, when transcoding a bit string containing coupled objects, particularly to a lower bit rate, and the number of objects is reduced to shrink the data size, it is preferable to downmix the coupled objects and represent them as a single object.
In the description of encoding through inter-object coupling, only the coupling of two objects has been described for convenience, but coupling of more than two objects can be implemented in a similar manner.
(Flexible rendering required)
Among the technologies required for 3D audio, flexible rendering is one of the key problems to be solved to maximize 3D audio quality. It is well known that the positions of 5.1-channel speakers vary greatly with the structure of the living room and the arrangement of furniture. Even with speakers at such irregular positions, a content producer should still be able to deliver the intended sound scene. To do so, the renderer must know the speaker environment of each user's reproduction setup, and a rendering technique is needed to compensate for the differences from the standard layout. In other words, the codec's role does not end with decoding the transmitted bit string according to the decoding method; a series of techniques is required to optimize and transform the result for the user's reproduction environment.
FIG. 13 shows a 5.1-channel layout arranged according to the ITU-R recommendation (gray, 1310) and a case in which the speakers are placed at arbitrary positions (red, 1320). In an actual living-room environment, both the azimuth angles and the distances may differ from the ITU-R recommendation (and, although not shown, the speakers' heights may differ as well). It is difficult to provide the ideal 3D sound scene when the original channel signals are reproduced from such displaced speaker positions.
(Flexible rendering)
Flexible rendering can be implemented relatively easily for object signals transmitted on an object-by-object basis, using amplitude panning, which positions a sound source between two speakers by controlling the relative signal magnitudes, or Vector-Based Amplitude Panning (VBAP), which is widely used to position a sound source in three-dimensional space using three speakers. This is one of the advantages of transmitting object signals instead of channel signals.
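The amplitude-panning and VBAP techniques referred to above can be sketched as follows. This is a minimal illustrative implementation, not the method of any particular embodiment; the function names and the stereo half-angle default are assumptions.

```python
import numpy as np

def stereo_pan_gains(azimuth_deg, speaker_half_angle_deg=30.0):
    """Tangent-law amplitude panning between two speakers at +/- the half angle.
    Valid for |azimuth_deg| <= speaker_half_angle_deg."""
    phi = np.radians(azimuth_deg)
    phi0 = np.radians(speaker_half_angle_deg)
    ratio = np.tan(phi) / np.tan(phi0)      # (gL - gR) / (gL + gR)
    gl = (1.0 + ratio) / 2.0
    gr = (1.0 - ratio) / 2.0
    norm = np.hypot(gl, gr)                 # constant-power normalization
    return gl / norm, gr / norm

def vbap_gains_2d(source_dir, spk_dir_a, spk_dir_b):
    """2-D VBAP: express the source direction as a nonnegative combination
    of two speaker direction vectors, then normalize for constant power."""
    L = np.column_stack([spk_dir_a, spk_dir_b]).astype(float)
    g = np.linalg.solve(L, np.asarray(source_dir, dtype=float))
    g = np.clip(g, 0.0, None)               # negative gain => source outside the pair
    return g / np.linalg.norm(g)
```

A centered source yields equal constant-power gains on both speakers; the 3-D case extends this to a 3x3 system over speaker triplets.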
(Object Decoding and Rendering Structure)
FIG. 14 shows the structures (1400 and 1401) of two embodiments according to the present invention in which a decoder for an object bit stream and a flexible rendering system using the decoded objects are connected to each other. As described above, an object has the advantage that it can be positioned as a sound source at any location in accordance with the desired sound scene.
FIG. 15 shows the structure of still another embodiment implementing decoding and rendering of an object bit stream according to the present invention. Compared with the case of FIG. 14, the rendering can be implemented more directly.
(Flexible rendering with channel attached)
On the other hand, when a channel signal is transmitted as an input and the position of the speaker corresponding to that channel is changed to an arbitrary position, the panning method used for objects is difficult to apply, and a separate channel-mapping process is required. The problem is that the processes and solutions required for rendering object signals and channel signals differ from each other; therefore, when object signals and channel signals are transmitted simultaneously and a sound scene mixing the two is desired, spatial distortion caused by the mismatch is likely to occur. To solve this problem, according to another embodiment of the present invention, flexible rendering is not performed separately on the object; instead, the object is first mixed into the channel signals, and flexible rendering is then performed on the resulting channel signals. Rendering using HRTFs is preferably implemented in the same manner.
(Decoded downmix: parameter transmission or automatic generation)
In the case of downmix rendering, that is, reproducing multi-channel content through a smaller number of output channels, it has so far been common to implement it with an M-by-N downmix matrix (where M is the number of input channels and N is the number of output channels). That is, when 5.1-channel content is reproduced in stereo, the downmix is performed by a given equation. However, such a downmix implementation has a computational problem: all bit strings corresponding to the 22.2 transmitted channels must be decoded even though the user's playback environment is only 5.1 channels. If all 22.2 channel signals must be decoded merely to generate a stereo signal for reproduction on a portable device, the computational burden is very high, and a huge amount of memory is wasted (on storing the 22.2-channel decoded audio signals).
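The M-by-N downmix described above can be sketched as follows. The coefficients shown are the common ITU-style 1/√2 values and are illustrative only; all names are hypothetical and not taken from any embodiment.

```python
import numpy as np

# Illustrative 5.1 -> stereo downmix matrix. Rows: Lo, Ro outputs;
# columns: FL, FR, FC, LFE, SL, SR inputs. The 1/sqrt(2) center/surround
# coefficients follow the common ITU-style downmix equation; real systems
# may use content- or codec-specific coefficients.
c = 1.0 / np.sqrt(2.0)
DOWNMIX_5_1_TO_STEREO = np.array([
    [1.0, 0.0, c, 0.0, c,   0.0],   # Lo = FL + 0.707*FC + 0.707*SL
    [0.0, 1.0, c, 0.0, 0.0, c  ],   # Ro = FR + 0.707*FC + 0.707*SR
])

def downmix(channels, matrix):
    """Apply an N x M downmix matrix to an M x T block of channel samples."""
    channels = np.asarray(channels, dtype=float)
    return matrix @ channels            # result: N x T output channels
```

Note that this straightforward approach still requires all M input channels to be fully decoded first, which is exactly the computational problem the passage above describes.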
(Transcoding as an alternative to downmix)
As an alternative, a method of converting the huge 22.2-channel original bit stream into a bit stream suitable for the target device or target playback space through effective transcoding can be considered. For example, if the 22.2-channel content is stored on a cloud server, a scenario can be implemented in which reproduction-environment information is received from the client terminal and the content is converted accordingly before transmission.
(Decoding order or downmix order; order control unit)
On the other hand, in a scenario in which the decoder and the renderer are separated, there may be a case where, for example, 50 object signals must be decoded together with a 22.2-channel audio signal and transmitted to the renderer. Since the decoded audio is a high-data-rate signal, a very large bandwidth is required between the decoder and the renderer. It is therefore undesirable to transmit such a large amount of data all at once, and an effective transmission plan should be established; it is preferable that the decoder determine the decoding order and transmit the signals accordingly. FIG. 16 is a block diagram showing a structure for determining a transmission plan between the decoder and the renderer and transmitting data accordingly.
(Voice highway)
On the other hand, an object corresponding to voice for bidirectional communication may be included in the bit stream. Since bidirectional communication, unlike other content, is very sensitive to time delay, the corresponding object or channel signal should be transmitted to the renderer first when received. The corresponding object or channel signal can be indicated by a separate flag or the like. Such a priority-transmitted object is, unlike the other objects/channels, handled independently of the other object/channel signals contained in the same frame and of their presentation time.
(AV Matching and Phantom Center)
One of the new problems arising when considering ultra-high-definition TV (UHDTV) is the so-called near-field situation. That is, considering the viewing distance of a typical user environment (a living room), the distance from each reproduction speaker to the listener is shorter than the distance between the speakers, so that each speaker operates as a point source. Moreover, in the absence of a speaker at the center of a large screen, the spatial resolution of the sound objects synchronized to the video must be very high to enable a high-quality 3D audio service.
With a conventional audiovisual angle of about 30 degrees, the stereo speakers arranged on the left and right are not in a near-field situation and are sufficient to provide a sound scene matching the movement of an object on the screen (for example, a vehicle moving from left to right). In the UHDTV environment, however, where the audiovisual angle reaches 100 degrees, not only left-right resolution but also additional resolution over the top and bottom of the screen is required. For example, if there are two characters on the screen, on a current HDTV it would not seriously harm the sense of realism even if both voices appeared to come from the same position; at UHDTV size, however, such a mismatch will be perceived as a new form of distortion.
One solution to this problem is the 22.2-channel speaker configuration. FIG. 2 is an example of a 22.2-channel arrangement. According to FIG. 2, a total of eleven loudspeakers are arranged at the front to increase the front's horizontal and vertical spatial resolution: five speakers are placed in the middle layer of the front, and three speakers each were added in the upper and lower layers so that sound elevation can be sufficiently accommodated. With such an arrangement, the spatial resolution of the front is higher than in the past, which is advantageous for matching with the video signal. However, in current TVs using display devices such as LCDs and OLEDs, the display occupies the very positions where speakers should be present. That is, unless the display itself emits sound or is made of a sound-transmitting material, the speakers must be placed outside the display area while still providing sound matched to each object position on the screen. In FIG. 2, the speakers corresponding to FLc, FC, and FRc are disposed at positions overlapping the display.
FIG. 17 is a conceptual diagram illustrating how, in the 22.2-channel system, the front speakers that are absent because of the display are reproduced using neighboring channels. To cover the FLc, FC, and FRc components, additional speakers may also be placed around the upper and lower edges of the display, as indicated by the dotted circles. According to FIG. 17, there may be seven neighboring channels that can be used to generate FLc. Using these seven speakers, a sound corresponding to the position of the absent speaker can be reproduced on the principle of generating a virtual source.
Techniques and properties such as VBAP or the Haas effect (precedence effect) can be used to create virtual sources with the neighboring speakers. Alternatively, different panning techniques may be applied depending on the frequency band. Further, changing the azimuth and adjusting the perceived height using HRTFs can be considered. For example, when replacing FC with BtFC, this can be implemented by applying an HRTF with a rising-elevation characteristic to the FC channel signal and adding it to BtFC. The observation from HRTFs is that, to control the perceived height of a sound, the position of a specific null in the high-frequency band (which varies from person to person) must be controlled. To generalize over this person-to-person variation of the null, however, height adjustment can instead be implemented by broadly boosting or attenuating the high-frequency band. A disadvantage of this method is that the filter distorts the signal.
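The broadband high-frequency boost/cut just described for height adjustment can be sketched, under simplifying assumptions, as a crude FFT-domain shelf. A real implementation would use a proper shelving filter; the function name, shelf frequency, and gain defaults are illustrative only.

```python
import numpy as np

def adjust_height_cue(signal, sample_rate, shelf_hz=4000.0, gain_db=3.0):
    """Broadly boost (or cut, with negative gain_db) everything above shelf_hz
    to nudge the perceived elevation of a channel signal. Brickwall FFT sketch;
    this is exactly the kind of wide-band manipulation that introduces the
    filter distortion mentioned in the text."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    gain = np.ones_like(freqs)
    gain[freqs >= shelf_hz] = 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spectrum * gain, n=len(signal))
```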
FIG. 18 shows the processing method according to the present invention for positioning a sound source at the location of such an absent (phantom) speaker. Referring to FIG. 18, the channel signal corresponding to the phantom speaker position is taken as the input signal, and the input signal is divided into three bands by passing through a subband filter unit.
The second band is to be reproduced through speakers around the phantom speaker position (speakers disposed in the bezel of the TV display and its surroundings); it is divided between at least two such speakers and reproduced, and a time delay filter may be applied so that a panning or precedence (Haas) effect can be exploited.
The third band may be used to generate a signal to be reproduced through a speaker array, if one is present, and a speaker array control unit may process the signal for the array.
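The three-band division described in connection with FIG. 18 can be sketched as follows. Brickwall FFT masks stand in here for the subband filter unit, and the crossover frequencies shown are illustrative assumptions, not values from any embodiment.

```python
import numpy as np

def split_three_bands(signal, sample_rate, f_low=200.0, f_high=4000.0):
    """Split a phantom-channel signal into three bands with brickwall FFT masks.
    Band 1 (< f_low) would go to a woofer/LFE path, band 2 (f_low..f_high) is
    panned to the speakers around the phantom position, and band 3 (> f_high)
    may feed a speaker array. A real system would use proper crossover filters."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    masks = [freqs < f_low,
             (freqs >= f_low) & (freqs < f_high),
             freqs >= f_high]
    # The masks partition the spectrum, so the three bands sum back to the input.
    return [np.fft.irfft(spectrum * m, n=len(signal)) for m in masks]
```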
FIG. 19 shows an embodiment in which the signal generated in each band is mapped to speakers disposed around the TV. According to FIG. 19, the speakers corresponding to the second and third bands must be located at relatively accurately defined positions, and their number and position information is preferably provided to the processing system of FIG. 18.
(Composite speaker)
A composite speaker divides the audio frequency range into at least two bands and has at least two kinds of driver units, each suited to its band. A composite speaker divided into N bands is also called an N-way speaker. The boundary frequency, or cutoff frequency, between bands is referred to as the crossover frequency. FIG. 20 shows an example of the frequency responses of the two drivers constituting a two-way speaker.
FIG. 21 is a schematic view showing at least two composite speakers having different azimuth angles and different heights. For convenience, two-way speakers are used in the illustration, but the present invention is applicable to any N-way speaker.
The detent effect refers to a phenomenon in which, when a sound image is positioned across a plurality of speakers, the image is not localized at the desired position but is pulled toward the nearest speaker. This causes the sound image to jump from one speaker to another instead of moving continuously, as if moving in discrete steps. When the sound image is positioned using the conventional amplitude panning method, it is therefore not accurately localized, owing to the detent effect. However, if a virtual speaker is created and used for playback, the sound image can be correctly positioned between the speakers.
FIG. 22 is a diagram of a method for creating such a virtual speaker. Using the crossover frequency of the composite loudspeaker and the input signal, the energy ratio of the two (or more) bands on either side of the crossover frequency is obtained, and the gains of the output channels can be calculated from this energy ratio. Since the position of the virtual speaker may vary with the physiological characteristics of the user, when the user's physiological characteristics are given as an input, the gain calculation unit may take them into account in computing the gain values.
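The band-energy-ratio computation described above can be sketched as follows. The constant-power mapping from energy ratio to output gain is an assumption chosen for illustration, not necessarily the mapping used by the gain calculation unit, and all names are hypothetical.

```python
import numpy as np

def band_energy_ratio(signal, sample_rate, crossover_hz):
    """Fraction of the input signal's energy below and above the crossover."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    e_low = spectrum[freqs < crossover_hz].sum()
    e_high = spectrum[freqs >= crossover_hz].sum()
    total = e_low + e_high
    return e_low / total, e_high / total

def composite_pair_gains(signal, sample_rate, crossover_hz):
    """Map the band-energy split to output gains for the two drivers that
    bracket the virtual-speaker region (constant-power: gains' squares sum to 1)."""
    r_low, r_high = band_energy_ratio(signal, sample_rate, crossover_hz)
    return np.sqrt(r_low), np.sqrt(r_high)
```

For a signal whose energy lies mostly below the crossover, nearly all of the output gain is assigned to the low-band driver, which is the behavior the virtual-speaker construction relies on.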
FIG. 23 is a diagram illustrating the relationship among products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented. Referring to FIG. 23, the wired/wireless communication unit 310 receives a bit stream through a wired or wireless communication scheme, and may include at least one of a wired communication unit 310A, an infrared communication unit 310B, a Bluetooth unit 310C, and a wireless LAN communication unit 310D.
The user authentication unit 320 receives user information and performs user authentication. It may include at least one of a fingerprint recognition unit 320A, an iris recognition unit 320B, a face recognition unit 320C, and a voice recognition unit 320D; these receive fingerprint, iris, facial-contour, and voice information, respectively, convert it into user information, and perform user authentication by determining whether the user information matches previously registered user data.
The audio signal processing method according to the present invention may be implemented as a program to be executed by a computer and stored in a computer-readable recording medium, and multimedia data having the data structure according to the present invention may likewise be stored in a computer-readable recording medium. The computer-readable recording medium includes all kinds of storage devices in which data readable by a computer system is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and the medium may also be implemented in the form of a carrier wave (for example, transmission via the Internet). In addition, the bit stream generated by the encoding method may be stored in a computer-readable recording medium or transmitted over a wired/wireless communication network.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. It will be understood that various modifications and changes may be made without departing from the scope of the appended claims.
210: Upper layer
220: middle layer
230: bottom layer
240: LFE channel
310: wired / wireless communication device
310A: Wired communication device
310B: Infrared device
310C: Bluetooth device
310D: Wireless LAN device
320: User authentication device
320A: Fingerprint Recognition
320B: iris recognition device
320C: Facial recognition system
320D: Speech Recognition Device
330: input device
330A: Keypad
330B: Touchpad
330C: Remote control unit
340: Signal encoding device
345: audio signal processing device
350: Control device
360: Output device
360A: Speaker
360B: Display
410: first object signal group
420: second object signal group
430: Listener
520: Downmixer &
550: Object grouping unit
560: Waveform coder
620: Waveform Decoder
630: Upmixer &
670: Object degrouping unit
860: 3DA decoding unit
870:
1110: Masking threshold curve of
1120: masking threshold curve of
1130: masking threshold curve of
1230: Psychoacoustic model
1310: 5.1 channel loudspeaker placed according to ITU-R Recommendation
1320: 5.1-channel loudspeakers placed at arbitrary positions
1610: 3D audio decoder
1620: 3D Audio Renderer
1630:
1810: Subband filter section
1820: Time delay filter unit
1830: Panning Algorithm
1840: speaker array control unit
2010: Crossover frequency
2020: Frequency response of woofer speaker
2030: Frequency response of tweeter speaker
2010: Woofer speaker of lower speaker
2020: Tweeter speaker of lower speaker
2030: Lower speaker
2040: Upper speaker
2050: Upper speaker's woofer speaker
2060: presence area of virtual speaker
2070: Virtual Speaker
2080: gain calculator
Claims (5)
Obtaining position information of each of the plurality of composite speakers;
Obtaining a crossover frequency of the composite speaker;
Receiving a bit string including an object signal and object position information;
Decoding the object signal and the object location information from the received bit stream;
Calculating energy distribution information of the decoded object signal;
Selecting two or more composite loudspeakers among the plurality of composite loudspeakers using the decoded object position information;
Generating an output gain value of the selected two or more composite speakers using the crossover frequency and the energy distribution of the object signal;
And rendering the decoded object signal into a plurality of channel signals using the generated gain value.
Wherein the generating of the gain value comprises calculating the gain value using a physiological characteristic of the user.
Wherein the physiological characteristic is extracted using a still image or a video.
Wherein the physiological characteristic includes at least one of information on the shape of the user's head, the user's body, and the shape of the user's external auditory canal.
And the energy distribution information of the object signal includes an energy value of the object signal.
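Taken together, the claimed steps can be sketched end to end as follows. This is a hypothetical illustration only: the speaker-selection rule (the two speakers nearest the object direction), the constant-power gain mapping, and all names are assumptions not specified by the claims.

```python
import numpy as np

def render_object(obj_signal, obj_pos, speakers, sample_rate):
    """Sketch of the claimed method. `speakers` is a list of dicts with
    'pos' (direction vector) and 'crossover_hz'; steps 1-4 of the claim
    (position/crossover acquisition and bit stream decoding) are assumed done."""
    # Step 6: select the two composite speakers closest to the object direction.
    dirs = np.array([s["pos"] for s in speakers], dtype=float)
    order = np.argsort(dirs @ np.asarray(obj_pos, dtype=float))[::-1]
    sel = order[:2]
    # Steps 5 and 7: energy distribution of the object signal around the
    # crossover frequency, mapped to constant-power output gains.
    fc = float(np.mean([speakers[i]["crossover_hz"] for i in sel]))
    spec = np.abs(np.fft.rfft(obj_signal)) ** 2
    freqs = np.fft.rfftfreq(len(obj_signal), d=1.0 / sample_rate)
    e_low = spec[freqs < fc].sum()
    e_high = spec[freqs >= fc].sum()
    gains = np.sqrt(np.array([e_low, e_high]) / (e_low + e_high))
    # Step 8: render the object into per-speaker channel signals.
    out = {i: np.zeros(len(obj_signal)) for i in range(len(speakers))}
    for gain, i in zip(gains, sel):
        out[i] += gain * np.asarray(obj_signal, dtype=float)
    return out
```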
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20130047059A KR20140128565A (en) | 2013-04-27 | 2013-04-27 | Apparatus and method for audio signal processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20130047059A KR20140128565A (en) | 2013-04-27 | 2013-04-27 | Apparatus and method for audio signal processing |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20140128565A true KR20140128565A (en) | 2014-11-06 |
Family
ID=52454412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR20130047059A KR20140128565A (en) | 2013-04-27 | 2013-04-27 | Apparatus and method for audio signal processing |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20140128565A (en) |
-
2013
- 2013-04-27 KR KR20130047059A patent/KR20140128565A/en not_active Application Discontinuation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9646620B1 (en) | Method and device for processing audio signal | |
KR102131748B1 (en) | Method and apparatus for encoding and decoding successive frames of an ambisonics representation of a 2- or 3-dimensional sound field | |
TWI744341B (en) | Distance panning using near / far-field rendering | |
US11488610B2 (en) | Audio decoder, audio encoder, method for providing at least four audio channel signals on the basis of an encoded representation, method for providing an encoded representation on the basis of at least four audio channel signals and computer program using a bandwidth extension | |
KR20140128564A (en) | Audio system and method for sound localization | |
CA2645912C (en) | Methods and apparatuses for encoding and decoding object-based audio signals | |
CN105075293A (en) | Audio apparatus and audio providing method thereof | |
CN104054126A (en) | Spatial audio rendering and encoding | |
JP6374980B2 (en) | Apparatus and method for surround audio signal processing | |
KR102148217B1 (en) | Audio signal processing method | |
KR101949756B1 (en) | Apparatus and method for audio signal processing | |
KR20140016780A (en) | A method for processing an audio signal and an apparatus for processing an audio signal | |
KR102059846B1 (en) | Apparatus and method for audio signal processing | |
KR101950455B1 (en) | Apparatus and method for audio signal processing | |
KR101949755B1 (en) | Apparatus and method for audio signal processing | |
KR20140128565A (en) | Apparatus and method for audio signal processing | |
JP6652990B2 (en) | Apparatus and method for surround audio signal processing | |
WO2020075286A1 (en) | Audio device and audio signal output method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |