CN106385660B - Processing object-based audio signals - Google Patents


Info

Publication number
CN106385660B
CN106385660B (application CN201510484949.8A)
Authority
CN
China
Prior art keywords
cluster
audio
gains
gain
positions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510484949.8A
Other languages
Chinese (zh)
Other versions
CN106385660A (en)
Inventor
陈连武
芦烈
J·布里巴特
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp filed Critical Dolby Laboratories Licensing Corp
Priority to CN201510484949.8A priority Critical patent/CN106385660B/en
Priority to PCT/US2016/045512 priority patent/WO2017027308A1/en
Priority to US15/749,750 priority patent/US10277997B2/en
Priority to EP16751763.0A priority patent/EP3332557B1/en
Publication of CN106385660A publication Critical patent/CN106385660A/en
Application granted granted Critical
Publication of CN106385660B publication Critical patent/CN106385660B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

Example embodiments disclosed herein relate to audio signal processing, where the audio signal has a plurality of audio objects. A method of processing an audio signal is disclosed. The method comprises obtaining an object position for each audio object, and determining cluster positions for grouping the audio objects into clusters based on the object positions, a plurality of object-to-cluster gains, and a set of metrics. The metrics indicate a quality of the cluster positions and a quality of the object-to-cluster gains; each of the cluster positions is a centroid of a respective one of the clusters, and each of the object-to-cluster gains defines a proportion of the respective audio object in one of the clusters. The method also includes determining the object-to-cluster gains based on the object positions, the cluster positions, and the set of metrics, and generating a cluster signal based on the determined cluster positions and object-to-cluster gains. Corresponding systems and computer program products are also disclosed.

Description

Processing object-based audio signals
Technical Field
Example embodiments disclosed herein relate generally to object-based audio processing and, more particularly, to a method and system for generating a cluster signal from an object-based audio signal.
Background
Traditionally, audio content in a multi-channel format (e.g., 5.1, 7.1) is created by mixing different audio signals in a studio, or generated by simultaneously recording acoustic signals in a real environment. Recently, object-based audio content has become increasingly popular because it carries audio objects and audio beds separately, so that the content can be rendered with improved accuracy compared with conventional rendering methods. Audio objects are individual audio elements that exist for a defined period of time and carry spatial information describing, for example, the position, velocity, and size of each object in the form of metadata. An audio bed, or bed, is an audio channel intended to be reproduced at a predefined, fixed loudspeaker position.
For example, a cinema soundtrack may include many different sound elements corresponding to on-screen images, dialog, noise, and sound effects that emanate from different locations on the screen and combine with background music and ambient effects to create the overall listening experience. Accurate playback requires that sounds be reproduced in a way that corresponds as closely as possible to the on-screen source location, intensity, movement, and depth.
During delivery of the audio signal, the beds and the objects may be transmitted separately and then used by a spatial rendering system to recreate the artistic intent with a plurality of speakers at known physical locations. In some cases, tens or even hundreds of individual audio objects may be included in the audio content. As a result, the advent of such object-based audio data has significantly increased the complexity of rendering audio data within playback systems.
The large number of audio signals present in object-based content presents new challenges for the encoding and distribution of such content. In some distribution and delivery systems, enough bandwidth may be available to send all audio beds and objects with little or no audio compression. However, in some cases, such as Blu-ray disc, broadcast (cable, satellite, and terrestrial), mobile (3G and 4G), and over-the-top (OTT) distribution, the available bandwidth cannot carry all of the bed and object information created by the audio mixer. Although audio coding (lossy or lossless) may be applied to reduce the required bandwidth, it may not be sufficient, particularly in very constrained networks such as mobile 3G and 4G networks.
Some existing approaches utilize clustering of audio objects in order to reduce the number of input objects and beds to a smaller set of output clusters. Thereby, computational complexity and memory requirements are reduced. However, accuracy may be compromised because existing methods only assign objects in a relatively crude manner.
Disclosure of Invention
Example embodiments disclosed herein propose a method and system for processing an audio signal for reducing the number of audio objects by assigning these objects into clusters, while maintaining performance in terms of accuracy of spatial audio reproduction.
In one aspect, example embodiments disclosed herein provide a method of processing an audio signal having a plurality of audio objects. The method comprises obtaining an object position for each audio object, and determining cluster positions for grouping the audio objects into clusters based on the object positions, a plurality of object-to-cluster gains, and a set of metrics. The metrics indicate a quality of the cluster positions and a quality of the object-to-cluster gains; each of the cluster positions is a centroid of a respective one of the clusters, and each of the object-to-cluster gains defines a proportion of the respective audio object in one of the clusters. The method also includes determining the object-to-cluster gains based on the object positions, the cluster positions, and the set of metrics, and generating a cluster signal based on the determined cluster positions and object-to-cluster gains.
In another aspect, example embodiments disclosed herein provide a system for processing an audio signal having a plurality of audio objects. The system includes an object position acquisition unit configured to acquire an object position for each audio object; and a cluster position determination unit configured to determine cluster positions for grouping the audio objects into clusters based on the object positions, a plurality of object-to-cluster gains, and a set of metrics. The metrics indicate a quality of the cluster positions and a quality of the object-to-cluster gains; each of the cluster positions is a centroid of a respective one of the clusters, and each of the object-to-cluster gains defines a proportion of the respective audio object in one of the clusters. The system further includes an object-to-cluster gain determination unit configured to determine the object-to-cluster gains based on the object positions, the cluster positions, and the set of metrics; and a cluster signal generation unit configured to generate a cluster signal based on the determined cluster positions and object-to-cluster gains.
From the following description, it will be understood that object-based audio signals containing audio objects and audio beds are greatly compressed for data streaming, and that the computational and bandwidth requirements for those signals are thus significantly reduced. The accurate generation of a small number of clusters enables the reproduction of an auditory scene in which the listener can correctly perceive the position of each audio object, so that immersive reproduction can be achieved. At the same time, the reduced data-rate requirements resulting from efficient compression allow fidelity to be less compromised on any known playback system, such as speaker arrays and headphones.
Drawings
The foregoing and other objects, features and advantages of the example embodiments disclosed herein will be more readily understood from the following detailed description taken in conjunction with the accompanying drawings. The exemplary embodiments disclosed herein are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings and in which:
fig. 1 illustrates a flow chart of a method of processing an audio signal according to an example embodiment;
fig. 2 illustrates an example flow diagram of object-based audio signal processing according to an example embodiment;
FIG. 3 illustrates a system for processing an audio signal according to an example embodiment; and
FIG. 4 illustrates a block diagram of an example computer system suitable for implementing the example embodiments disclosed herein.
Throughout the drawings, the same or corresponding reference numerals designate the same or corresponding parts.
Detailed Description
The principles of the example embodiments disclosed herein will now be described with reference to various example embodiments shown in the drawings. It should be understood that the description of these embodiments is merely intended to enable those skilled in the art to better understand and further practice the example embodiments disclosed herein, and is not intended to limit the scope in any way.
The object based audio signals are intended to be processed by a system capable of processing audio objects and their corresponding metadata. Information such as location, speed, width, etc. is provided within the metadata. These object based audio signals are typically produced by a mixer in a studio and adapted to be rendered by different systems with appropriate processors. However, since the embodiments disclosed herein focus mainly on how to allocate objects into a reduced number of clusters while maintaining performance in terms of accuracy of spatial audio reproduction, the mixing and rendering process is not specifically described.
Throughout the specification, it may be assumed that the audio signal is divided into individual frames that are subjected to analysis. Such segmentation may be applied to the time-domain waveform, or a filter bank or any other transform domain suitable for the example embodiments disclosed herein may be used.
Fig. 1 illustrates a flow chart of a method 100 of processing an audio signal according to an example embodiment. In step S101, an object position for each of the audio objects is acquired. Object-based audio objects typically contain metadata that provides positional information about the object. Such information is useful for various processing techniques in cases where object-based audio content is to be presented with greater accuracy.
In step S102, cluster positions for grouping the audio objects into clusters are determined based on the object positions, a plurality of object-to-cluster gains, and a set of metrics. The metrics indicate a quality of the determined cluster positions and a quality of the determined object-to-cluster gains. Such quality may be represented by a cost function, for example, as will be described below. A cluster position is the centroid of a cluster, i.e., of a set of audio objects that are close to each other. Initial cluster positions may be selected in different ways, including, for example: selecting cluster positions randomly; applying an initial clustering (e.g., k-means clustering) to the plurality of audio objects to obtain cluster positions; and determining the cluster positions for the current time frame of the audio signal based on the cluster positions of a previous time frame. Each object-to-cluster gain defines the proportion with which an audio object is grouped into a corresponding cluster, and the gains together indicate how the audio objects are distributed over the clusters.
In step S103, the object-to-cluster gains are determined based on the object positions, the cluster positions, and the set of metrics. Each audio object may be assigned an object-to-cluster gain per cluster, which acts as a mixing coefficient. In other words, if the gain of a particular audio object with respect to a cluster is relatively large, the object is likely to be spatially near that cluster. Conversely, a larger gain of an audio object with respect to some clusters implies that the gains of the same audio object with respect to the other clusters will be relatively small.
Steps S102 and S103 show that the determination of the cluster positions is based in part on the object-to-cluster gains, and that the determination of the object-to-cluster gains is based in part on the cluster positions, meaning that the two determination steps are interdependent. The quality of these determinations may be indicated by a value associated with the metrics. In general, the determinations are repeated while the value falls or converges toward a predetermined value, until the quality is sufficiently satisfactory. A predefined threshold may be set for comparison with the value associated with the metrics. As a result, in some embodiments, the determination of the cluster positions and of the object-to-cluster gains is performed alternately until the value is less than the predefined threshold.
Alternatively, another predefined threshold may be set for comparison with the rate of change of the value associated with the metrics. As a result, in some embodiments, the determinations of the cluster positions and of the object-to-cluster gains are repeated until the rate of change of the value associated with the metrics (e.g., its rate of decline) is less than this predefined threshold.
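By way of illustration, the two stopping criteria above can be sketched in a few lines of Python; the function name and threshold values here are illustrative assumptions, not values taken from this disclosure:

```python
def should_stop(history, abs_threshold=1e-4, rate_threshold=1e-6):
    """Stop the alternating determination when the metric value itself is
    small enough, or when it has essentially stopped decreasing.
    Threshold values are illustrative, not taken from the disclosure."""
    current = history[-1]
    if current < abs_threshold:
        return True
    if len(history) >= 2 and abs(history[-2] - current) < rate_threshold:
        return True
    return False

print(should_stop([0.5, 0.2]))           # still improving -> False
print(should_stop([0.5, 0.00005]))       # value below threshold -> True
print(should_stop([0.2, 0.2000000001]))  # decline has stalled -> True
```

Either criterion alone, or both in combination, can serve as the test at the end of each alternation.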
In an embodiment, a cost function may be used to represent the value associated with the metrics, and may thus reflect the quality of the determined cluster positions and of the determined object-to-cluster gains. The calculation of the cost function is explained in detail below.
The cost function comprises several cumulative terms, each accounting for one metric of the clustering process. In one embodiment, the set of metrics may include: (A) a position error between the position of an audio object as reconstructed from the cluster signal and its position in the original audio signal; (B) a distance error between the cluster positions and the object positions; (C) the deviation of the sum of the object-to-cluster gains from unity; (D) a presentation error between presenting the cluster signal to one or more playback systems and presenting the audio objects of the audio signal to those playback systems; and (E) an inter-frame inconsistency of a variable between the current time frame and a previous time frame. Usefully, the cost function compares the signals before and after the clustering process, i.e., before and after the audio objects are grouped into clusters. The cost function can therefore serve as an effective index of the quality of the clustering.
For metric (A), since the input audio objects can be reconstructed from the output clusters, the error between the original object positions and the reconstructed object positions can be used to measure the spatial position difference of the objects, describing how accurately the clustering process preserves the position information.
The term "position error" relates the spatial position of an audio object before the clustering process to its spatial position after its signal has been distributed across the output clusters at positions \vec{p}_c. In particular, when the original position is represented by a vector \vec{p}_o (which may be expressed, for example, in three Cartesian coordinates), the reconstructed position \hat{\vec{p}}_o can be expressed as an amplitude-panned source:

\hat{\vec{p}}_o = \frac{\sum_c g_{o,c}\, \vec{p}_c}{\sum_c g_{o,c}} \qquad (1)

Subsequently, the cost E_P associated with the position error can be expressed as:

E_P = \sum_o w_o \left\| \hat{\vec{p}}_o - \vec{p}_o \right\|^2 \qquad (2)

where w_o represents the weight of the o-th object, which may be the energy, loudness, or partial loudness of the object, and g_{o,c} represents the gain with which the o-th object is rendered to the c-th cluster, i.e., the object-to-cluster gain.
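By way of illustration, equations (1) and (2) can be sketched in NumPy as follows; the normalized amplitude-panning reconstruction follows the form above, and all names are illustrative:

```python
import numpy as np

def reconstructed_positions(G, P_C):
    """Eq. (1): amplitude-panned reconstruction of the object positions.
    G: (O, C) object-to-cluster gains; P_C: (C, 3) cluster positions."""
    num = G @ P_C                       # sum_c g_{o,c} * p_c
    den = G.sum(axis=1, keepdims=True)  # sum_c g_{o,c}
    return num / den

def position_error(G, P_C, P_O, w):
    """Eq. (2): E_P = sum_o w_o * ||p_hat_o - p_o||^2."""
    diff = reconstructed_positions(G, P_C) - P_O
    return float(np.sum(w * np.sum(diff**2, axis=1)))

# An object assigned entirely to a co-located cluster has zero position error.
P_C = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
P_O = np.array([[1.0, 0.0, 0.0]])
G = np.array([[0.0, 1.0]])
w = np.array([1.0])
print(position_error(G, P_C, P_O, w))  # -> 0.0
```

Splitting the same object evenly across the two clusters (gains 0.5 and 0.5) moves the reconstruction to the midpoint and yields E_P = 0.25.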
For metric (B), since rendering an audio object into clusters that are far from it may introduce large timbre changes, the object-to-cluster distance may be used to measure timbre change. A change in timbre is expected when an audio object is represented not by a point source (a single cluster) but by a virtual (phantom) source panned across several clusters. It is well known that amplitude-panned sources may have a different timbre than point sources, due to the comb-filter interactions that can occur when one and the same signal is reproduced by two or more (virtual) loudspeakers.
The term "distance error", denoted E_D, reflects, via the distance between the object position \vec{p}_o and the cluster positions \vec{p}_c, the increase in cost incurred if an audio object is represented by a cluster that is far from the original object position:

E_D = \sum_o w_o \sum_c g_{o,c}^2 \left\| \vec{p}_c - \vec{p}_o \right\|^2 \qquad (3)
for metric (C), the object-to-cluster gain normalization error can be used to measure the energy (loudness) change before and after the clustering process.
The term "deviation", denoted E_N, relates to gain normalization, or more specifically, to the deviation from unity (one) of the sum of an object's gains over the cluster centroids:

E_N = \sum_o w_o \left( 1 - \sum_c g_{o,c} \right)^2 \qquad (4)
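By way of illustration, the distance error and the gain-normalization deviation can be sketched in NumPy; the squared-gain weighting in the distance error is an assumption for this sketch, since the corresponding equation appears only as an image in the original:

```python
import numpy as np

def distance_error(G, P_C, P_O, w):
    """Distance-error term: gain-weighted object-to-cluster distances.
    The squared-gain weighting is an assumption made for this sketch."""
    # d2[o, c] = ||p_c - p_o||^2
    d2 = np.sum((P_C[None, :, :] - P_O[:, None, :])**2, axis=2)
    return float(np.sum(w * np.sum(G**2 * d2, axis=1)))

def normalization_error(G, w):
    """Eq. (4): deviation of each object's gain sum from unity."""
    return float(np.sum(w * (1.0 - G.sum(axis=1))**2))

P_C = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
P_O = np.array([[1.0, 0.0, 0.0]])
w = np.array([1.0])
G = np.array([[0.0, 1.0]])  # fully assigned to the co-located cluster
print(distance_error(G, P_C, P_O, w))  # -> 0.0
print(normalization_error(G, w))       # -> 0.0
```

Splitting the object evenly across the two clusters keeps the normalization error at zero but incurs a nonzero distance error, illustrating the trade-off between the terms.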
for metric (D), one or more reference playback systems for that metric (e.g., mono quality on a 7.1.4 speaker playback system) may need to be specified since there are different rendering outputs for different playback systems. By comparing the difference between the rendered output of the original object and the rendered output of the cluster on a specified reference playback system, the mono quality of the clustered results can be measured.
The term "presentation error", denoted E_R, is an error defined with respect to a reference playback system, which may be binaural, 5.1, 7.1.4, 9.1.6, etc.; it measures the difference between rendering the original objects to the reference playback system and rendering the clusters to that system:

E_R = \sum_o w_o \sum_s n_s \left( \sum_c g_{o,c}\, g_{c,s} - g_{o,s} \right)^2 \qquad (5)

where the per-channel normalization n_s may take the form

n_s = \frac{1}{\sum_o w_o\, g_{o,s}^2 + a} \qquad (6)

Here g_{o,s} represents the gain for rendering the o-th object to the s-th output channel, g_{c,s} represents the gain for rendering the c-th cluster to the s-th output channel, and n_s normalizes the presentation differences so that the presentation errors on each channel are comparable. The parameter a avoids introducing too large presentation differences when the signal on the reference playback system is very small or even zero.

In one embodiment, the summation over the loudspeakers with subscript s may be performed over the loudspeakers of one particular predetermined loudspeaker layout. Alternatively, clusters and objects are presented simultaneously to a larger set of speakers covering a variety of speaker layouts. For example, if one layout is a 5-channel layout and a second layout is a two-channel layout, both clusters and objects can be rendered in parallel to the 5-channel and the two-channel layout. The error term E_R is then estimated over all 7 loudspeakers to jointly optimize the error for both loudspeaker layouts simultaneously.
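By way of illustration, the presentation error over such a stacked joint layout can be sketched in NumPy as follows; the per-channel normalization n_s is supplied by the caller, and all names are illustrative:

```python
import numpy as np

def rendering_error(G_OC, G_CS, G_OS, w, n_s):
    """Compare direct object-to-speaker rendering with object-via-cluster
    rendering on a reference layout. n_s is a per-channel normalization
    supplied by the caller (its exact form is an assumption here)."""
    diff = G_OC @ G_CS - G_OS  # (O, S) rendering difference per channel
    return float(np.sum(w[:, None] * n_s[None, :] * diff**2))

# Joint optimization over two layouts: stack a 5-channel and a 2-channel
# layout into a single 7-column speaker-gain matrix, as described above.
G_OC = np.array([[1.0, 0.0]])                                   # 1 object, 2 clusters
G_CS = np.hstack([np.full((2, 5), 0.2), np.full((2, 2), 0.5)])  # (2, 7)
G_OS = np.hstack([np.full((1, 5), 0.2), np.full((1, 2), 0.5)])  # (1, 7)
w = np.array([1.0])
n_s = np.ones(7)
print(rendering_error(G_OC, G_CS, G_OS, w, n_s))  # -> 0.0
```

Because the object is assigned entirely to a cluster whose speaker gains match its own, the rendering difference, and hence the error, is zero on all seven channels.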
For metric (E), since the clustering process is performed frame by frame, the inter-frame inconsistency of certain variables of the clustering process, such as the object-to-cluster gains, the cluster positions, and the reconstructed object positions, can be used to measure temporal consistency. In one embodiment, the inter-frame inconsistency of the reconstructed object positions may be used to measure the temporal smoothness of the clustering results.
The term "inter-frame inconsistency", denoted E_C, relates to the frame-to-frame inconsistency of a particular variable, here the reconstructed object position. Suppose \vec{p}_o^{(t)} and \vec{p}_o^{(t-1)} are the original object positions in frames t and t-1, \hat{\vec{p}}_o^{(t)} and \hat{\vec{p}}_o^{(t-1)} are the reconstructed object positions in frames t and t-1, and \tilde{\vec{p}}_o^{(t)} is the target reconstructed object position in frame t. As defined by equation (1) above, the reconstructed position \hat{\vec{p}}_o can be expressed as an amplitude-panned source.

To preserve inter-frame smoothness, the target reconstructed object position in frame t may be expressed as the combination of the reconstructed object position in frame t-1 and the offset \Delta_o = \vec{p}_o^{(t)} - \vec{p}_o^{(t-1)} of the object from frame t-1 to frame t:

\tilde{\vec{p}}_o^{(t)} = \hat{\vec{p}}_o^{(t-1)} + \Delta_o \qquad (7)

Subsequently, the cost E_C associated with the inter-frame inconsistency can be expressed as:

E_C = \sum_o w_o \left\| \hat{\vec{p}}_o^{(t)} - \tilde{\vec{p}}_o^{(t)} \right\|^2 \qquad (8)
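By way of illustration, equations (7) and (8) can be sketched in NumPy as follows (all names are illustrative):

```python
import numpy as np

def target_positions(P_hat_prev, P_O_cur, P_O_prev):
    """Eq. (7): previous reconstruction plus the object's own movement."""
    return P_hat_prev + (P_O_cur - P_O_prev)

def interframe_cost(P_hat_cur, P_hat_prev, P_O_cur, P_O_prev, w):
    """Eq. (8): weighted squared deviation from the smooth target."""
    diff = P_hat_cur - target_positions(P_hat_prev, P_O_cur, P_O_prev)
    return float(np.sum(w * np.sum(diff**2, axis=1)))

# A static object whose reconstruction also stays put incurs zero cost.
P_O_prev = P_O_cur = np.array([[0.3, 0.4, 0.0]])
P_hat_prev = P_hat_cur = np.array([[0.25, 0.4, 0.0]])
w = np.array([1.0])
print(interframe_cost(P_hat_cur, P_hat_prev, P_O_cur, P_O_prev, w))  # -> 0.0
```

Note that the target tracks the object's own movement, so a reconstruction that moves together with a moving object also incurs zero cost; only movement unexplained by the object itself is penalized.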
the above metrics may be measured independently or as an overall cost that is a combination of the above described metrics. In one embodiment, the overall cost may be a weighted sum of cost terms (a) through (E):
E=αPEPDEDNENRERCEC(9)
in another embodiment, the total cost may also be the maximum of the cost term:
E=max{αPEP,αDED,αNEN,αRER,αCEC} (10)
α thereinP,αD,αN,αR,αCWeights for cost terms (A) through (E) are expressed.
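By way of illustration, the two combination rules (9) and (10) can be sketched as follows (metric names and weight values are illustrative):

```python
def overall_cost(costs, alphas, mode="sum"):
    """Combine per-metric costs as a weighted sum, eq. (9), or as the
    maximum of the weighted terms, eq. (10)."""
    weighted = [alphas[k] * costs[k] for k in costs]
    return sum(weighted) if mode == "sum" else max(weighted)

# Illustrative values only.
costs = {"P": 0.25, "D": 0.5, "N": 0.0, "R": 1.0, "C": 0.125}
alphas = {"P": 1.0, "D": 0.5, "N": 2.0, "R": 1.0, "C": 4.0}
print(overall_cost(costs, alphas, "sum"))  # -> 2.0
print(overall_cost(costs, alphas, "max"))  # -> 1.0
```

The weighted sum trades the terms off against each other, while the maximum variant forces the worst-behaved term to dominate the optimization.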
The gains g_{o,c} and the positions \vec{p}_o, \hat{\vec{p}}_o, and \vec{p}_c can be written as matrices (equations (11) through (14)): an O \times C gain matrix G_{OC} = [g_{o,c}], an O \times 3 matrix P_O of original object positions, an O \times 3 matrix \hat{P}_O of reconstructed object positions, and a C \times 3 matrix P_C of cluster positions. The object weights can be written as a diagonal matrix:

W_O = \mathrm{diag}(w_1, \ldots, w_O) \qquad (15)

Subsequently, the different cost function terms can be written as follows:

E_P = \mathrm{tr}\{ (H^{-1} G_{OC} P_C - P_O)^T W_O (H^{-1} G_{OC} P_C - P_O) \} \qquad (16)

where H = \mathrm{diag}(G_{OC} \mathbf{1}_{C \times O}) and \mathrm{diag}() denotes the operation of taking the diagonal matrix. \mathbf{1}_C represents an all-ones vector with C \times 1 elements, i.e., a vector of length C all of whose coefficients equal +1, and \mathbf{1}_{C \times O} represents an all-ones matrix with C \times O elements.

The distance-error term E_D is written analogously (equation (17)) in terms of W_O and a diagonal matrix \Lambda_O whose diagonal elements collect the gain-weighted object-to-cluster distances of equation (3).

E_N = \mathrm{tr}\{ (G_{OC} \mathbf{1}_C - \mathbf{1}_O)^T W_O (G_{OC} \mathbf{1}_C - \mathbf{1}_O) \} \qquad (18)

E_R = \mathrm{tr}\{ (G_{OC} G_{CS} - G_{OS}) N_S (G_{OC} G_{CS} - G_{OS})^T W_O \} \qquad (19)

where N_S represents the diagonal matrix with diagonal elements n_s, G_{OS} represents the matrix whose rows contain the gains for rendering the o-th object to the reference loudspeakers, and G_{CS} represents the matrix containing the cluster-to-speaker gains.

E_C = \mathrm{tr}\{ (\hat{P}_O^{(t)} - \tilde{P}_O^{(t)})^T W_O (\hat{P}_O^{(t)} - \tilde{P}_O^{(t)}) \} \qquad (20)

With the terms defined above, the details of the determination process are given in the following description.
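As a numerical consistency check, the matrix form of E_P can be verified to agree with the element-wise form (2), under the H-normalization defined above (an illustrative sketch, not part of the disclosure):

```python
import numpy as np

rng = np.random.default_rng(1)
O, C = 5, 3
G = rng.random((O, C)) + 0.1   # positive object-to-cluster gains
P_C = rng.random((C, 3))       # cluster positions
P_O = rng.random((O, 3))       # object positions
w = rng.random(O) + 0.1        # object weights

# Element-wise form, eq. (2), with the normalized reconstruction (1).
P_hat = (G @ P_C) / G.sum(axis=1, keepdims=True)
ep_sum = float(np.sum(w * np.sum((P_hat - P_O)**2, axis=1)))

# Matrix form, eq. (16): tr{(H^-1 G P_C - P_O)^T W_O (H^-1 G P_C - P_O)},
# with H = diag of the row sums of G.
H = np.diag(G.sum(axis=1))
W_O = np.diag(w)
D = np.linalg.inv(H) @ G @ P_C - P_O
ep_tr = float(np.trace(D.T @ W_O @ D))

print(abs(ep_sum - ep_tr) < 1e-12)  # -> True
```

The two forms coincide because H^{-1} G_{OC} P_C is exactly the per-object normalized reconstruction, and the trace of D^T W_O D sums the weighted squared row norms of D.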
Returning to fig. 1, in step S104, a cluster signal to be rendered is generated based on the cluster position and the object-to-cluster gain determined in steps S102 and S103. The generated cluster signal typically has a number of clusters that is much smaller than the number of audio objects contained in the audio content or audio signal, such that the demand on computational resources for rendering the auditory scene is significantly reduced.
Fig. 2 illustrates an example flow 200 of object-based audio signal processing according to an example embodiment.
Block 210 represents the creation of the audio objects, audio beds, and metadata within the audio content to be processed according to an example embodiment. Block 220 performs the clustering process that groups the plurality of audio objects into a relatively small number of clusters. At block 230, the cluster signals are output together with the newly generated metadata for rendering by block 240, which represents the renderer of a particular audio playback system. In other words, FIG. 2 shows an overview of an ecosystem involving creation 210, clustering 220, distribution 230, and rendering 240. After clustering, the cluster signals and metadata may be distributed to multiple renderers aimed at different speaker playback settings or headphone reproduction.
It may be assumed that the audio content is represented by beds (static objects, or conventional soundtracks) and (dynamic) objects. The objects comprise audio signals and associated metadata indicating spatial rendering information as a function of time. To reduce the data rate of the many beds and objects, clustering is applied with the beds and objects as inputs, and a smaller set of objects (referred to as clusters) is generated to represent the original content in a data-efficient manner.
The clustering process typically comprises both determining a set of cluster positions and aggregating (or rendering) the objects into the clusters. These two processes are closely interdependent, since the rendering of objects into clusters depends on the cluster positions, while the overall presentation quality depends on both the cluster positions and the object-to-cluster gains. It is therefore desirable to optimize the cluster positions and the object-to-cluster gains jointly.
In one embodiment, the optimized object-to-cluster gains and cluster positions may be obtained by minimizing the cost function described above. However, since there is no closed-form solution that yields the optimal object-to-cluster gains and cluster positions together, one example approach is to use an EM-like (expectation-maximization) iterative process to determine them in turn. In the E-step, given the cluster positions P_C, the object-to-cluster gains G_OC are determined by minimizing the cost function; in the M-step, given the object-to-cluster gains G_OC, the cluster positions P_C are determined by minimizing the cost function. A stopping criterion is used to decide whether to continue or stop the iteration.
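By way of illustration, the EM-like alternation can be sketched with a deliberately simplified, unnormalized position-only cost whose E-step and M-step both reduce to least-squares problems; this toy cost stands in for the full cost function E and is not the closed-form solution of the disclosure:

```python
import numpy as np

def position_cost(G, P_C, P_O, w):
    """Simplified, unnormalized position-only cost standing in for E."""
    diff = G @ P_C - P_O
    return float(np.sum(w * np.sum(diff**2, axis=1)))

def gains_step(P_O, P_C):
    """E-step stand-in: per-object least-squares gains for fixed P_C."""
    return np.linalg.lstsq(P_C.T, P_O.T, rcond=None)[0].T

def positions_step(P_O, G, w):
    """M-step stand-in: weighted least-squares positions for fixed gains."""
    ws = np.sqrt(w)[:, None]
    return np.linalg.lstsq(ws * G, ws * P_O, rcond=None)[0]

def iterate_clustering(P_O, w, P_C, eps=1e-9, max_iter=20):
    """Alternate the two steps until the decline of the cost falls below
    eps (the role of the stopping criterion at block 223)."""
    G = gains_step(P_O, P_C)
    prev = cur = position_cost(G, P_C, P_O, w)
    for _ in range(max_iter):
        G = gains_step(P_O, P_C)
        P_C = positions_step(P_O, G, w)
        cur = position_cost(G, P_C, P_O, w)
        if prev - cur < eps:
            break
        prev = cur
    return G, P_C, cur

# Four coplanar objects are represented exactly by two clusters.
P_O = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
w = np.ones(4)
P_C0 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
G, P_C, cost = iterate_clustering(P_O, w, P_C0)
print(cost < 1e-12)  # -> True: the objects lie in the span of two clusters
```

Each step can only decrease this cost, so the alternation converges monotonically, mirroring the behavior expected from the E-step/M-step structure described above.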
Given the cluster positions P_C, the object-to-cluster gains G_{OC} that achieve the minimum of the cost function E may be obtained at block 222 in fig. 2 by solving:

\frac{\partial E}{\partial G_{OC}} = 0 \qquad (21)

where the contributions of metrics (A) through (E) to this derivative are given by equations (22a) through (22e) (reproduced only as images in the original publication). Solving the above equation yields the object-to-cluster gain matrix in closed form (equations (23) and (24), likewise shown only as images), in which the position, distance, and inconsistency metrics contribute no linear terms:

B_P = 0, \quad B_D = 0, \quad B_C = 0

while the normalization and presentation metrics contribute nonzero linear terms B_N and B_R, and the quadratic terms A_P, A_D, A_N, A_R, A_C include, for example,

A_R = w_o \left( 2\, G_{CS} N_S G_{CS}^T \right)

from the presentation metric; the remaining terms appear only as equation images in the original publication.
it follows that the object-to-cluster gain can be determined based on cluster position.
Given the object-to-cluster gains G_{OC}, a local minimum of the cost function E and the corresponding optimal cluster positions P_C may be obtained at block 221 in fig. 2 by solving:

\frac{\partial E}{\partial P_C} = 0 \qquad (25)

However, since there is no closed-form solution to the above equation, the optimal cluster positions P_C are obtained by the gradient descent method:

P_C^{(i+1)} = P_C^{(i)} - \sigma\, \frac{\partial E}{\partial P_C}\bigg|_{P_C = P_C^{(i)}} \qquad (26)

where i denotes the iteration number of the gradient descent and \sigma denotes the learning step. For metrics (A), (B), and (C), the gradient of each cost term can be derived in closed form (equations (27) through (32), shown only as images in the original publication), where \mathrm{tr}\{\} represents the matrix trace function, i.e., the sum of the diagonal elements of a matrix.
In these expressions, p_{cx} denotes the position of the c-th output cluster (c = 1, \ldots, C) along the x-axis of a three-dimensional Cartesian coordinate system, p_{cy} its position along the y-axis, and p_{cz} its position along the z-axis. For metric (D), the gradients (equations (33) and (34), shown only as images) follow via the chain rule through the rendering gains, where g_{c,s} represents the gain for rendering the c-th cluster to the s-th channel of the reference playback system, and the remaining factors are the gradients of the rendering gains with respect to the cluster positions.

For example, for a standard Atmos renderer, the gain may be calculated as a separable product:

g_{c,s}(p_{cx}, p_{cy}, p_{cz}) = f_{sx}(p_{cx})\, f_{sy}(p_{cy})\, f_{sz}(p_{cz}) \qquad (35)

where f_{sx}(), f_{sy}(), and f_{sz}() represent the gain functions of the Atmos renderer on the s-th channel with respect to the x, y, and z positions, respectively. The gradient for metric (E) follows analogously (equation (36)).
it follows that cluster positions can be determined based on object-to-cluster gains.
There are many ways to initialize the cluster positions for this iterative process. For example, random initialization or k-means-based initialization may be used to initialize the cluster positions for each processing frame. However, to avoid convergence to different local minima in adjacent frames, the cluster positions obtained for the previous frame may be used to initialize the cluster positions of the current frame. Furthermore, a hybrid method, for example one that selects the cluster positions with the smallest cost among several different initialization methods, may be applied to initialize the determination process.
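By way of illustration, the initialization strategies mentioned above (random, k-means-based, and warm-starting from the previous frame) can be sketched as follows; the plain k-means here is a minimal stand-in:

```python
import numpy as np

def init_random(P_O, C, rng):
    """Random initialization: pick C distinct object positions as centroids."""
    idx = rng.choice(len(P_O), size=C, replace=False)
    return P_O[idx].astype(float).copy()

def init_kmeans(P_O, C, rng, iters=10):
    """Minimal k-means on the object positions, standing in for the
    k-means-based initialization mentioned above."""
    P_C = init_random(P_O, C, rng)
    for _ in range(iters):
        d2 = np.sum((P_O[:, None, :] - P_C[None, :, :])**2, axis=2)
        labels = np.argmin(d2, axis=1)  # nearest centroid per object
        for c in range(C):
            members = P_O[labels == c]
            if len(members):            # guard against empty clusters
                P_C[c] = members.mean(axis=0)
    return P_C

def init_from_previous(P_C_prev):
    """Warm start from the previous frame's cluster positions, to avoid
    converging to different local minima in adjacent frames."""
    return P_C_prev.copy()

rng = np.random.default_rng(0)
P_O = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                [1.0, 1.0, 0.0], [0.9, 1.0, 0.0]])
P_C = init_kmeans(P_O, 2, rng)
print(sorted(np.round(P_C[:, 0], 2)))  # two x-centroids near 0.05 and 0.95
```

A hybrid scheme would simply run several of these initializers, evaluate the cost function for each, and keep the cheapest starting point.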
After either of the steps represented by blocks 221 and 222 is performed, the cost function is evaluated at block 223 to test whether its value is small enough to stop the iteration. The iteration is stopped when the value of the cost function is smaller than a predefined threshold, or when the rate of decline of the cost function value becomes very small. The predefined threshold may be set manually in advance by the user. In another embodiment, the steps represented by blocks 221 and 222 may be performed alternately until the value of the cost function, or its rate of change, falls below a predefined threshold. In some use cases, it may be sufficient to perform the steps represented by blocks 221 and 222 in fig. 2 only a predetermined number of times, rather than until the overall error reaches a threshold.
It is to be understood that the EM iterative method described above is merely an example embodiment, and that other rules may be applied to jointly estimate cluster position and object-to-cluster gain.
This iterative determination process ensures that the clusters are generated with improved accuracy, so that an immersive reproduction of the audio content can be achieved. At the same time, the reduced data-rate requirements resulting from efficient compression allow fidelity to be less compromised on any known playback system, such as speaker arrays and headphones.
Fig. 3 illustrates a system 300 for processing an audio signal comprising a plurality of audio objects according to an example embodiment. As shown, the system 300 includes an object position acquisition unit 301 configured to acquire an object position for each audio object; and a cluster position determination unit 302 configured to determine cluster positions for grouping the audio objects into clusters based on the object positions, the plurality of object-to-cluster gains, and the set of metrics. The metrics indicate a quality of the cluster positions and a quality of the object-to-cluster gains, each of the cluster positions being a centroid of a respective one of the clusters, and each of the object-to-cluster gains defining a ratio of the respective audio object in the corresponding cluster. The system 300 further comprises an object-to-cluster gain determination unit 303 configured to determine the object-to-cluster gains based on the object positions, the cluster positions, and the set of metrics; and a cluster signal generation unit 304 configured to generate a cluster signal to be presented based on the determined cluster positions and the object-to-cluster gains.
In an example embodiment, the system 300 may further comprise an alternation determination unit configured to alternately perform the determination of the cluster position and the determination of the object-to-cluster gain until a predetermined condition is satisfied. In a further embodiment, the predetermined condition may comprise at least one of: the value associated with the metric is less than a predefined threshold and a rate of change of the value associated with the metric is less than another predefined threshold.
In another example embodiment, the metrics may include at least one of: a position error between a position of the audio object reconstructed in the cluster signal and the object position; a distance error between the cluster position and the object position; a deviation of a sum of object-to-cluster gains from one; a presentation error between presenting the cluster signal to the one or more playback systems and presenting the audio signal to the one or more playback systems; and inter-frame inconsistency of the variable between the current time frame and the previous time frame. In a further example embodiment, the variable may comprise at least one of an object-to-cluster gain, a cluster position, and a position of the reconstructed audio object. Alternatively, the alternation determination unit may be further configured to alternatingly perform the determination of the cluster position and the determination of the object-to-cluster gain based on a weighted combination of the set of metrics.
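As a hedged illustration, a weighted combination of several of these metrics might look as follows in Python. The specific metric formulas and the default weights are assumptions for this sketch; the reconstructed object positions are taken as the gain-weighted sum of the cluster positions:

```python
import numpy as np

def clustering_cost(obj_pos, cluster_pos, gains, prev_gains=None,
                    weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of four of the metrics listed in the text:
    position error, object-to-cluster distance error, deviation of the
    gain sums from one, and inter-frame inconsistency of the gains."""
    w_pos, w_dist, w_norm, w_frame = weights
    # Position error between reconstructed and original object positions.
    recon = gains @ cluster_pos
    pos_err = ((obj_pos - recon) ** 2).sum()
    # Object-to-cluster squared distances, weighted by the squared gains.
    d2 = ((obj_pos[:, None, :] - cluster_pos[None, :, :]) ** 2).sum(-1)
    dist_err = (gains ** 2 * d2).sum()
    # Deviation of each object's gain sum from one.
    norm_err = ((gains.sum(axis=1) - 1.0) ** 2).sum()
    # Inter-frame inconsistency of the gains (zero if no previous frame).
    frame_err = 0.0 if prev_gains is None else ((gains - prev_gains) ** 2).sum()
    return (w_pos * pos_err + w_dist * dist_err
            + w_norm * norm_err + w_frame * frame_err)
```

The per-metric weights correspond to the weighted combination of the set of metrics mentioned in the text; tuning them trades off, for example, spatial accuracy against temporal smoothness.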
In yet another example embodiment, the system 300 may further include a cluster position initialization unit configured to initialize the cluster positions based on at least one of: randomly selecting the cluster positions; applying an initial clustering on the plurality of audio objects to obtain the cluster positions; and determining the cluster positions for a current time frame of the audio signal based on the cluster positions for a previous time frame of the audio signal.
For clarity, some optional components of the system 300 are not shown in fig. 3. It should be understood, however, that the features described above with reference to figs. 1-2 are all applicable to the system 300. Furthermore, the components of the system 300 may be hardware modules or software modules. For example, in some embodiments, the system 300 may be partially or completely implemented in software and/or firmware, e.g., as a computer program product embodied in a computer-readable medium. Alternatively or additionally, the system 300 may be partially or completely implemented in hardware, e.g., as an integrated circuit (IC), an application-specific integrated circuit (ASIC), a system on a chip (SOC), a field-programmable gate array (FPGA), or the like. The scope of the invention is not limited in this respect.
FIG. 4 illustrates a block diagram of an example computer system 400 suitable for implementing example embodiments disclosed herein. As shown, the computer system 400 includes a central processing unit (CPU) 401 capable of executing various processes in accordance with a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores, as needed, the data required when the CPU 401 executes the various processes. The CPU 401, the ROM 402, and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read therefrom is installed into the storage section 408 as needed.
In particular, according to example embodiments disclosed herein, the processes described above with reference to fig. 1 to 2 may be implemented as computer software programs. For example, example embodiments disclosed herein include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method 100. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411.
In general, the various example embodiments disclosed herein may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the example embodiments disclosed herein are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Also, blocks in the flow diagrams may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements understood to perform the associated functions. For example, example embodiments disclosed herein include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code configured to perform the method described above.
In the context of this disclosure, a machine-readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More detailed examples of a machine-readable storage medium include an electrical connection with one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical storage device, a magnetic storage device, or any suitable combination thereof.
Computer program code for carrying out methods of the present invention may be written in one or more programming languages. These computer program codes may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the computer or other programmable data processing apparatus, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. The program code may execute entirely on the computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed between one or more remote computers or servers.
Additionally, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking or parallel processing may be advantageous. Likewise, while the above discussion contains certain specific implementation details, these should not be construed as limiting the scope of any invention or claims, but rather as describing particular embodiments that may be directed to particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Various modifications, adaptations, and other embodiments of the present invention will become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings. Any and all modifications will still fall within the scope of the non-limiting and exemplary embodiments of this invention. Moreover, the foregoing description and drawings provide instructive benefits, and other example embodiments set forth herein will occur to those skilled in the art to which such embodiments pertain.
Accordingly, example embodiments disclosed herein may be embodied in any of the forms described herein. For example, the following enumerated example embodiments (EEEs) describe some of the structure, features, and functionality of some aspects of the present invention.
EEE 1. A method of processing object-based audio data, comprising:
determining a cost function based on a plurality of metrics for combining a first plurality of audio objects into a second plurality of audio objects; and
combining the first plurality of audio objects into the second plurality of audio objects by jointly optimizing spatial positions and rendering gains of the second plurality of audio objects to minimize the cost function.
EEE 2. The method according to EEE 1, wherein the plurality of metrics comprises at least one of:
Spatial representation
Timbre preservation
Loudness preservation
Mono quality
Temporal smoothness
EEE 3. The method according to EEE 2, wherein the spatial representation can be measured by the position error of the reconstructed objects.
EEE 4. The method according to EEE 2, wherein the timbre preservation can be measured by the object-to-cluster distance.
EEE 5. The method according to EEE 2, wherein the loudness preservation can be measured by the object-to-cluster gain normalization error.
EEE 6. The method according to EEE 2, wherein the mono quality can be measured by a rendering error on one or more predefined reference playback systems.
EEE 7. The method according to EEE 2, wherein the temporal smoothness can be measured by an inter-frame inconsistency of at least one variable in the clustering result.
EEE 8. The method according to EEE 7, wherein the variable may be an object-to-cluster gain, a cluster position, or a reconstructed object position.
EEE 9. The method according to EEE 1, wherein the cost function may be a combination of cost terms based on the plurality of metrics.
EEE 10. The method according to EEE 9, wherein different weights are applied to the cost terms of the plurality of metrics.
EEE 11. The method according to EEE 10, wherein the different weights are determined in response to human input.
EEE 12. The method according to EEE 11, wherein an EM-like iterative optimization method can be used to minimize the cost function.
EEE 13. The method according to any of the preceding EEEs, wherein one or more reference speaker settings are determined by human input.
EEE 14. The method according to any of the preceding EEEs, wherein the reference renderer may be any one of a speaker renderer or a headphone renderer.

Claims (13)

1. A method of processing an audio signal comprising a plurality of audio objects, comprising:
obtaining an object position for each of the audio objects;
determining cluster positions for grouping the audio objects into clusters based on the object positions and a set of metrics indicative of a quality of the cluster positions and a quality of the object-to-cluster gains, given a plurality of object-to-cluster gains, each of the cluster positions being a centroid of a respective one of the clusters, and the plurality of object-to-cluster gains being indicative for each of the audio objects for determining a reconstructed object position of the audio object from the cluster positions of the clusters;
determining the plurality of object-to-cluster gains based on the set of object positions and metrics given the cluster position, wherein the steps of determining cluster positions and determining object-to-cluster gains are interdependent and are part of an iterative process until a predetermined condition is satisfied; and
generating a cluster signal based on the determined cluster position and the object-to-cluster gain;
wherein the metric comprises at least one of:
a position error between a position of an audio object reconstructed in the cluster signal and the object position;
a distance error between the cluster location and the object location;
a deviation of a sum of the object-to-cluster gains from a value of 1;
a presentation error between presenting the cluster signal to one or more playback systems and presenting the audio signal to the one or more playback systems; and
an inter-frame inconsistency of the variable between a current time frame and a previous time frame;
wherein the variable comprises at least one of the object-to-cluster gain, the cluster position, and the position of the reconstructed audio object.
2. The method of claim 1, further comprising:
alternately performing the determination of the cluster position and the determination of the object-to-cluster gain until the predetermined condition is satisfied.
3. The method of claim 2, wherein the predetermined condition comprises at least one of:
a value associated with the metric being less than a predefined threshold, and
a rate of change of the value associated with the metric being less than another predefined threshold.
4. The method of claim 2, wherein the alternately performing the determination of the cluster location and the determination of the object-to-cluster gain is based on a weighted combination of the set of metrics.
5. The method of any of claims 1 to 3, further comprising:
initializing the cluster location based on at least one of:
randomly selecting the cluster location;
applying an initial clustering on the plurality of audio objects to obtain the cluster position; and
determining the cluster position for a current time frame of the audio signal based on the cluster position for a previous time frame of the audio signal.
6. The method of claim 1, wherein
A large object-to-cluster gain for an audio object relative to a cluster indicates that the audio object is in the vicinity of the cluster, and vice versa;
an object-to-cluster gain for the audio object relative to a cluster having a cluster position represents a gain for rendering the audio object to the cluster position of the cluster; and/or
The plurality of object-to-cluster gains comprises an object-to-cluster gain for each audio object of the plurality of audio objects relative to each cluster of the clusters.
7. A method of processing an audio signal comprising a plurality of audio objects, comprising:
obtaining an object position for each of the audio objects;
determining cluster positions for grouping the audio objects into clusters based on the object positions and a set of metrics indicative of a quality of the cluster positions and a quality of the object-to-cluster gains, given a plurality of object-to-cluster gains, each of the cluster positions being a centroid of a respective one of the clusters, and the plurality of object-to-cluster gains being indicative for each of the audio objects for determining a reconstructed object position of the audio object from the cluster positions of the clusters;
determining the plurality of object-to-cluster gains based on the object locations and the set of metrics given the cluster locations, wherein the steps of determining cluster locations and determining object-to-cluster gains are interdependent and are part of an iterative process until a predetermined condition is satisfied; and
generating a cluster signal based on the determined cluster position and the object-to-cluster gain;
wherein
p̂_c is a vector representing the cluster position of the c-th cluster;
g_{o,c} is the object-to-cluster gain of the o-th object relative to the c-th cluster; and
p̃_o is a vector representing the reconstructed object position of the o-th object, where
p̃_o = Σ_c g_{o,c} p̂_c.
8. A system for processing an audio signal comprising a plurality of audio objects, comprising:
an object position acquisition unit configured to acquire an object position for each of the audio objects;
a cluster position determination unit configured to: determining cluster positions for grouping the audio objects into clusters based on the object positions and a set of metrics indicative of a quality of the cluster positions and a quality of the object-to-cluster gains, given a plurality of object-to-cluster gains, each of the cluster positions being a centroid of a respective one of the clusters, and the plurality of object-to-cluster gains being indicative for each of the audio objects for determining a reconstructed object position of the audio object from the cluster positions of the clusters;
an object-to-cluster gain determination unit configured to determine the plurality of object-to-cluster gains based on the object locations and the set of metrics given the cluster locations, wherein the steps of determining cluster locations and determining object-to-cluster gains are interdependent and are part of an iterative process until a predetermined condition is satisfied; and
a cluster signal generation unit configured to generate a cluster signal based on the determined cluster position and the object-to-cluster gain;
wherein the metric comprises at least one of:
a position error between a position of an audio object reconstructed in the cluster signal and the object position;
a distance error between the cluster location and the object location;
a deviation of a sum of the object-to-cluster gains from a value of 1;
a presentation error between presenting the cluster signal to one or more playback systems and presenting the audio signal to the one or more playback systems; and
an inter-frame inconsistency of the variable between a current time frame and a previous time frame;
wherein the variable comprises at least one of the object-to-cluster gain, the cluster position, and the position of the reconstructed audio object.
9. The system of claim 8, further comprising:
an alternation determination unit configured to alternately perform the determination of the cluster position and the determination of the object-to-cluster gain until the predetermined condition is satisfied.
10. The system of claim 9, wherein the predetermined condition comprises at least one of:
a value associated with the metric being less than a predefined threshold, and
a rate of change of the value associated with the metric being less than another predefined threshold.
11. The system of claim 9, wherein the alternation determination unit is further configured to alternately perform the determination of the cluster position and the determination of the object-to-cluster gain based on a weighted combination of the set of metrics.
12. The system of any of claims 8 to 10, further comprising:
a cluster position initialization unit configured to initialize the cluster position based on at least one of:
randomly selecting the cluster location;
applying an initial clustering on the plurality of audio objects to obtain the cluster position; and
determining the cluster position for a current time frame of the audio signal based on the cluster position for a previous time frame of the audio signal.
13. A computer-readable medium storing a computer program executable by a processor to implement the steps of the method according to any one of claims 1 to 6.
CN201510484949.8A 2015-08-07 2015-08-07 Processing object-based audio signals Active CN106385660B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201510484949.8A CN106385660B (en) 2015-08-07 2015-08-07 Processing object-based audio signals
PCT/US2016/045512 WO2017027308A1 (en) 2015-08-07 2016-08-04 Processing object-based audio signals
US15/749,750 US10277997B2 (en) 2015-08-07 2016-08-04 Processing object-based audio signals
EP16751763.0A EP3332557B1 (en) 2015-08-07 2016-08-04 Processing object-based audio signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510484949.8A CN106385660B (en) 2015-08-07 2015-08-07 Processing object-based audio signals

Publications (2)

Publication Number Publication Date
CN106385660A CN106385660A (en) 2017-02-08
CN106385660B true CN106385660B (en) 2020-10-16

Family

ID=57916386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510484949.8A Active CN106385660B (en) 2015-08-07 2015-08-07 Processing object-based audio signals

Country Status (1)

Country Link
CN (1) CN106385660B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102506167B1 (en) * 2017-04-25 2023-03-07 소니그룹주식회사 Signal processing device and method, and program
CN110166927B (en) * 2019-05-13 2020-05-12 武汉大学 Virtual sound image reconstruction method based on positioning correction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361405A (en) * 2006-01-03 2009-02-04 Slh音箱公司 Method and system for equalizing a loudspeaker in a room
CN103593430A (en) * 2013-11-11 2014-02-19 胡宝清 Clustering method based on mobile object spatiotemporal information trajectory subsections
WO2015017037A1 (en) * 2013-07-30 2015-02-05 Dolby International Ab Panning of audio objects to arbitrary speaker layouts
WO2015105748A1 (en) * 2014-01-09 2015-07-16 Dolby Laboratories Licensing Corporation Spatial error metrics of audio content

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361405A (en) * 2006-01-03 2009-02-04 Slh音箱公司 Method and system for equalizing a loudspeaker in a room
WO2015017037A1 (en) * 2013-07-30 2015-02-05 Dolby International Ab Panning of audio objects to arbitrary speaker layouts
CN103593430A (en) * 2013-11-11 2014-02-19 胡宝清 Clustering method based on mobile object spatiotemporal information trajectory subsections
CN103593430B (en) * 2013-11-11 2017-03-22 胡宝清 Clustering method based on mobile object spatiotemporal information trajectory subsections
WO2015105748A1 (en) * 2014-01-09 2015-07-16 Dolby Laboratories Licensing Corporation Spatial error metrics of audio content

Also Published As

Publication number Publication date
CN106385660A (en) 2017-02-08

Similar Documents

Publication Publication Date Title
US11736890B2 (en) Method, apparatus or systems for processing audio objects
US11470437B2 (en) Processing object-based audio signals
EP3332557B1 (en) Processing object-based audio signals
JP6330034B2 (en) Adaptive audio content generation
CN112262585A (en) Ambient stereo depth extraction
JP7362826B2 (en) Metadata preserving audio object clustering
US10362426B2 (en) Upmixing of audio signals
US10278000B2 (en) Audio object clustering with single channel quality preservation
CN106385660B (en) Processing object-based audio signals
CN117837173A (en) Signal processing method and device for audio rendering and electronic equipment
US10779106B2 (en) Audio object clustering based on renderer-aware perceptual difference
WO2018017394A1 (en) Audio object clustering based on renderer-aware perceptual difference
RU2773512C2 (en) Clustering audio objects with preserving metadata

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant