CN116965062A - Clustering audio objects
- Publication number: CN116965062A (application CN202280015933.0A)
- Authority: CN (China)
- Legal status: Pending
Abstract
A method for clustering audio objects may involve identifying a plurality of audio objects, wherein each audio object of the plurality of audio objects is associated with respective metadata indicating respective spatial location information and respective rendering metadata. The method may involve assigning an audio object of the plurality of audio objects to a rendering metadata category of a plurality of rendering metadata categories, wherein at least one rendering metadata category comprises a plurality of rendering metadata types to be maintained. The method may involve determining an allocation of a plurality of clusters of audio objects to each rendering metadata category. The method may involve rendering the audio objects of the plurality of audio objects to an assigned plurality of audio object clusters based on metadata indicating spatial location information and based on assignment of the audio objects to rendering metadata categories.
Description
Cross Reference to Related Applications
The present application claims priority from the following priority applications: International patent application PCT/CN2021/077110, filed on 20 February 2021; U.S. provisional patent application 63/165,220, filed on 24 March 2021; U.S. provisional patent application 63/202,227, filed on 2 June 2021; and European patent application 21178179.4, filed on 8 June 2021, each of which is hereby incorporated by reference.
Technical Field
The present disclosure relates to systems, methods, and media for clustering audio objects.
Background
Audio content rendering devices capable of rendering spatially located audio content are becoming increasingly popular. For example, such devices may be capable of presenting audio content that is perceived as being at various spatial locations within a listener's three-dimensional environment. While some existing audio content presentation methods and devices provide acceptable performance under some conditions, improved methods and devices would be desirable.
Symbols and terms
Throughout this disclosure, including in the claims, the terms "speaker," "loudspeaker," and "audio reproduction transducer" are used synonymously to denote any sound-emitting transducer (or set of transducers). A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be fed by a single, common speaker feed or by multiple speaker feeds. In some examples, one or more speaker feeds may undergo different processing in different circuit branches coupled to different transducers.
Throughout this disclosure, including in the claims, the expression "performing an operation on" a signal or data (e.g., filtering, scaling, transforming, or applying gain to a signal or data) is used in a broad sense to mean performing an operation directly on a signal or data or on a processed version of a signal or data (e.g., a version of a signal that has undergone preliminary filtering or preprocessing prior to performing an operation thereon).
Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem implementing a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, where the subsystem generates M inputs and the other X-M inputs are received from external sources) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to mean a system or device that is programmable or otherwise configurable (e.g., in software or firmware) to perform operations on data (e.g., audio or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chip sets), digital signal processors programmed and/or otherwise configured to perform pipelined processing of audio or other sound data, programmable general purpose processors or computers, and programmable microprocessor chips or chip sets.
Throughout this disclosure, including in the claims, the term "cluster" or "clusters" is used to mean a cluster of audio objects. The terms "cluster" and "audio object cluster" should be understood as synonymous and used interchangeably. An audio object cluster is a combination of audio objects having one or more similar properties, such as audio objects having similar spatial locations and/or similar rendering metadata. In some cases, audio objects may be assigned into a single cluster, while in other cases, audio objects may be assigned into multiple clusters.
Disclosure of Invention
At least some aspects of the present disclosure may be implemented via a method. Some methods may involve identifying a plurality of audio objects, wherein each audio object of the plurality of audio objects is associated with respective metadata indicating respective spatial location information and respective rendering metadata. Some methods may involve assigning an audio object of the plurality of audio objects to a rendering metadata category of a plurality of rendering metadata categories, wherein at least one rendering metadata category includes a plurality of rendering metadata types to be maintained. Some methods may involve determining an assignment of a plurality of audio object clusters to each rendering metadata category, wherein an audio object cluster includes one or more audio objects of the plurality of audio objects having similar attributes. Some methods may involve rendering audio objects of the plurality of audio objects to an assigned plurality of audio object clusters based on metadata indicating spatial location information and based on assignment of the audio objects to rendering metadata categories.
In some examples, the rendering metadata categories include bypass mode categories and virtualization categories. In some examples, the plurality of rendering metadata types included in the virtualization category includes a plurality of virtualization types, each representing a distance from a head center to the audio object.
In some examples, the rendering metadata category includes one of a region category or a capture category.
In some examples, audio objects assigned to a first rendering metadata category are prohibited from being assigned to an audio object cluster of the plurality of audio object clusters that is assigned to a second rendering metadata category.
In some examples, determining the allocation of the plurality of audio object clusters to each rendering metadata category involves: (i) determining an initial allocation of an initial plurality of audio object clusters to each rendering metadata category; (ii) assigning audio objects to the initial plurality of audio object clusters based on the metadata indicating spatial location information and based on the assignment of the audio objects to rendering metadata categories; (iii) determining, for each rendering metadata category, a category cost of assigning the audio objects to the initial plurality of audio object clusters; (iv) determining an updated allocation of the initial plurality of audio object clusters to each rendering metadata category based at least in part on the category cost of each rendering metadata category; and (v) repeating (ii) through (iv) until a stopping criterion is reached. In some examples, determining the category cost of assigning the audio objects to the initial plurality of audio object clusters is based on the locations of the audio object clusters assigned to the rendering metadata category and the locations of the audio objects assigned to those audio object clusters. In some examples, the category cost is based on a left-to-right placement of an audio object relative to a left-to-right placement of the audio object cluster to which the audio object has been assigned. In some examples, determining the category cost of assigning the audio objects to the initial plurality of audio object clusters is based on the loudness of the audio objects. In some examples, determining the category cost is based on the distance of an audio object to the audio object cluster to which it has been assigned. In some examples, determining the category cost is based on the similarity of the rendering metadata type of an audio object to the rendering metadata type of the audio object cluster to which it has been assigned. In some examples, the method may involve determining a global cost based on the category cost of each rendering metadata category, wherein the updated allocation of the initial plurality of audio object clusters is based on the global cost. In some examples, determining the updated allocation includes changing the number of audio object clusters allocated to at least one of the plurality of rendering metadata categories. In some examples, the method may further involve determining a global cost based on the category cost of each rendering metadata category, wherein the number of audio object clusters is determined based on the global cost. In some examples, determining the number of audio object clusters includes minimizing the global cost under a constraint on the number of audio object clusters that indicates a maximum number of audio object clusters that can be added.
In some examples, rendering the audio objects of the plurality of audio objects to the assigned plurality of audio object clusters includes determining an object-to-cluster gain of each audio object of the plurality of audio objects when rendered to one or more audio object clusters assigned to a rendering metadata class to which the audio object is assigned. In some examples, the object-to-cluster gain assigned to the audio object of the first one of the plurality of rendering metadata categories is determined separately from the object-to-cluster gain assigned to the audio object of the second one of the plurality of rendering metadata categories. In some examples, the object-to-cluster gain assigned to the audio object of the first one of the plurality of rendering metadata categories and the object-to-cluster gain assigned to the audio object of the second one of the plurality of rendering metadata categories are jointly determined.
Some or all of the operations, functions, and/or methods described herein may be performed by one or more devices in accordance with instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, and the like. Thus, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via an apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some embodiments, the apparatus is or includes an audio processing system having an interface system and a control system. The control system may include one or more general-purpose single- or multi-chip processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or combinations thereof.
The present disclosure provides various technical advantages. For example, audio objects that are associated with spatial location information and with rendering metadata indicating the manner in which the audio objects are to be rendered may be clustered in a manner that maintains rendering metadata across different rendering metadata categories. In some cases, rendering metadata may not be maintained when audio objects within the same rendering metadata category are clustered. By clustering audio objects using a hybrid approach that maintains rendering metadata on a per-category basis, the techniques described herein allow audio signals with clustered audio objects to be generated in a way that reduces spatial distortion when the audio signals are rendered and reduces the bandwidth required to transmit such audio signals. Such audio signals may advantageously be more faithful to the intent of the audio content creator.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Drawings
Fig. 1A and 1B illustrate representations of example audio object clusters based on rendering metadata and spatial positioning metadata, according to some embodiments.
Fig. 2 illustrates an example of a process for clustering audio objects based on spatially positioned metadata while preserving rendering metadata, according to some embodiments.
Fig. 3 illustrates an example of a process for determining cluster allocation according to some embodiments.
Fig. 4 illustrates an example of a process for assigning audio objects to assigned clusters according to some embodiments.
Fig. 5 shows a block diagram illustrating an example of components of an apparatus capable of implementing aspects of the disclosure.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
Audio content rendering devices (whether rendering via loudspeakers or headphones) capable of rendering spatially located audio content are becoming increasingly popular. For example, such an audio content presentation device may be capable of presenting audio content that is perceived as being at various spatial locations within a listener's three-dimensional environment. Such audio content may be encoded in an audio format that includes an "audio bed," comprising audio content to be rendered at fixed spatial locations, and "audio objects," comprising audio content that may be rendered at varying spatial locations and/or for different durations. For example, an audio object may represent a sound effect associated with a moving object (e.g., a buzzing insect, a moving vehicle, etc.), music from a moving instrument (e.g., an instrument in a marching band), or other audio content whose position may change.
Each audio object may be associated with metadata describing how the audio object is to be rendered (referred to herein generally as "rendering metadata") and/or the spatial location at which the audio object is to be perceived when rendered (referred to herein generally as "spatial location metadata"). For example, the spatial location metadata may indicate a location within a three-dimensional (3D) space at which a listener will perceive the audio object to be located when it is rendered. The spatial location metadata may specify an azimuth location and/or an elevation location of the audio object. The rendering metadata, in turn, may indicate the manner in which the audio object is to be rendered. It should be noted that example rendering metadata types for a headphone rendering mode may differ from the rendering metadata types for a speaker rendering mode. In some implementations, rendering metadata may be associated with a rendering metadata category. For example, rendering metadata associated with a headphone rendering mode may be associated with a first category corresponding to a "bypass mode" category, in which room virtualization is not applied when rendering audio objects assigned to the first category, and a second category corresponding to a "room virtualization" category, in which room virtualization techniques are applied when rendering audio objects assigned to the second category. Continuing the example further, in some embodiments, a rendering metadata category may have multiple rendering metadata types within the category. As a more specific example, rendering metadata associated with the "room virtualization" category may have multiple rendering metadata types, such as "near," "middle," and "far," each of which may indicate a relative distance from the listener's head to the location within a room at which the audio object is to be rendered. As another example, rendering metadata associated with a speaker rendering mode may be associated with a first rendering metadata category corresponding to a "capture" (snap) mode, which indicates that an audio object is to be rendered to a particular speaker to enable point-source-style rendering, and a second rendering metadata category corresponding to a "zone mask" mode, which indicates that an audio object is not to be rendered to particular speakers included in a particular speaker group (referred to generally herein as a "zone mask"). As a more specific example, in some embodiments, the "capture" category of rendering metadata may include a rendering metadata type corresponding to a particular speaker. In some embodiments, the "capture" category of rendering metadata may include a binary value, wherein in response to the rendering metadata being "1" or "yes" (indicating that "capture" is enabled), the audio object is rendered by the closest speaker. As another more specific example, the "zone mask" category of rendering metadata may include rendering metadata types corresponding to different speaker groups (e.g., "left surround and right surround," "left and right," etc.) that will not be used to render the audio object. In some embodiments, the "zone mask" category of rendering metadata may instead indicate one or more speakers (e.g., "front," "rear," etc.) to which the audio object is to be rendered, with other speakers excluded or prohibited from rendering the audio object. A minimal data-structure sketch follows.
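As a minimal sketch (in Python, not part of the patent; the class and field names are illustrative assumptions), the per-object metadata described above might be represented as follows:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AudioObject:
    object_id: int
    position: Tuple[float, float, float]  # spatial location metadata (x, y, z)
    category: str                         # e.g. "bypass" / "virtualization" (headphones),
                                          # "capture" / "zone_mask" (speakers)
    metadata_type: Optional[str] = None   # e.g. "near" / "middle" / "far" within
                                          # the "virtualization" category

# Example: an object the content creator wants virtualized close to the head.
obj = AudioObject(object_id=0, position=(0.2, 0.9, 0.0),
                  category="virtualization", metadata_type="near")
```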
Metadata associated with the audio objects (whether spatial location metadata or rendering metadata) may be specified by the audio content creator and may thus represent the artistic intent of the audio content creator. Accordingly, it may be important to maintain spatial location metadata and/or rendering metadata so as to faithfully represent the artistic intent of the audio content creator. However, in some cases, such as in the soundtrack of a movie or television program, the audio content may include tens or hundreds of audio objects. Audio content formatted to include audio objects may therefore be large in size and quite complex, and transmitting such audio content for rendering may be difficult and may require a large amount of bandwidth. The increased bandwidth requirements may be particularly problematic for home viewers or listeners of such audio content, who may be more constrained by bandwidth considerations than movie theaters and the like.
To reduce audio content complexity, audio objects may be clustered based at least in part on spatial positioning metadata such that audio objects that are relatively close in location (e.g., azimuth position and/or elevation position) are assigned to the same cluster of audio objects. The audio object clusters may then be transmitted and/or rendered. By rendering audio objects assigned to the same audio object cluster using aggregated metadata associated with the audio object cluster, spatial complexity may be reduced, thereby reducing bandwidth for transmitting and/or rendering audio signals.
However, clustering audio objects without regard to rendering metadata and to the rendering metadata category to which each audio object has been assigned may create perceptual discontinuities. For example, assigning a first audio object that is assigned to the "bypass mode" category of rendering metadata to a cluster associated with the "room virtualization" category of rendering metadata may result in perceptual distortion, even if the first audio object and the other audio objects assigned to the cluster are associated with similar azimuth and/or elevation spatial locations. In particular, by assigning the first audio object to a cluster associated with the "room virtualization" category of rendering metadata, the audio object may undergo a transformation using a Head-Related Transfer Function (HRTF) to simulate a propagation path from a source to the listener's ears. The HRTF transformation may distort the perceived quality of the audio object, for example, by introducing timbre changes associated with rendering the audio object and/or by introducing temporal discontinuities if successive frames of audio content are assigned to different categories. Further, because the first audio object was assigned to the "bypass mode" category by the audio content creator, rendering the first audio object using an HRTF intended for audio objects assigned to the "room virtualization" category may cause the first audio object to be rendered in a manner that is not faithful to the intent of the audio content creator.
Clustering audio objects in a manner that strictly maintains rendering metadata categories and/or strictly maintains rendering metadata types within a particular rendering metadata category may also have drawbacks. For example, clustering audio objects while strictly maintaining rendering metadata may require a relatively large number of clusters, which increases the complexity of the audio signal and may require higher bandwidth for encoding and transmitting the audio signal. Alternatively, clustering audio objects while strictly maintaining rendering metadata with a limited number of clusters may result in spatial distortion, because two audio objects that have the same rendering metadata but are located relatively far from each other may be rendered to the same cluster.
The techniques, systems, methods, and media described herein assign and/or generate clusters of audio objects in a way that maintains rendering metadata categories in some cases, while in other cases allowing audio objects associated with a particular rendering metadata category, or a particular rendering metadata type within a rendering metadata category, to be clustered with audio objects associated with a different rendering metadata category or a different rendering metadata type. The techniques, systems, methods, and media described herein may reduce spatial complexity by clustering audio objects, thereby reducing the bandwidth required to transmit and/or render such audio objects, while also improving the perceived quality of the rendered audio objects by maintaining rendering metadata in some cases and not in others. In particular, by allowing flexible use of rendering metadata categories or types when assigning audio objects to audio object clusters, the spatial distortion produced by strict rendering-metadata constraints during clustering may be reduced or eliminated, while still reducing the complexity of the audio content and hence the bandwidth required to transmit it. An audio object cluster may be considered to be associated with audio objects having similar attributes, where the similar attributes may include similar spatial locations and/or similar rendering metadata (e.g., the same rendering metadata category, the same rendering metadata type, etc.). The similarity of spatial locations may be determined based on a distance (e.g., a Euclidean distance or any other suitable distance metric) between an audio object and the centroid of the cluster to which the audio object is assigned. In embodiments where an audio object may be rendered to multiple audio object clusters, the audio object may be associated with multiple weights, each corresponding to one audio object cluster, where the weights indicate the extent to which the audio object is rendered to a particular cluster. Continuing with this example, where the audio object is relatively far from a particular audio object cluster (e.g., the spatial location associated with the audio object is relatively far from the centroid associated with the audio object cluster), the weight associated with that audio object cluster may be relatively small (e.g., near or equal to 0). In some embodiments, two audio objects may be considered to have similar attributes based on the similarity of the weights that indicate how much each audio object is rendered to a particular audio object cluster.
In some implementations, clusters of audio objects may be generated such that audio objects assigned to a particular rendering metadata category (e.g., "bypass mode") are prohibited from being assigned to clusters having audio objects assigned to other rendering metadata categories (e.g., "virtualization mode"). In some such implementations, audio objects within a particular rendering metadata category may be assigned to clusters containing audio objects having the same rendering metadata type within the particular category and/or clusters containing audio objects having different rendering metadata types within the particular category. For example, in some implementations, a first audio object assigned to a "virtualization mode" category and having a "near" rendering metadata type (e.g., indicating that the first audio object is to be rendered relatively close to the listener's head) may be assigned to a cluster that includes a second audio object assigned to the "virtualization mode" category and having a "mid" rendering metadata type (e.g., indicating that the second audio object is to be rendered within a mid-distance range from the source to the listener's head). Continuing with this example, in some implementations, the first audio object may be prohibited from being assigned to a cluster that includes a third audio object assigned to the "virtualization mode" category and having a "far" rendering metadata type (e.g., indicating that the third audio object is to be rendered relatively far from the listener's head).
FIG. 1A illustrates an example 100 of a representation of a clustering of audio objects, wherein audio objects assigned to a particular rendering metadata category are not allowed to be clustered with audio objects assigned to other rendering metadata categories.
In example 100, there are two rendering metadata categories. Category 102 (denoted "category 1" in fig. 1A) corresponds to an audio object associated with "bypass mode" rendering metadata. Category 104 (denoted as "category 2" in fig. 1A) corresponds to an audio object associated with "virtualization mode" rendering metadata. The "virtualization mode" category of rendering metadata may have various potential rendering metadata types, such as "near", "medium", and/or "far" distances from the listener's head. Thus, audio objects assigned to the "virtualization mode" category of rendering metadata may have a rendering metadata type selected from one of "near", "medium", or "far" (as shown in fig. 1A and as depicted in fig. 1A by the shadow type applied to each audio object).
Fig. 1A illustrates a set of audio objects (e.g., audio object 106) that have been clustered based on spatial location metadata associated with the audio objects and based on rendering metadata categories associated with the audio objects. The assigned clusters are indicated as numbers within the circles depicting each audio object. For example, as shown in FIG. 1A, audio object 106 has been assigned to cluster "1". As another example, within category 104, audio object 108 has been assigned to cluster "4".
In the example 100 of fig. 1A, rendering metadata categories are strictly maintained when generating audio object clusters. For example, audio objects assigned to the "bypass mode" class of rendering metadata are prohibited from being assigned to clusters assigned to the "virtualization mode" class of rendering metadata. Similarly, audio objects assigned to the "virtualization mode" class of rendering metadata are prohibited from being assigned to clusters assigned to the "bypass mode" class of rendering metadata.
In example 100 of fig. 1A, audio objects assigned to a particular rendering metadata category may be clustered with other audio objects assigned to the same rendering metadata category but having different rendering metadata types within the category. For example, within category 104, audio objects 110 associated with "near" rendering metadata types within the "virtualization mode" category may be clustered with audio objects 112 and 114 each associated with "middle" rendering metadata types within the "virtualization mode" category. As another example, within category 104, audio objects 116 associated with a "middle" rendering metadata type within a "virtualization mode" category of rendering metadata and audio objects 118 and 120 each associated with a "far" rendering metadata type within a "virtualization mode" category of rendering metadata may be clustered.
It should be noted that the clustering of audio objects depicted in example 100 may be the result of a clustering algorithm or technique. For example, the clusters of audio objects depicted in example 100 may be generated using the techniques shown in and described below in connection with process 200 of fig. 2. In some implementations, an optimization algorithm or technique may be used to determine the number of audio object clusters assigned to each category shown in fig. 1A and/or the spatial centroid position of each cluster. For example, the allocation of clusters of audio objects may be iteratively determined to generate an optimal allocation using techniques shown in and described below in connection with process 300 of fig. 3. Additionally, in some implementations, assigning audio objects to a particular cluster may be accomplished by determining an object-to-cluster gain that describes a ratio or gain of the audio objects when rendered to the particular cluster, as described below in connection with process 400 of fig. 4.
In contrast, fig. 1B illustrates an example 150 of a representation of clusters of audio objects, wherein audio objects assigned to a particular rendering metadata category are allowed to be assigned to clusters assigned to other rendering metadata categories in some cases.
As illustrated in fig. 1B, audio objects assigned to a particular rendering metadata category may be allowed to be assigned to clusters assigned to different rendering metadata categories. For example, audio objects 152 and 154, each assigned to a "virtualization mode" category, are assigned to clusters assigned to a "bypass mode" category (e.g., category 102 of fig. 1B). As another example, audio objects 156 and 158, each assigned to a "bypass mode" category, are assigned to clusters assigned to a "virtualization mode" category (e.g., category 104 of fig. 1B).
It should be noted that while fig. 1A and 1B illustrate each audio object being assigned to a single cluster, audio objects may be assigned or rendered to multiple clusters (as described below in connection with fig. 2 and 4). The degree to which a particular audio object is assigned and/or rendered to a particular cluster is generally referred to herein as the "object-to-cluster gain." For example, for audio object j and cluster c, an object-to-cluster gain of 1 indicates that audio object j is fully assigned or rendered to cluster c. As another example, an object-to-cluster gain of 0.5 indicates that audio object j is assigned or rendered to cluster c with a gain of 0.5, and that the remaining signal associated with audio object j is rendered to other clusters. As yet another example, an object-to-cluster gain of 0 indicates that audio object j is not assigned or rendered to cluster c. A sketch of how such gains might be applied follows.
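As a hedged illustration (not from the patent text), the sketch below applies object-to-cluster gains to mix one object's signal into per-cluster buffers; signals are assumed to be plain Python lists of samples for a single frame:

```python
from typing import List

def render_to_clusters(signal: List[float], gains: List[float],
                       cluster_buffers: List[List[float]]) -> None:
    # Mix one object's signal into each cluster buffer, scaled by its
    # object-to-cluster gain g_{j,c}; the gains are assumed to sum to 1.
    for g, buf in zip(gains, cluster_buffers):
        for n, sample in enumerate(signal):
            buf[n] += g * sample

# Example: half of the object is rendered to each of two clusters.
buffers = [[0.0] * 4 for _ in range(2)]
render_to_clusters([1.0, 0.5, 0.25, 0.0], [0.5, 0.5], buffers)
```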
Fig. 2 illustrates an example of a process 200 for assigning clusters to different rendering metadata categories and assigning audio objects to the assigned clusters, according to some embodiments. The process 200 may be performed on a variety of devices, such as on a server that encodes an audio signal based on audio objects and associated metadata provided by an audio content creator. It should be noted that process 200 generally describes a process with respect to a single frame of audio content. However, it should be understood that in some embodiments, the blocks of process 200 may be repeated for one or more other frames of audio content, for example, to generate a complete output audio signal that is a compressed version of the input audio signal. In some implementations, one or more blocks of process 200 may be omitted. Additionally, in some implementations, two or more blocks of process 200 may be performed substantially in parallel. The blocks of process 200 may be performed in any order, not limited to the order shown in fig. 2.
Process 200 may begin at 202 with the identification of a set of audio objects, where each audio object is associated with spatial location metadata and rendering metadata. The audio objects of the set may be identified for a particular frame of the input audio signal, for example, by accessing a list or table associated with that frame. The spatial location metadata may indicate spatial location information (e.g., a location in 3D space) associated with rendering of the audio object. For example, the spatial location information may indicate azimuth and/or elevation locations of the audio object. As another example, the spatial location information may indicate a spatial location in Cartesian coordinates (e.g., (x, y, z) coordinates). The rendering metadata may indicate the manner in which the audio object is to be rendered.
At 204, process 200 may assign each audio object to a rendering metadata category. Example rendering metadata categories for the headphone rendering mode include a "bypass mode" category for rendering metadata and a "virtualization mode" category for rendering metadata. Example rendering metadata categories for speaker rendering modes include a "capture mode" category for rendering metadata and a "region mask" category for rendering metadata. Within the rendering metadata category, rendering metadata may be associated with a rendering metadata type.
In some implementations, at least one rendering metadata category may include one or more (e.g., two, three, five, ten, etc.) rendering metadata types. In the headphone rendering mode, example rendering metadata types within the "virtualization mode" category of rendering metadata include "near," "middle," and "far" virtualization. It should be noted that a rendering metadata type within the "virtualization mode" category may indicate a particular HRTF to be applied to an audio object to produce the virtualization indicated in the rendering metadata. For example, rendering metadata corresponding to "near" virtualization may specify that a first HRTF is to be used, while rendering metadata corresponding to "middle" virtualization may specify that a second HRTF is to be used. Example rendering metadata types within the "capture" category of rendering metadata may include a binary value indicating whether capture is to be enabled and/or an identifier of the particular speaker to which the audio object is to be rendered (e.g., "left speaker," "right speaker," or any other particular speaker). Example rendering metadata types within the "zone mask" category of rendering metadata include "left and right surround," "left and right speakers," or any other suitable combination of speakers, indicating one or more speakers to be included in or excluded from rendering the audio object.
At 206, process 200 may determine an assignment of clusters to each rendering metadata category. The process 200 may determine the assignment of clusters to each rendering metadata category such that the number of clusters assigned to each category optimally accommodates the audio objects in the set identified at block 202, subject to any suitable constraints. For example, process 200 may determine the allocation of clusters such that the total number of clusters across all rendering metadata categories is less than or equal to a predetermined maximum number of clusters (denoted generally herein as $M_{total}$). In some embodiments, the predetermined maximum number of clusters across all rendering metadata categories may be determined based on various criteria or requirements, such as the bandwidth required to transmit an encoded audio signal having the predetermined maximum number of clusters.
As another example, process 200 may determine the cluster allocation by iteratively optimizing the cluster allocation based at least in part on a cost function associated with the audio object to be assigned to each cluster. In some embodiments, the cost function may represent various criteria such as the distance of an audio object assigned to a particular cluster from the centroid of the cluster, the loudness of an audio object relative to an expected loudness of the audio object (e.g., as indicated by the audio content creator) when rendered to the particular cluster, and so forth. Various criteria that may be incorporated into the cost function are described in more detail below in connection with fig. 3. In some implementations, clusters may be assigned based on the assumption that audio objects assigned to a particular class are not allowed to be assigned to clusters assigned to a different class. It should be noted that an example of a process for determining the assignment of clusters of audio objects to each rendering metadata category is shown in fig. 3 and described below in connection with fig. 3.
At 208, process 200 may assign and/or render audio objects to the assigned clusters based on the spatial location metadata and on the assignment of the audio objects to rendering metadata categories. Assigning and/or rendering audio objects to assigned clusters based on spatial location metadata may involve assigning audio objects to clusters based on the spatial locations of the audio objects relative to the spatial locations (e.g., elevation and/or azimuth locations, Cartesian-coordinate locations, etc.) of the assigned clusters. For example, in some embodiments, process 200 may assign and/or render audio objects to the assigned clusters based on the spatial location metadata and on the centroid of each assigned cluster, such that audio objects having similar spatial locations are assigned to the same cluster. In some embodiments, the similarity of the spatial locations of audio objects may be determined based on a distance (e.g., a Euclidean distance) between the spatial location indicated in the spatial location metadata associated with an audio object and the centroid of a cluster. A sketch of this nearest-centroid assignment appears below.
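The following sketch (a simplification under the assumption that each object goes entirely to one cluster, as in fig. 1A) assigns each audio object to the nearest centroid among the clusters of its own rendering metadata category:

```python
import math

def assign_to_nearest(objects, clusters):
    # `objects`: iterable of AudioObject (see the sketch above);
    # `clusters`: dict mapping a category name to a list of centroid positions.
    assignment = {}
    for obj in objects:
        centroids = clusters[obj.category]  # strict case: same-category clusters only
        best = min(range(len(centroids)),
                   key=lambda c: math.dist(obj.position, centroids[c]))
        assignment[obj.object_id] = (obj.category, best)
    return assignment
```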
Assigning and/or rendering audio objects to assigned clusters based on their assignment to rendering metadata categories may involve maintaining rendering metadata categories by assigning audio objects only to clusters associated with the same rendering metadata category. For example, in some embodiments, process 200 may assign audio objects to the assigned clusters such that audio objects assigned to a first rendering metadata category (e.g., "bypass mode") are prohibited from being assigned and/or rendered to clusters assigned to a second rendering metadata category (e.g., "virtualization mode"), as shown in fig. 1A and described above in connection with fig. 1A. In some implementations, assigning and/or rendering audio objects to assigned clusters based on their assignment to rendering metadata categories may instead involve allowing audio objects to be assigned to clusters associated with different rendering metadata categories. For example, in some embodiments, process 200 may assign and/or render audio objects to the assigned audio object clusters such that audio objects assigned to a first rendering metadata category (e.g., "bypass mode") are allowed to be assigned to audio object clusters assigned to a second rendering metadata category (e.g., "virtualization mode"), as shown in fig. 1B and described above in connection with fig. 1B. Cross-category assignment of audio objects may be desirable, for example, where it reduces spatial distortion (e.g., due to the positions of the audio object clusters relative to the positions of the audio objects). It should be noted that cross-category assignment of audio objects may introduce a timbre change in the perceived quality of an audio object when it is rendered to an audio object cluster associated with a different rendering metadata category. As another example, in some embodiments, process 200 may assign audio objects such that, within a particular rendering metadata category, audio objects associated with a first rendering metadata type (e.g., "near" virtualization) are allowed to be clustered with other audio objects associated with a second rendering metadata type (e.g., "middle" virtualization), as shown with respect to category 104 in fig. 1A and 1B. It should be noted that an example process for assigning and/or rendering audio objects to assigned audio object clusters subject to various constraints is shown in fig. 4 and described below in connection with fig. 4.
Assigning and/or rendering an audio object to a particular cluster may include determining an audio object-to-cluster gain that indicates the gain applied to the object when it is rendered as part of the audio object cluster. For a particular audio object j and audio object cluster c, the audio object-to-cluster gain is generally represented herein as $g_{j,c}$. As described above, it should be noted that audio object j may be rendered to a plurality of audio object clusters, where the audio object-to-cluster gain for a particular audio object j and a particular cluster c indicates the gain applied to the audio object when rendering audio object j as part of cluster c. In some embodiments, the gain $g_{j,c}$ may be in the range of 0 to 1, where the value indicates the proportion of the input audio signal for audio object j to be applied when rendering audio object j to audio object cluster c. In some implementations, the sum of the gains of a particular audio object j over all clusters c is 1, which indicates that the entirety of the input audio signal associated with audio object j must be distributed over the clusters.
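One hedged way to produce gains with these properties is sketched below; the inverse-distance affinity is an assumption for illustration only, since the patent determines the gains by minimizing penalty functions as described in connection with fig. 3:

```python
import math

def object_to_cluster_gains(obj_position, centroids, eps=1e-6):
    # Inverse-distance affinities, normalized so the gains sum to 1 over
    # all clusters, matching the constraint described above.
    affinities = [1.0 / (math.dist(obj_position, c) + eps) for c in centroids]
    total = sum(affinities)
    return [a / total for a in affinities]
```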
FIG. 3 illustrates an example of a process 300 for generating cluster assignments across multiple rendering metadata categories, according to some embodiments. The blocks of process 300 may be implemented on any suitable device, such as on a server that generates an encoded audio signal based on audio objects included in an input audio signal. It should be noted that process 300 generally describes a process with respect to a single frame of audio content, however, it should be understood that in some embodiments, the blocks of process 300 may be repeated for one or more other frames of audio content, for example, to obtain a cluster allocation of multiple frames of audio content. In some implementations, one or more blocks of process 300 may be omitted. Additionally, in some implementations, two or more blocks of process 300 may be performed substantially in parallel. The blocks of process 300 may be performed in any order, not limited to the order shown in fig. 3.
In general, process 300 may begin with an initial assignment of clusters to rendering metadata categories. In some implementations, the process 300 may iteratively loop through blocks 304-318, described below, to optimally assign clusters to rendering metadata categories after starting with the initial assignment. In some implementations, the allocation may be optimized by minimizing a global cost function that combines the cost functions of each rendering metadata category. The cost function of a rendering metadata category is generally referred to herein as an "intra-category cost function." The intra-category cost function of a rendering metadata category may indicate a cost associated with assigning audio objects to the particular clusters assigned to that rendering metadata category during the current iteration through blocks 304-318. In some implementations, the intra-category cost function may be based on a corresponding intra-category penalty function, as described below in connection with block 314. The intra-category penalty function may depend on one or more intra-category penalty terms, as described below in connection with blocks 304-310. Each intra-category penalty term may in turn depend on the audio object-to-cluster gain for a particular audio object j and cluster c, denoted generally herein as $g_{j,c}$. The object-to-cluster gains may be determined by minimizing a total intra-category penalty function for a particular rendering metadata category (e.g., as described below in connection with block 312), where the total intra-category penalty function associated with the category is the sum of the intra-category penalty terms. In other words, via blocks 304-312, process 300 may determine object-to-cluster gains that minimize the intra-category penalty function of each rendering metadata category for the current allocation of clusters to rendering metadata categories during the current iteration through blocks 304-318. The object-to-cluster gains may be used to determine the intra-category cost function of each rendering metadata category. The intra-category cost functions may then be combined to generate a global cost function, and the clusters may be reassigned by minimizing the global cost function.
Process 300 may begin at 302 with determining an initial assignment of clusters to rendering metadata categories, where each rendering metadata category is assigned a subset of the clusters. In some embodiments, clusters may be allocated such that the total number of allocated clusters is less than or equal to a predetermined maximum number of clusters, generally denoted herein as $M_{total}$. For example, in the case where the first rendering metadata category is assigned m clusters and the second rendering metadata category is assigned n clusters, $m + n \le M_{total}$. $M_{total}$ may be determined based on any suitable criteria, such as the total number of audio objects to be clustered, the available bandwidth for transmitting an encoded audio signal based on the clustered audio objects, etc. For example, $M_{total}$ may be determined such that the bandwidth for transmitting an encoded audio signal having $M_{total}$ clusters is less than a threshold bandwidth. In some implementations, each rendering metadata category may be assigned at least one cluster.
The process 300 may determine a centroid for each initially assigned cluster. For example, in some implementations, the centroid of a cluster may be determined based on the perceptually most significant audio objects assigned to the rendering metadata category associated with the cluster. As a more specific example, for a first rendering metadata category (e.g., "bypass mode") that is initially assigned m clusters, the centroid of each of the m clusters may be determined based at least in part on the perceptual salience of the audio objects assigned to the first rendering metadata category. For example, in some implementations, the m perceptually most significant audio objects assigned to the first rendering metadata category may be identified. The m perceptually most significant audio objects may be identified based on various criteria, such as their loudness, their spatial distance from other audio objects assigned to the first rendering metadata category, timbre differences relative to other audio objects in the first rendering metadata category, and so on. In some implementations, the perceptual salience of audio objects may be determined based on differences between the audio objects. For example, two audio objects that include speech content in different languages may be determined to be perceptually distinct from each other. The centroids of the clusters assigned to each other rendering metadata category may be determined in a similar manner. A sketch of a loudness-based selection appears below.
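A minimal sketch of this initialization, assuming loudness alone is used as the salience criterion (the text above also mentions spatial spread and timbre), follows:

```python
def initial_centroids(objects_in_category, loudness, m):
    # Use the positions of the m loudest objects in the category as the
    # initial cluster centroids; `loudness` maps object_id -> loudness value.
    most_salient = sorted(objects_in_category,
                          key=lambda o: loudness[o.object_id],
                          reverse=True)[:m]
    return [o.position for o in most_salient]
```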
At 304, the process 300 may generate, for each rendering metadata category, a first intra-category penalty term indicating a difference between a location of an audio object assigned or rendered to an initially assigned audio object cluster in the category and a location (e.g., centroid location) of the initially assigned audio object cluster.
The position of audio object j is generally referred to herein as $p_j$. In some implementations, the position of audio object j is specified by the audio content creator. The position of cluster c is generally referred to herein as $p_c$; it may indicate the location of the centroid of cluster c, as described above in connection with block 302.
The reconstructed position of audio object j after being rendered into one or more clusters is generally referred to herein as $\hat{p}_j$. An example equation for calculating $\hat{p}_j$ is given by:

$$\hat{p}_j = \sum_c g_{j,c} \, p_c$$

In some embodiments, $p_j$, $p_c$, and $\hat{p}_j$ may be three-dimensional vectors representing spatial positions, which may be expressed in Cartesian coordinates.
The first intra-category penalty term (generally referred to herein as $E_P$) may indicate an aggregate difference between the position of the audio object when assigned or rendered to one or more clusters and the original position of the audio object. An example equation for determining the first intra-category penalty term is given by:

$$E_P = \left\| \hat{p}_j - p_j \right\|^2$$
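A sketch of this first penalty term, under the reconstruction of $\hat{p}_j$ and $E_P$ given above, is:

```python
def position_penalty(p_j, centroids, gains):
    # Reconstructed position: gain-weighted sum of cluster centroids.
    p_hat = tuple(sum(g * c[k] for g, c in zip(gains, centroids))
                  for k in range(3))
    # E_P: squared distance between reconstructed and original positions.
    return sum((p_hat[k] - p_j[k]) ** 2 for k in range(3))
```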
it should be noted that with respect to the first in-class penalty term described above and other in-class penalty terms described below in connection with blocks 306-310, these in-class penalty terms are generally described with respect to a single audio object j. An intra-class penalty term may be calculated for each audio object, and a sum may be calculated over all audio objects assigned to a particular rendering metadata class.
At 306, the process 300 may generate, for each rendering metadata category, a second intra-category penalty term indicating the distance between audio objects assigned or rendered to the initially assigned clusters in the category and those clusters. The second intra-category penalty term is generally referred to herein as $E_D$. $E_D$ may be determined based on a distance measure between audio object j and the clusters c to which audio object j is assigned. An example equation for calculating $E_D$ is given by:

$$E_D = \sum_c g_{j,c} \, \tilde{d}^2(j, c)$$
in the above equation, the data of the equation,representing the distance between the position of the audio object j and the position of cluster c. Since audio objects positioned in the left region will generate perceptual artifacts when rendered to clusters in the right region (or vice versa), the distance between the position of audio object j and the position of cluster c is a modified distance that effectively penalizes assigning audio object j to cluster c positioned in different azimuthal hemispheres when binaural rendering is performed. An example equation for calculating the modified distance between the audio object j and cluster c is given by:
In the above equation, $\Lambda$ may represent a 3×3 diagonal matrix, which is given by:

$$\Lambda = \begin{pmatrix} \lambda_{xx} & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
in the above formula lambda xx May be different depending on whether the positions of the audio object j and the cluster c are located in different left/right regions. For determining lambda xx An example of an equation for the value of (2) is given by:
in the above formula, x j And x c The x-coordinates of the audio object position and the cluster position are represented, respectively. In the above formula, a is a constant between 0 and 1.
At 308, the process 300 may generate, for each rendering metadata category, a third intra-category penalty term indicating how well the loudness of an audio object is preserved when the audio object is assigned or rendered to the respective clusters assigned to the rendering metadata category. In other words, the third intra-category penalty term may indicate a change in the energy or amplitude of the audio object when rendered to the respective clusters, where the energy or amplitude is perceived by the listener as loudness. Accordingly, by minimizing the third intra-category penalty term, perceptual artifacts introduced by rendering audio objects with boosted or attenuated amplitude (and thus boosted or attenuated loudness) may be minimized. The third intra-category penalty term is generally referred to herein as $E_N$. One example equation for calculating the third intra-category penalty term, which penalizes any deviation of the summed gains from unity, is given by:

$$E_N = \left( 1 - \sum_c g_{j,c} \right)^2$$
In some implementations, at 310, the process 300 can generate a fourth intra-category penalty term that indicates a mismatch between the rendering metadata type associated with the audio object and the rendering metadata type of the cluster to which the audio object is assigned or rendered. It should be noted that block 310 may be omitted for categories that do not include multiple rendering metadata types within the rendering metadata category. For example, for the "bypass mode" category of rendering metadata, the penalty term within the fourth category may not be calculated.
As an example, in the case of headphone rendering, the fourth intra-category penalty term may indicate a mismatch between a virtualization type (e.g., "near," "middle," or "far") associated with the "virtualization mode" category of rendering metadata of the audio object and the virtualization type of one or more clusters to which the audio object is assigned or rendered. In effect, the fourth intra-category penalty term may penalize, for example, assigning an audio object having a particular virtualization type (e.g., "near," "middle," or "far") to a cluster associated with a different virtualization type. In some implementations, the penalty amount may depend on the distance between the different virtualization types. For example, assigning a first audio object having a "near" virtualization type to a cluster associated with a "far" virtualization type may be associated with a greater penalty than assigning a second audio object having a "near" virtualization type to a cluster associated with a "middle" virtualization type. An example form for the fourth intra-category penalty term (generally referred to herein as E_G) is:

E_G = Σ_c ḡ_{j,c}² · U_{HRM(j),HRM(c)}
In the equation given above, U_{HRM(j),HRM(c)} represents an element of a matrix U that defines penalty weights for the various combinations of virtualization types of the audio object j and the cluster c. Each row of the matrix U may indicate a virtualization type associated with an audio object, and each column of the matrix U may indicate a virtualization type associated with a cluster to which the audio object has been assigned or rendered. For example, the matrix element [HRM(j), HRM(c)] may indicate the penalty weight for an audio object j of the virtualization type indicated by HRM(j) when assigned or rendered to a cluster c with the virtualization type HRM(c). In some embodiments, the matrix U may be symmetric, such that the same penalty weight is used for an audio object of a first virtualization type assigned or rendered to a cluster of a second virtualization type as for an audio object of the second virtualization type assigned or rendered to a cluster of the first virtualization type. In some implementations, the diagonal of the matrix U may be 0, indicating that the virtualization type associated with the audio object matches the virtualization type associated with the cluster. One illustrative example of such a matrix U (with rows and columns ordered "near," "middle," "far") is:

U = [ 0    0.3  0.7
      0.3  0    0.3
      0.7  0.3  0   ]
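A minimal sketch of evaluating E_G for one audio object follows, assuming the gain-squared weighting shown above and the illustrative 3×3 matrix U; the names, type ordering, and values are for illustration only.

```python
import numpy as np

VIRT_TYPES = {"near": 0, "middle": 1, "far": 2}
U = np.array([[0.0, 0.3, 0.7],
              [0.3, 0.0, 0.3],
              [0.7, 0.3, 0.0]])  # symmetric, zero diagonal

def penalty_E_G(obj_type, cluster_types, gains):
    """Mismatch penalty between an object's virtualization type and the
    virtualization types of the clusters it is assigned or rendered to."""
    row = VIRT_TYPES[obj_type]
    cols = [VIRT_TYPES[t] for t in cluster_types]
    g = np.asarray(gains, dtype=float)
    return float(np.sum(g ** 2 * U[row, cols]))
```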
At 312, process 300 may determine an object-to-cluster gain for each audio object and each cluster assigned to the rendering metadata category associated with the audio object. The object-to-cluster gains may be determined by minimizing a category penalty function corresponding to the rendering metadata category associated with the audio object. For example, for an audio object associated with the "bypass mode" category of rendering metadata, object-to-cluster gains for the audio object may be determined for one or more clusters assigned to the "bypass mode" category of rendering metadata. As another example, for an audio object associated with the "virtualization mode" category of rendering metadata, object-to-cluster gains for the audio object may be determined for one or more clusters assigned to the "virtualization mode" category of rendering metadata.
The category penalty function for a particular rendering metadata category may be determined as a sum (e.g., a weighted sum) of any of the intra-category penalty terms determined at blocks 304-310. For example, in some implementations, the category penalty function of the "virtualization mode" category of rendering metadata may be a weighted sum of the first intra-category penalty term determined at block 304, the second intra-category penalty term determined at block 306, the third intra-category penalty term determined at block 308, and/or the fourth intra-category penalty term determined at block 310. An example equation for a category penalty function that is a weighted sum of the intra-category penalty terms determined at blocks 304-310 (and that may be used in some implementations as the category penalty function for the "virtualization mode" category of rendering metadata) is given by:
E_cat1 = w_P·E_P + w_D·E_D + w_N·E_N + w_G·E_G
In some implementations, a category penalty function may be calculated that does not include a penalty term indicating a mismatch between the rendering metadata type associated with an audio object and the rendering metadata type of the cluster to which the audio object is assigned or rendered. For example, such a category penalty function may be determined for the "bypass mode" category. In some implementations, such a category penalty function may be a weighted sum of the first intra-category penalty term determined at block 304, the second intra-category penalty term determined at block 306, and/or the third intra-category penalty term determined at block 308. An example equation for a category penalty function that is a weighted sum of the intra-category penalty terms determined at blocks 304-308 (and that may be used in some implementations as the category penalty function for the "bypass mode" category of rendering metadata) is given by:
E_cat2 = w_P·E_P + w_D·E_D + w_N·E_N
It should be noted that, in the example given above for calculating the category penalty function E_cat2, the category penalty function may be obtained from the category penalty function E_cat1 by setting the fourth intra-category penalty term E_G to 0.
It should be noted that the example category penalty functions described above are merely illustrative. In some implementations, the category penalty function may be any suitable weighted sum of the intra-category penalty terms, such as a weighted sum of the first and second intra-category penalty terms, a weighted sum of the second and fourth intra-category penalty terms, and so on.
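The weighted-sum structure can be sketched as follows; the default weights and the convention of passing E_G = 0 for categories without multiple rendering metadata types (e.g., "bypass mode") are assumptions for illustration.

```python
def category_penalty(E_P, E_D, E_N, E_G=0.0,
                     w_P=1.0, w_D=1.0, w_N=1.0, w_G=1.0):
    """Weighted sum of the intra-category penalty terms (blocks 304-310).

    Setting E_G (or w_G) to 0 reduces E_cat1 to E_cat2, as noted above.
    """
    return w_P * E_P + w_D * E_D + w_N * E_N + w_G * E_G
```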
As described above, for a given audio object j associated with a particular rendering metadata category, the object-to-cluster gains may be determined by minimizing the category penalty function associated with that rendering metadata category. These object-to-cluster gains indicate the gain of the audio object j when rendered to one or more clusters (e.g., indicated as elements of a gain vector). For example, for an audio object associated with the "bypass mode" category of rendering metadata, the object-to-cluster gains may be determined by minimizing the "bypass mode" category penalty function (e.g., E_cat2 in the equation above). The gain vector of the audio object j, referred to herein as ḡ_j, may be calculated by minimizing the associated category penalty function E, for example via the equation ḡ_j = argmin_g E(g), where E is the category penalty function of the rendering metadata category associated with the audio object j.
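A hedged sketch of this minimization using a generic numerical optimizer is below; the disclosure does not specify a solver, so scipy's bounded minimizer, the gain bounds, and the uniform initialization are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def solve_object_gains(E, n_clusters):
    """Return the gain vector g_bar_j = argmin_g E(g) for one audio object.

    E: callable mapping a gain vector (length n_clusters) to a scalar
    category penalty; gains are constrained to [0, 1] here by assumption.
    """
    g0 = np.full(n_clusters, 1.0 / n_clusters)  # uniform initial gains
    res = minimize(E, g0, bounds=[(0.0, 1.0)] * n_clusters)
    return res.x
```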
At 314, process 300 may calculate, for each rendering metadata category, an intra-category cost function based on the object-to-cluster gains of the audio objects associated with the rendering metadata category. In some implementations, the intra-category cost function may be determined based on the loudness of the audio objects within the rendering metadata category. Additionally or alternatively, in some embodiments, the intra-category cost function may be determined based on the corresponding category penalty function (e.g., E_cat1 and/or E_cat2, as described above). For example, an intra-category cost function based on the category penalty function E may take the form:

l = Σ_j N′_j · E_j

where E_j is the value of the category penalty function for audio object j.
In the equation given above, N′_j represents the partial loudness of the audio object j. It should be noted that the intra-category cost function may be based at least in part on any combination of: 1) the locations of the audio object clusters relative to the locations of the audio objects assigned to those clusters (e.g., based on the first intra-category penalty term described above at block 304); 2) the left/right placement of the audio object relative to the left/right placement of the cluster to which the audio object has been assigned (e.g., based on the second intra-category penalty term described above at block 306); 3) the distance of the audio object to the cluster to which the audio object has been assigned (e.g., based on the second intra-category penalty term described above at block 306); 4) the loudness of the audio objects (e.g., based on the third intra-category penalty term described above at block 308); and/or 5) the similarity of the rendering metadata type associated with the audio object to the rendering metadata type associated with the cluster to which the audio object has been assigned (e.g., based on the fourth intra-category penalty term described above at block 310).
In some implementations, the intra-category cost function may be determined as a loudness-weighted sum of the position differences between the audio objects and the clusters. For example, such an intra-category cost function may take the form:

l = Σ_j N′_j · ‖p_j − Σ_c ḡ_{j,c} p_c‖²

where the inner sum is the gain-weighted position at which audio object j is effectively rendered.
It should be noted that an intra-category cost function may be determined for each rendering metadata category. For example, a first intra-category cost function l_1 may be determined for the "virtualization mode" category of rendering metadata, and a second intra-category cost function l_2 may be determined for the "bypass mode" category of rendering metadata. Similarly, when audio objects are clustered for rendering in a speaker rendering mode, intra-category cost functions for the region mask category, the capture category, etc. may be calculated.
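A sketch of the loudness-weighted position-difference cost above; the array shapes and names are assumptions.

```python
import numpy as np

def intra_category_cost(obj_positions, cluster_positions, gains, partial_loudness):
    """obj_positions: (J, 3); cluster_positions: (C, 3); gains: (J, C);
    partial_loudness: (J,). Returns the loudness-weighted sum of squared
    differences between each object's position and its gain-weighted
    rendered position."""
    rendered = gains @ cluster_positions                  # (J, 3)
    err = np.sum((obj_positions - rendered) ** 2, axis=1)
    return float(np.sum(partial_loudness * err))
```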
At 316, process 300 may calculate a global cost function that combines the intra-category cost functions across the different rendering metadata categories. For example, the global cost function may combine a first intra-category cost function associated with the "virtualization mode" category of rendering metadata (e.g., l_1 in the example given above) and a second intra-category cost function associated with the "bypass mode" category of rendering metadata (e.g., l_2 in the example given above). An example equation for computing the global cost function (generally referred to herein as l_global) is given by:
l_global = a·l_1 + (1 − a)·l_2
in the equation given above, a is a weighting constant indicating the weight or importance of each rendering metadata category.
At 318, the process 300 may reassign clusters to rendering metadata categories based at least in part on the global cost function determined at block 316. For example, in some implementations, the process 300 may reassign clusters by selecting, for each category, a number of clusters that minimizes the global cost function l_global. As a more specific example, in some implementations, the process 300 may select the number of clusters to be assigned to a first rendering metadata category as m and the number of clusters to be assigned to a second rendering metadata category as n.
In some implementations, the number of clusters assigned to a particular rendering metadata category in the current frame may differ from the number of clusters assigned to that rendering metadata category in the previous frame (e.g., as a result of applying process 300 to the previous frame). In some embodiments, the change in the number of clusters allocated in the current frame relative to the previous frame may be due to: a different number of audio objects indicated in the current frame relative to the previous frame, a different number of active audio objects indicated in the current frame relative to the previous frame, and/or variation in the spatial positions of the active audio objects across the frames of the audio signal. As an example, m clusters may be allocated to the first rendering metadata category in the current frame, and m′ clusters may be allocated to the first rendering metadata category in the previous frame. In a case where two overlapping signals comprising audio objects assigned to different rendering metadata categories are to be added in the current frame, and where there are no free clusters available in the current frame for allocation to the first category, rendering may introduce artifacts. Adding additional clusters to a particular rendering metadata category from among clusters not previously assigned to any rendering metadata category may allow for more accurate clustering of the audio objects assigned to that category without introducing rendering artifacts.
In some implementations, given that m′ clusters are assigned to the first rendering metadata category in the previous frame, n′ clusters are assigned to the second rendering metadata category in the previous frame, m clusters are assigned to the first rendering metadata category in the current frame, and n clusters are assigned to the second rendering metadata category in the current frame, the increases in clusters in the first rendering metadata category and the second rendering metadata category are given by, respectively:
Δm = max(0, m − m′) and Δn = max(0, n − n′)
The number of clusters available for allocation to the first rendering metadata category or the second rendering metadata category may be given by m_free = M_total − (m′ + n′). In some implementations, the process 300 may reassign clusters to the first rendering metadata category and the second rendering metadata category by minimizing l_global(m, n) subject to the following constraints:
m + n ≤ M_total and Δm + Δn ≤ m_free. It should be noted that where cross-category assignment of audio objects is not allowed (e.g., assignment to clusters associated with a rendering metadata category other than the category associated with the audio object), the process 300 may reassign clusters under this constraint.
For example, in a case where M_total is 21 (e.g., a maximum of 21 clusters can be assigned across all rendering metadata categories), m′ is 11, and n′ is 10, m_free is 0 because m′ + n′ = M_total. Continuing with this example, process 300 may then determine at block 318 that neither m nor n can be increased, because there are no clusters available for allocation. As a specific example, if m is set to 13 and n is set to 8 (e.g., to satisfy the criterion m + n ≤ M_total), then Δm is 2 and Δn is 0. However, since Δm + Δn = 2, which is greater than m_free (which is 0), the process 300 may determine that 13 is not a valid value of m for the current frame.
It should be noted that although the above examples describe two rendering metadata categories, the same techniques may be applied to any suitable number of rendering metadata categories (e.g., three, four, etc.). For example, process 300 may minimize l_global(m_i) such that Σ_i m_i ≤ M_total and Σ_i Δm_i ≤ m_free.
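A brute-force sketch of the constrained reallocation for the two-category case is below; treating l_global as a callable over (m, n) and the exhaustive search are assumptions for illustration (the disclosure does not prescribe a search strategy).

```python
from itertools import product

def reallocate(l_global, m_prev, n_prev, M_total):
    """Pick (m, n) minimizing l_global(m, n) subject to
    m + n <= M_total and delta_m + delta_n <= m_free."""
    m_free = M_total - (m_prev + n_prev)
    best, best_cost = (m_prev, n_prev), float("inf")
    for m, n in product(range(M_total + 1), repeat=2):
        dm, dn = max(0, m - m_prev), max(0, n - n_prev)
        if m + n <= M_total and dm + dn <= m_free:  # constraints from above
            cost = l_global(m, n)
            if cost < best_cost:
                best, best_cost = (m, n), cost
    return best
```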
Process 300 may then loop back to block 304. The process 300 may loop through blocks 304-318 until a stopping criterion is reached. Examples of stopping criteria include determining that a minimum of the global cost function determined at block 316 has been reached, determining that blocks 304-318 have been performed more than a predetermined threshold number of iterations, and so on. In some implementations, the allocation determined as a result of looping through blocks 304-318 until the stopping criterion is reached may be referred to as an "optimal allocation."
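The overall loop with its stopping criteria can be sketched as follows; the convergence tolerance and iteration cap are illustrative assumptions.

```python
def iterate_allocation(step_fn, max_iters=50, tol=1e-6):
    """Repeat blocks 304-318 until the global cost stops improving or a
    threshold number of iterations is exceeded.

    step_fn: runs one pass of blocks 304-318 and returns
    (allocation, global_cost).
    """
    allocation, prev_cost = None, float("inf")
    for _ in range(max_iters):
        allocation, cost = step_fn()
        if prev_cost - cost < tol:  # (approximate) minimum reached
            break
        prev_cost = cost
    return allocation
```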
It should be noted that the blocks of process 300 may be performed to determine the assignment of clusters to rendering metadata categories for particular frames of an input audio signal. The blocks of process 300 may be repeated for other frames of the input audio signal to determine an allocation of clusters to rendering metadata categories for the other frames of the input audio signal. For example, in some implementations, the process 300 may repeat the blocks of the process 300 for each frame of the input audio signal, for every other frame of the input audio signal, and so on.
Fig. 4 illustrates an example of a process 400 for rendering audio objects to clusters, according to some embodiments. The blocks of process 400 may be implemented on any suitable device, such as on a server that generates an encoded audio signal based on audio objects included in an input audio signal. It should be noted that process 400 is generally described with respect to a single frame of audio content; however, it should be understood that, in some embodiments, the blocks of process 400 may be repeated for one or more other frames of audio content, for example, to generate a complete output audio signal as a compressed version of the input audio signal. In some implementations, one or more blocks of process 400 may be omitted. Additionally, in some implementations, two or more blocks of process 400 may be performed substantially in parallel. The blocks of process 400 may be performed in any order, not limited to the order shown in fig. 4.
Process 400 may begin at 402 with obtaining an assignment of clusters to rendering metadata categories. For example, the allocation may indicate that a number of clusters are allocated to each rendering metadata category. As a more specific example, the allocation may indicate that a first number of clusters are allocated to a first rendering metadata category (e.g., a "bypass mode" category of rendering metadata) and a second number of clusters are allocated to a second rendering metadata category (e.g., a "virtualization mode" category of rendering metadata). In speaker rendering mode, other rendering metadata categories may include a "capture" category of rendering metadata, a "region mask" category of rendering metadata, and so forth. In some implementations, the allocation of clusters may also indicate a centroid location for each cluster. In some implementations, the centroid location of each cluster may be used to calculate a penalty function that is used to determine the object-to-cluster gain at block 404.
In some implementations, the assignment of clusters to rendering metadata categories may be the result of an optimization process subject to various constraints or criteria (e.g., a maximum number of clusters) to determine an optimal assignment of clusters to rendering metadata categories. An example process for determining the assignment of clusters to rendering metadata categories is shown in fig. 3 and described above in connection with fig. 3.
It should be noted that the assignment of clusters to rendering metadata categories may be specified for individual frames of the input audio signal. For example, the obtained allocation may indicate: m' clusters are assigned to a first rendering metadata category for a first frame of the input audio signal and m clusters are assigned to the first rendering metadata category for a second frame of the input audio signal. The first frame of the input audio signal and the second frame of the input audio signal may or may not be consecutive frames.
At 404, the process 400 may determine, for each audio object in a frame of the input audio signal, an object-to-cluster gain assigned to a cluster of rendering metadata categories associated with the audio object. For example, where an audio object is associated with a "bypass mode" category of rendering metadata and m clusters have been assigned to the "bypass mode" category of rendering metadata, process 400 may determine an object-to-cluster gain for the audio object when rendered to m clusters assigned to the "bypass mode" category of rendering metadata. It should be noted that the object-to-cluster gain for a particular audio object rendered to a particular cluster may be 0, indicating that the audio object is not assigned or rendered to that cluster.
In some implementations, the process 400 can determine the object-to-cluster gains by minimizing a category penalty function for each rendering metadata category individually. It should be noted that determining object-to-cluster gains by minimizing the penalty function for each rendering metadata category individually would prohibit assigning or rendering audio objects associated with a first rendering metadata category to clusters assigned to a second, different rendering metadata category. For example, in such an embodiment, audio objects associated with the "bypass mode" category of rendering metadata would be prohibited from being assigned and/or rendered to clusters assigned to the "virtualization mode" category of rendering metadata. An example of such clusters is shown in fig. 1A and described above in connection with fig. 1A.
In some implementations, the category penalty function may be the category penalty function described in connection with block 312 of fig. 3. For example, in connection with iterations of the blocks of process 300, the category penalty function may be the final category penalty function determined for the final allocation when the stopping criterion is reached. As a specific example, where four intra-category penalty terms are determined (e.g., in the case of the headphone rendering mode, and for the "virtualization mode" category of rendering metadata), the category penalty function may be (as described in connection with block 312 of fig. 3):
E = w_P·E_P + w_D·E_D + w_N·E_N + w_G·E_G
As another specific example, where three intra-category penalty terms are determined (e.g., in the case of the headphone rendering mode, and for the "bypass mode" category of rendering metadata), the category penalty function may be (as described in connection with block 312 of fig. 3):
E = w_P·E_P + w_D·E_D + w_N·E_N
for example, in the case of a headphone rendering mode, process 400 may determine a first set of object-to-cluster gains for a first set of audio objects associated with a "bypass mode" category by minimizing a first penalty function associated with the "bypass mode" category of rendering metadata and for clusters assigned to the "bypass mode" category (e.g., as indicated in the assignment obtained at block 402). Continuing with this example, process 400 may determine a second set of object-to-cluster gains for a second set of audio objects associated with the "virtualization mode" category by minimizing a second penalty function associated with the "virtualization mode" category of rendering metadata and for clusters assigned to the "virtualization mode" category (e.g., as indicated in the assignment obtained at block 402).
Alternatively, in some implementations, the process 400 may determine the object-to-cluster gain by minimizing a joint penalty function (e.g., taking into account all rendering metadata categories). In such an embodiment, audio objects associated with a first rendering metadata category may be assigned or rendered to clusters assigned to a second rendering metadata category, wherein the first rendering metadata category is different from the second rendering metadata category. For example, in such embodiments, audio objects associated with the "bypass mode" category of rendering metadata may be assigned and/or rendered to clusters assigned to the "virtualization mode" category of rendering metadata. An example of such a cluster is shown in fig. 1B and described above in connection with this figure.
An example equation representing a joint penalty function is:
E = w′_P·E_P + w′_D·E_D + w′_N·E_N + w′_G·E′_G
In the above equation, E_P, E_D, and E_N represent the first, second, and third penalty terms described at blocks 304, 306, and 308, respectively. Thus, E_P, E_D, and E_N may be determined using the techniques described above in connection with blocks 304, 306, and 308 of fig. 3, while considering audio objects and clusters across all rendering metadata categories. Similar to the weights described above in connection with block 312, w′_P, w′_D, w′_N, and w′_G represent the relative importance of each penalty term to the overall joint penalty function.
E′_G represents: 1) a penalty associated with a mismatch in assigning or rendering an audio object associated with a first rendering metadata category to a cluster assigned to a second rendering metadata category; and 2) a penalty associated with a mismatch between the rendering metadata type of an audio object and the rendering metadata type of the cluster to which the audio object is assigned or rendered (where the rendering metadata types of the audio object and the cluster are in the same rendering metadata category). For example, in the case of headphone rendering, E′_G may indicate a penalty for assigning and/or rendering an audio object associated with the "bypass mode" category of rendering metadata to a cluster assigned to the "virtualization mode" category of rendering metadata. Continuing with this example, E′_G may additionally or alternatively indicate a penalty for assigning an audio object associated with a "near" virtualization type to a cluster primarily associated with a "middle" or "far" virtualization type. For example, E′_G may take the form:

E′_G = Σ_c ḡ_{j,c}² · U_{mode(j),mode(c)}
In the above equation, U represents a matrix indicating the penalty for assigning and/or rendering an audio object j associated with rendering mode(j) to a cluster associated with rendering mode(c). For example, in the case of headphone rendering, examples of modes (e.g., example values of mode(j) and mode(c)) may include "bypass mode," "near" virtualization, "middle" virtualization, and "far" virtualization. In the case of headphone rendering, U may be a 4×4 matrix, where the rows indicate the mode associated with the audio object and the columns indicate the mode associated with the cluster to which the audio object is assigned or rendered. As a more specific example, in some embodiments, the first three rows and the first three columns of U may correspond to the different virtualization types (e.g., "near," "middle," and "far"), and the fourth row and fourth column of U may correspond to the bypass mode. One illustrative example of such a matrix U (consistent with the penalties described below) is:

U = [ 0    0.3  0.7  1
      0.3  0    0.3  1
      0.7  0.3  0    1
      1    1    1    0 ]
As shown in the example matrix U above, audio objects associated with the "bypass mode" category of rendering metadata may be severely penalized when assigned to clusters assigned to the "virtualization mode" category of rendering metadata (as indicated by the 1s in the last row of U). Similarly, audio objects associated with any type within the "virtualization mode" category of rendering metadata (e.g., any of the "near," "middle," and/or "far" virtualization types) may be severely penalized when assigned to clusters assigned to the "bypass mode" category of rendering metadata (as indicated by the 1s in the last column of U). In other words, cross-category assignment or rendering of audio objects is penalized relatively more than assigning or rendering the audio objects to other rendering metadata types within the same rendering metadata category. For example, an audio object associated with a "near" virtualization type may be assigned to a cluster associated with a "middle" virtualization type, where the penalty is 0.3; may be assigned to a cluster associated with a "far" virtualization type, where the penalty is 0.7; and may be assigned to a cross-category cluster associated with "bypass mode" rendering metadata, where the penalty is 1.
At 406, the process 400 may generate an output audio signal based on the object-to-cluster gains (e.g., as determined at block 404) for each audio object. The output audio signal may include each audio object assigned or rendered to one or more clusters according to the object-to-cluster gains determined for each audio object. An example equation for generating the output audio signal of a particular cluster c (generally referred to herein as I_out,c) is:

I_out,c = Σ_j ḡ_{j,c} · I_in,j
As indicated in the above equation, the audio objects j indicated in the input audio signal I_in,j are iterated over, and each is rendered to one or more clusters c based on the object-to-cluster gains ḡ_{j,c}.
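A minimal sketch of this mixing step, assuming per-frame object signals and gains are available as arrays (shapes and names are illustrative):

```python
import numpy as np

def render_clusters(obj_signals, gains):
    """obj_signals: (J, T) array of J object signals over T samples;
    gains: (J, C) array, where gains[j, c] is object j's gain into cluster c.
    Returns a (C, T) array of cluster signals: I_out,c = sum_j g_{j,c} I_in,j.
    """
    return gains.T @ obj_signals
```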
It should be noted that the blocks of process 400 may be repeated for one or more other frames of the input audio signal such that audio objects indicated in the one or more other frames of the input audio signal are assigned or rendered to respective clusters, thereby generating a complete output audio signal comprising a plurality of frames of the input audio signal (e.g., all frames of the input audio signal). In some implementations, the complete output audio signal may be saved, transmitted to a device (e.g., a user device such as a mobile device, television, speaker, etc.) for presentation, etc.
Fig. 5 is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the disclosure. As with the other figures provided herein, the types and numbers of elements shown in fig. 5 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, the apparatus 500 may be configured to perform at least some of the methods disclosed herein. In some implementations, the apparatus 500 may be or include one or more components of a television, an audio system, a mobile device (such as a cellular telephone), a laptop computer, a tablet device, a smart speaker, or another type of device.
According to some alternative embodiments, the apparatus 500 may be or may include a server. In some such examples, the apparatus 500 may be or may include an encoder. Thus, in some cases, the apparatus 500 may be a device configured for use within an audio environment, such as a home audio environment, while in other cases, the apparatus 500 may be a device configured for use in a "cloud", e.g., a server.
In this example, the apparatus 500 includes an interface system 505 and a control system 510. In some implementations, the interface system 505 can be configured to communicate with one or more other devices of the audio environment. In some examples, the audio environment may be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so forth. In some implementations, the interface system 505 can be configured to exchange control information and associated data with an audio device of an audio environment. In some examples, the control information and associated data may relate to one or more software applications being executed by the apparatus 500.
In some implementations, the interface system 505 can be configured to receive a content stream or to provide a content stream. The content stream may include audio data. The audio data may include, but may not be limited to, audio signals. In some cases, the audio data may include spatial data such as channel data and/or spatial metadata. In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 505 may include one or more network interfaces and/or one or more external device interfaces (e.g., one or more Universal Serial Bus (USB) interfaces). According to some embodiments, the interface system 505 may include one or more wireless interfaces. The interface system 505 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. In some examples, interface system 505 may include one or more interfaces between control system 510 and a memory system (such as optional memory system 515 shown in fig. 5). However, in some cases, control system 510 may include a memory system. In some implementations, the interface system 505 may be configured to receive input from one or more microphones in an environment.
For example, control system 510 may include a general purpose single or multi-chip processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 510 may reside in more than one device. For example, in some implementations, a portion of the control system 510 may reside in a device within one of the environments depicted herein, and another portion of the control system 510 may reside in a device outside of the environment, such as in a server, mobile device (e.g., smart phone or tablet computer), or the like. In other examples, a portion of control system 510 may reside in a device within an environment, and another portion of control system 510 may reside in one or more other devices of the environment. For example, a portion of control system 510 may reside in a device (e.g., a server) that implements a cloud-based service, and another portion of control system 510 may reside in another device (e.g., another server, a memory device, etc.) that implements the cloud-based service. In some examples, the interface system 505 may also reside in more than one device.
In some implementations, the control system 510 may be configured to at least partially perform the methods disclosed herein. According to some examples, control system 510 may be configured to implement a method of clustering audio objects.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as the memory devices described herein, including but not limited to Random Access Memory (RAM) devices, read Only Memory (ROM) devices, and the like. For example, one or more non-transitory media may reside in the optional memory system 515 and/or the control system 510 shown in fig. 5. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. The software may, for example, include instructions for determining assignment of clusters to various rendering metadata categories, assigning or rendering audio objects to assigned clusters, and so forth. The software may be executable by one or more components of a control system, such as control system 510 of fig. 5, for example.
In some examples, the apparatus 500 may include an optional microphone system 520 shown in fig. 5. Optional microphone system 520 may include one or more microphones. In some implementations, one or more microphones may be part of or associated with another device (e.g., a speaker of a speaker system, a smart audio device, etc.). In some examples, the apparatus 500 may not include the microphone system 520. However, in some such embodiments, the apparatus 500 may still be configured to receive microphone data for one or more microphones in an audio environment via the interface system 505. In some such embodiments, a cloud-based embodiment of the apparatus 500 may be configured to receive microphone data, or noise indicia corresponding at least in part to microphone data, from one or more microphones in an audio environment via the interface system 505.
According to some embodiments, the apparatus 500 may include an optional loudspeaker system 525 shown in fig. 5. Optional loudspeaker system 525 may include one or more loudspeakers, which may also be referred to herein as "speakers" or, more generally, as "audio reproduction transducers." In some examples (e.g., cloud-based implementations), the apparatus 500 may not include the loudspeaker system 525. In some embodiments, the apparatus 500 may include headphones. Headphones may be connected or coupled to the apparatus 500 via a headphone jack or via a wireless connection (e.g., Bluetooth).
Aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer-readable medium (e.g., disk) storing code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems may be or include a programmable general purpose processor, digital signal processor, or microprocessor programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including embodiments of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, memory, and a processing subsystem programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) Digital Signal Processor (DSP) that is configured (e.g., programmed or otherwise configured) to perform the required processing on one or more audio signals, including performance of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general-purpose processor (e.g., a Personal Computer (PC) or other computer system or microprocessor, which may include an input device and memory) programmed with software or firmware and/or otherwise configured to perform any of a variety of operations, including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system may be implemented as a general-purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system may also include other elements (e.g., one or more loudspeakers and/or one or more microphones). A general-purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or keyboard), memory, and a display device.
Another aspect of the disclosure is a computer-readable medium (e.g., a disk or other tangible storage medium) storing code (e.g., an encoder executable to perform one or more examples of the disclosed methods or steps thereof) for performing one or more examples of the disclosed methods or steps thereof.
While specific embodiments of, and applications for, the present disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many more modifications than mentioned herein are possible without departing from the scope of the disclosure described and claimed herein. It is to be understood that while certain forms of the disclosure have been illustrated and described, the disclosure is not to be limited to the specific embodiments described and illustrated or to the specific methods described.
Enumerated example embodiments:
Example 1. A method for clustering audio objects, the method comprising: identifying a plurality of audio objects, wherein the audio objects are associated with metadata indicating spatial location information and rendering metadata; assigning audio objects of the plurality of audio objects to rendering metadata categories of a plurality of rendering metadata categories, wherein at least one rendering metadata category comprises a plurality of rendering metadata types to be maintained; determining an allocation of a plurality of audio object clusters to each rendering metadata category, wherein an audio object cluster comprises one or more audio objects of the plurality of audio objects having similar attributes; and rendering audio objects of the plurality of audio objects to the allocated plurality of audio object clusters based on the metadata indicating spatial location information and based on the assignment of the audio objects to the rendering metadata categories.
Example 2. The method of example 1, wherein the rendering metadata category comprises a bypass mode category and a virtualization category.
Example 3. The method of example 2, wherein the plurality of rendering metadata types included in the virtualization category includes a plurality of virtualization types, each representing a distance from a head center to the audio object.
Example 4. The method of example 1, wherein the rendering metadata category comprises one of a region category or a capture category.
Example 5. The method of any of examples 1 to 4, wherein audio objects assigned to a first rendering metadata category are prohibited from being assigned to an audio object cluster, of the plurality of audio object clusters, that is assigned to a second rendering metadata category.
Example 6. The method of any of examples 1-5, further comprising transmitting an audio signal comprising spatial information and gain information associated with each of the allocated plurality of audio object clusters, wherein the audio signal has less spatial distortion than an audio signal comprising spatial information and gain information associated with audio object clusters in which an audio object assigned to the first rendering metadata category is assigned to an audio object cluster associated with the second rendering metadata category.
Example 7. The method of any one of examples 1 to 6, wherein determining the allocation of the plurality of audio object clusters to each rendering metadata category comprises: (i) determining an initial allocation of an initial plurality of audio object clusters to each rendering metadata category; (ii) assigning the audio objects to the initial plurality of audio object clusters based on the metadata indicating spatial location information and based on the assignment of the audio objects to the rendering metadata categories; (iii) determining, for each rendering metadata category, a category cost for assigning the audio object to the initial plurality of audio object clusters; (iv) determining an updated allocation of the initial plurality of audio object clusters to each rendering metadata category based at least in part on the category cost of each rendering metadata category; and (v) repeating (ii) through (iv) until a stopping criterion is reached.
Example 8. The method of example 7, wherein determining the category cost to assign the audio object to the initial plurality of audio object clusters is based on a location of an audio object cluster assigned to the rendering metadata category and a location of an audio object assigned to an audio object cluster assigned to the rendering metadata category.
Example 9. The method of example 8, wherein the category cost is based on a left-to-right placement of the audio object relative to a left-to-right placement of an audio object cluster to which the audio object has been assigned.
Example 10. The method of any of examples 7-9, wherein determining the category cost to assign the audio object to the initial plurality of audio object clusters is based on a loudness of the audio object.
Example 11. The method of any of examples 7-10, wherein determining the category cost to assign the audio object to the initial plurality of audio object clusters is based on a distance of an audio object to an audio object cluster to which the audio object has been assigned.
Example 12. The method of any of examples 7-11, wherein determining the category cost to assign the audio object to the initial plurality of audio object clusters is based on a similarity of a rendering metadata type of an audio object to a rendering metadata type of an audio object cluster to which the audio object has been assigned.
Example 13. The method of any one of examples 7 to 12, further comprising determining a global cost based on the category cost for each rendering metadata category, wherein the updated allocation of the initial plurality of audio object clusters is based on the global cost.
Example 14. The method of example 13, wherein repeating (ii) through (iv) until the stopping criterion is reached comprises determining that a minimum of the global cost has been achieved.
Example 15. The method of any of examples 7-14, wherein determining the updated allocation includes changing a number of audio object clusters allocated to at least one of the plurality of rendering metadata categories.
Example 16. The method of example 15, further comprising determining a global cost based on the category cost for each rendering metadata category, wherein the number of audio object clusters is determined based on the global cost.
Example 17. The method of example 16, wherein determining the number of audio object clusters comprises minimizing the global cost subject to a constraint on the number of audio object clusters that indicates the maximum number of audio object clusters that can be added.
Example 18. The method of any of examples 1 to 17, wherein rendering audio objects of the plurality of audio objects to the allocated plurality of audio object clusters comprises determining, for each of the plurality of audio objects, an object-to-cluster gain when rendered to one or more audio object clusters assigned to the rendering metadata category to which the audio object is assigned.
Example 19. The method of example 18, wherein the object-to-cluster gains of audio objects assigned to a first rendering metadata category of the plurality of rendering metadata categories are determined independently of the object-to-cluster gains of audio objects assigned to a second rendering metadata category of the plurality of rendering metadata categories.
Example 20. The method of example 18, wherein the object-to-cluster gains of audio objects assigned to a first rendering metadata category of the plurality of rendering metadata categories are determined in conjunction with the object-to-cluster gains of audio objects assigned to a second rendering metadata category of the plurality of rendering metadata categories.
Example 21. The method of any of examples 1-20, further comprising transmitting an audio signal comprising spatial information and gain information associated with each of the allocated plurality of audio object clusters, wherein transmitting the audio signal requires less bandwidth than transmitting an audio signal comprising spatial information and gain information associated with each of the plurality of audio objects.
Example 22. An apparatus configured to implement the method of any one of examples 1-21.
Example 23. A system configured to implement the method of any of examples 1-21.
Example 24. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of examples 1-21.
Claims (24)
1. A method for clustering audio objects, the method comprising:
identifying a plurality of audio objects, wherein an audio object of the plurality of audio objects is associated with respective metadata indicating respective spatial location information and respective rendering metadata;
assigning audio objects of the plurality of audio objects to rendering metadata categories of a plurality of rendering metadata categories, wherein at least one rendering metadata category comprises a plurality of rendering metadata types to be maintained;
determining an allocation of a plurality of audio object clusters to each rendering metadata category, wherein an audio object cluster comprises one or more audio objects of the plurality of audio objects having similar attributes;
and rendering audio objects of the plurality of audio objects to the allocated plurality of audio object clusters based on the metadata indicating spatial location information and based on the assignment of the audio objects to the rendering metadata categories.
2. The method of claim 1, wherein the rendering metadata categories include a bypass mode category and a virtualization category.
3. The method of claim 2, wherein the plurality of rendering metadata types included in the virtualization category includes a plurality of virtualization types, each representing a distance from a head center to the audio object.
4. The method of claim 1, wherein the rendering metadata category comprises one of a region category or a capture category.
5. The method of any of claims 1 to 4, wherein audio objects assigned to a first rendering metadata category are prohibited from being assigned to an audio object cluster, of the plurality of audio object clusters, that is assigned to a second rendering metadata category.
6. The method of any of claims 1-5, further comprising transmitting an audio signal comprising spatial information and gain information associated with each of the allocated plurality of audio object clusters, wherein the audio signal has less spatial distortion than an audio signal comprising spatial information and gain information associated with an audio object cluster in which an audio object assigned to the first rendering metadata category is assigned to an audio object cluster associated with the second rendering metadata category.
7. The method of any of claims 1 to 6, wherein determining the allocation of the plurality of clusters of audio objects to each rendering metadata category comprises:
(i) Determining an initial allocation of an initial plurality of audio object clusters to each rendering metadata category;
(ii) Assigning the audio objects to the initial plurality of audio object clusters based on the metadata indicating spatial location information and based on the assignment of the audio objects to the rendering metadata categories;
(iii) Determining, for each rendering metadata category, a category cost for assigning the audio object to the initial plurality of audio object clusters;
(iv) Determining an updated allocation of the initial plurality of audio object clusters to each rendering metadata category based at least in part on the category cost of each rendering metadata category; and
(v) repeating (ii) through (iv) until a stopping criterion is reached.
8. The method of claim 7, wherein determining the category cost to assign the audio object to the initial plurality of audio object clusters is based on a location of an audio object cluster assigned to the rendering metadata category and a location of an audio object assigned to an audio object cluster assigned to the rendering metadata category.
9. The method of claim 8, wherein the category cost is based on a left-to-right placement of the audio object relative to a left-to-right placement of an audio object cluster to which the audio object has been assigned.
10. The method of any of claims 7 to 9, wherein determining the category cost to assign the audio object to the initial plurality of audio object clusters is based on a loudness of the audio object.
11. The method of any of claims 7 to 10, wherein determining the category cost to assign the audio object to the initial plurality of audio object clusters is based on a distance of an audio object to an audio object cluster to which the audio object has been assigned.
12. The method of any of claims 7 to 11, wherein determining the category cost to assign the audio object to the initial plurality of audio object clusters is based on a similarity of a rendering metadata type of an audio object to a rendering metadata type of an audio object cluster to which the audio object has been assigned.
13. The method of any of claims 7 to 12, further comprising determining a global cost based on the category cost for each rendering metadata category, wherein the updated allocation of the initial plurality of audio object clusters is based on the global cost.
14. The method of claim 13, wherein repeating (ii) through (iv) until the stopping criterion is reached comprises determining that a minimum of the global cost has been achieved.
15. The method of any of claims 7 to 14, wherein determining the updated allocation comprises changing a number of audio object clusters allocated to at least one of the plurality of rendering metadata categories.
16. The method of claim 15, further comprising determining a global cost based on the category cost for each rendering metadata category, wherein the number of audio object clusters is determined based on the global cost.
17. The method of claim 16, wherein determining the number of audio object clusters comprises minimizing the global cost subject to a constraint on the number of audio object clusters that indicates the maximum number of audio object clusters that can be added.
18. The method of any of claims 1 to 17, wherein rendering audio objects of the plurality of audio objects to the assigned plurality of audio object clusters comprises determining, for each of the plurality of audio objects, an object-to-cluster gain when rendered to one or more audio object clusters assigned to the rendering metadata category to which the audio object is assigned.
19. The method of claim 18, wherein the object-to-cluster gains of audio objects assigned to a first rendering metadata category of the plurality of rendering metadata categories are determined independently of the object-to-cluster gains of audio objects assigned to a second rendering metadata category of the plurality of rendering metadata categories.
20. The method of claim 18, wherein the object-to-cluster gains of audio objects assigned to a first rendering metadata category of the plurality of rendering metadata categories are determined in conjunction with the object-to-cluster gains of audio objects assigned to a second rendering metadata category of the plurality of rendering metadata categories.
21. The method of any of claims 1-20, further comprising transmitting an audio signal comprising spatial information and gain information associated with each of the allocated plurality of audio object clusters, wherein transmitting the audio signal requires less bandwidth than transmitting an audio signal comprising spatial information and gain information associated with each of the plurality of audio objects.
22. An apparatus configured to implement the method of any one of claims 1 to 21.
23. A system configured to implement the method of any one of claims 1 to 21.
24. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of claims 1-21.
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2021077110 | 2021-02-20 | ||
CNPCT/CN2021/077110 | 2021-02-20 | ||
US63/165,220 | 2021-03-24 | ||
US202163202227P | 2021-06-02 | 2021-06-02 | |
US63/202,227 | 2021-06-02 | ||
EP21178179.4 | 2021-06-08 | ||
PCT/US2022/016388 WO2022177871A1 (en) | 2021-02-20 | 2022-02-15 | Clustering audio objects |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116965062A (en) | 2023-10-27
Family
ID=88460561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202280015933.0A Pending CN116965062A (en) | 2021-02-20 | 2022-02-15 | Clustering audio objects |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116965062A (en) |
2022-02-15: CN CN202280015933.0A patent/CN116965062A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |