CN106385660B - Processing object-based audio signals - Google Patents


Info

Publication number
CN106385660B
CN106385660B (application CN201510484949.8A)
Authority
CN
China
Prior art keywords
cluster
audio
gains
gain
positions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510484949.8A
Other languages
Chinese (zh)
Other versions
CN106385660A (en)
Inventor
陈连武
芦烈
J·布里巴特
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp filed Critical Dolby Laboratories Licensing Corp
Priority to CN201510484949.8A priority Critical patent/CN106385660B/en
Priority to PCT/US2016/045512 priority patent/WO2017027308A1/en
Priority to US15/749,750 priority patent/US10277997B2/en
Priority to EP16751763.0A priority patent/EP3332557B1/en
Publication of CN106385660A publication Critical patent/CN106385660A/en
Application granted granted Critical
Publication of CN106385660B publication Critical patent/CN106385660B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

Example embodiments disclosed herein relate to audio signal processing, where the audio signal has a plurality of audio objects. A method of processing an audio signal is disclosed. The method comprises obtaining an object position for each audio object, and determining cluster positions for grouping the audio objects into clusters based on the object positions, a plurality of object-to-cluster gains, and a set of metrics. The metrics indicate a quality of the cluster positions and a quality of the object-to-cluster gains; each of the cluster positions is a centroid of a respective one of the clusters, and each of the object-to-cluster gains defines a proportion of the respective audio object in one of the clusters. The method also includes determining the object-to-cluster gains based on the object positions, the cluster positions, and the set of metrics, and generating a cluster signal based on the determined cluster positions and object-to-cluster gains. Corresponding systems and computer program products are also disclosed.

Description

Processing object-based audio signals
Technical Field
Example embodiments disclosed herein relate generally to object-based audio processing and, more particularly, to a method and system for generating a cluster signal from an object-based audio signal.
Background
Traditionally, audio content in a multi-channel format (e.g., 5.1, 7.1) is created by mixing different audio signals in a studio, or generated by simultaneously recording acoustic signals in a real environment. Recently, object-based audio content has become increasingly popular because it carries audio objects and audio beds separately, so that the content can be rendered with improved accuracy compared with conventional rendering methods. Audio objects are individual audio elements that exist for a defined period of time and carry spatial information describing, for example, the position, velocity, and size of each object in the form of metadata. An audio bed, or bed, is an audio channel intended to be reproduced at a predefined, fixed loudspeaker position.
For example, a cinema soundtrack may include many different sound elements corresponding to on-screen images, dialog, noise, and sound effects that emanate from different locations on the screen and combine with background music and ambient effects to create the overall listening experience. Accurate playback requires that sounds be reproduced in a way that corresponds as closely as possible to the on-screen source location, intensity, movement, and depth.
During delivery of the audio signal, the beds and the objects may be transmitted separately and then used by a spatial rendering system to recreate the artistic intent with a plurality of speakers at known physical locations. In some cases, tens or even hundreds of individual audio objects may be included in the audio content. As a result, the advent of such object-based audio data has significantly increased the complexity of rendering audio data within playback systems.
The large number of audio signals present in object-based content presents new challenges for the encoding and distribution of such content. In some distribution and delivery systems, enough bandwidth may be available to send all audio beds and objects with little or no audio compression. However, in some cases, such as Blu-ray disc, broadcast (cable, satellite, and terrestrial), mobile (3G and 4G), and over-the-top (OTT) distribution, the available bandwidth cannot carry all of the bed and object information created by the audio mixer. Although audio coding (lossy or lossless) may be applied to reduce the required bandwidth, it may not be sufficient, particularly in very constrained networks such as mobile 3G and 4G networks.
Some existing approaches utilize clustering of audio objects in order to reduce the number of input objects and beds to a smaller set of output clusters. Thereby, computational complexity and memory requirements are reduced. However, accuracy may be compromised because existing methods only assign objects in a relatively crude manner.
Disclosure of Invention
Example embodiments disclosed herein propose a method and system for processing an audio signal for reducing the number of audio objects by assigning these objects into clusters, while maintaining performance in terms of accuracy of spatial audio reproduction.
In one aspect, example embodiments disclosed herein provide a method of processing an audio signal having a plurality of audio objects. The method comprises obtaining an object position for each audio object, and determining cluster positions for grouping the audio objects into clusters based on the object positions, a plurality of object-to-cluster gains, and a set of metrics. The metrics indicate a quality of the cluster positions and a quality of the object-to-cluster gains; each of the cluster positions is a centroid of a respective one of the clusters, and each of the object-to-cluster gains defines a proportion of the respective audio object in one of the clusters. The method also includes determining the object-to-cluster gains based on the object positions, the cluster positions, and the set of metrics, and generating a cluster signal based on the determined cluster positions and object-to-cluster gains.
In another aspect, example embodiments disclosed herein provide a system for processing an audio signal having a plurality of audio objects. The system includes an object position acquisition unit configured to acquire an object position for each audio object; and a cluster position determination unit configured to determine cluster positions for grouping the audio objects into clusters based on the object positions, a plurality of object-to-cluster gains, and a set of metrics. The metrics indicate a quality of the cluster positions and a quality of the object-to-cluster gains; each of the cluster positions is a centroid of a respective one of the clusters, and each of the object-to-cluster gains defines a proportion of the respective audio object in one of the clusters. The system further includes an object-to-cluster gain determination unit configured to determine the object-to-cluster gains based on the object positions, the cluster positions, and the set of metrics; and a cluster signal generation unit configured to generate a cluster signal based on the determined cluster positions and object-to-cluster gains.
From the following description, it will be understood that object-based audio signals containing audio objects and audio beds are greatly compressed for data streaming, and that the computational and bandwidth requirements for those signals are thus significantly reduced. The accurate generation of a small number of clusters enables the reproduction of an auditory scene in which the listener can correctly perceive the position of each audio object, so that immersive reproduction can be achieved. At the same time, the reduced data-rate requirements resulting from efficient compression allow fidelity to be less compromised on any known playback system, such as speaker arrays and headphones.
Drawings
The foregoing and other objects, features and advantages of the example embodiments disclosed herein will be more readily understood from the following detailed description taken in conjunction with the accompanying drawings. The exemplary embodiments disclosed herein are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings and in which:
fig. 1 illustrates a flow chart of a method of processing an audio signal according to an example embodiment;
fig. 2 illustrates an example flow diagram of object-based audio signal processing according to an example embodiment;
FIG. 3 illustrates a system for processing an audio signal according to an example embodiment; and
FIG. 4 illustrates a block diagram of an example computer system suitable for implementing the example embodiments disclosed herein.
Throughout the drawings, the same or corresponding reference numerals designate the same or corresponding parts.
Detailed Description
The principles of the example embodiments disclosed herein will now be described with reference to various example embodiments shown in the drawings. It should be understood that the description of these embodiments is merely intended to enable those skilled in the art to better understand and further practice the example embodiments disclosed herein, and is not intended to limit the scope in any way.
The object based audio signals are intended to be processed by a system capable of processing audio objects and their corresponding metadata. Information such as location, speed, width, etc. is provided within the metadata. These object based audio signals are typically produced by a mixer in a studio and adapted to be rendered by different systems with appropriate processors. However, since the embodiments disclosed herein focus mainly on how to allocate objects into a reduced number of clusters while maintaining performance in terms of accuracy of spatial audio reproduction, the mixing and rendering process is not specifically described.
Throughout the specification, it may be assumed that the audio signal is divided into individual frames that are subjected to analysis. Such segmentation may be applied to the time-domain waveform, or a filter bank or any other transform domain suitable for the example embodiments disclosed herein may be used.
Fig. 1 illustrates a flow chart of a method 100 of processing an audio signal according to an example embodiment. In step S101, an object position for each of the audio objects is acquired. Object-based audio objects typically contain metadata that provides positional information about the object. Such information is useful for various processing techniques in cases where object-based audio content is to be presented with greater accuracy.
In step S102, cluster positions for grouping the audio objects into clusters are determined based on the object positions, a plurality of object-to-cluster gains, and a set of metrics. The metrics indicate a quality of the determined cluster positions and a quality of the determined object-to-cluster gains. Such quality may be represented by a cost function, for example, as will be described below. A cluster position is the centroid of a cluster, i.e., of a set of audio objects that are close to each other. Initial cluster positions may be selected in different ways, including, for example: selecting cluster positions randomly; applying an initial clustering (e.g., k-means clustering) to the plurality of audio objects to obtain cluster positions; and determining the cluster positions for the current time frame of the audio signal based on the cluster positions of a previous time frame. Each object-to-cluster gain defines the proportion with which an audio object is grouped into a corresponding cluster, and the gains together indicate how the audio objects are distributed over the clusters.
In step S103, the object-to-cluster gains are determined based on the object positions, the cluster positions, and the set of metrics. Each audio object may be assigned an object-to-cluster gain per cluster, which acts as a mixing coefficient. In other words, if the gain of a particular audio object with respect to a cluster is relatively large, the object is likely to be spatially near that cluster. Conversely, a larger gain of an audio object with respect to some clusters implies that the gains of the same audio object with respect to the other clusters will be relatively small.
Steps S102 and S103 show that the determination of the cluster positions is based in part on the object-to-cluster gains, and that the determination of the object-to-cluster gains is based in part on the cluster positions, meaning that the two determination steps are interdependent. The quality of these determinations may be indicated by a value associated with the metrics. In general, the determinations are repeated while the value falls or converges toward a predetermined value, until the quality is sufficiently satisfactory. A predefined threshold may be set for comparison with the value associated with the metrics. As a result, in some embodiments, the determination of the cluster positions and of the object-to-cluster gains is performed alternately until the value is less than the predefined threshold.
Alternatively, another predefined threshold may be set for comparison with the rate of change of the value associated with the metrics. As a result, in some embodiments, the determinations of the cluster positions and of the object-to-cluster gains are repeated until the rate of change of the value associated with the metrics (e.g., its rate of decline) is less than this predefined threshold.
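By way of illustration, the two stopping criteria above can be sketched in a few lines of Python; the function name and threshold values here are illustrative assumptions, not values taken from this disclosure:

```python
def should_stop(history, abs_threshold=1e-4, rate_threshold=1e-6):
    """Stop the alternating determination when the metric value itself is
    small enough, or when it has essentially stopped decreasing.
    Threshold values are illustrative, not taken from the disclosure."""
    current = history[-1]
    if current < abs_threshold:
        return True
    if len(history) >= 2 and abs(history[-2] - current) < rate_threshold:
        return True
    return False

print(should_stop([0.5, 0.2]))           # still improving -> False
print(should_stop([0.5, 0.00005]))       # value below threshold -> True
print(should_stop([0.2, 0.2000000001]))  # decline has stalled -> True
```

Either criterion alone, or both in combination, can serve as the test at the end of each alternation.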
In an embodiment, a cost function may be used to represent the value associated with the metrics, and may thus reflect the quality of the determined cluster positions and of the determined object-to-cluster gains. The calculation of the cost function is explained in detail below.
The cost function comprises several cumulative terms, each accounting for one metric of the clustering process. In one embodiment, the set of metrics may include: (A) a position error between the position of an audio object as reconstructed from the cluster signal and its position in the original audio signal; (B) a distance error between the cluster positions and the object positions; (C) the deviation of the sum of the object-to-cluster gains from unity; (D) a presentation error between presenting the cluster signal to one or more playback systems and presenting the audio objects of the audio signal to those playback systems; and (E) an inter-frame inconsistency of a variable between the current time frame and a previous time frame. Usefully, the cost function compares the signals before and after the clustering process, i.e., before and after the audio objects are grouped into clusters. The cost function can therefore serve as an effective index of the quality of the clustering.
For metric (A), since the input audio objects can be reconstructed from the output clusters, the error between the original object positions and the reconstructed object positions can be used to measure the spatial position difference of the objects, describing how accurately the clustering process preserves the position information.
The term "position error" relates the spatial position of an audio object before the clustering process to its spatial position after its signal has been distributed across the output clusters at positions \vec{p}_c. In particular, when the original position is represented by a vector \vec{p}_o (which may be expressed, for example, in three Cartesian coordinates), the reconstructed position \hat{\vec{p}}_o can be expressed as an amplitude-panned source:

\hat{\vec{p}}_o = \frac{\sum_c g_{o,c}\, \vec{p}_c}{\sum_c g_{o,c}} \qquad (1)

Subsequently, the cost E_P associated with the position error can be expressed as:

E_P = \sum_o w_o \left\| \hat{\vec{p}}_o - \vec{p}_o \right\|^2 \qquad (2)

where w_o represents the weight of the o-th object, which may be the energy, loudness, or partial loudness of the object, and g_{o,c} represents the gain with which the o-th object is rendered to the c-th cluster, i.e., the object-to-cluster gain.
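By way of illustration, equations (1) and (2) can be sketched in NumPy as follows; the normalized amplitude-panning reconstruction follows the form above, and all names are illustrative:

```python
import numpy as np

def reconstructed_positions(G, P_C):
    """Eq. (1): amplitude-panned reconstruction of the object positions.
    G: (O, C) object-to-cluster gains; P_C: (C, 3) cluster positions."""
    num = G @ P_C                       # sum_c g_{o,c} * p_c
    den = G.sum(axis=1, keepdims=True)  # sum_c g_{o,c}
    return num / den

def position_error(G, P_C, P_O, w):
    """Eq. (2): E_P = sum_o w_o * ||p_hat_o - p_o||^2."""
    diff = reconstructed_positions(G, P_C) - P_O
    return float(np.sum(w * np.sum(diff**2, axis=1)))

# An object assigned entirely to a co-located cluster has zero position error.
P_C = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
P_O = np.array([[1.0, 0.0, 0.0]])
G = np.array([[0.0, 1.0]])
w = np.array([1.0])
print(position_error(G, P_C, P_O, w))  # -> 0.0
```

Splitting the same object evenly across the two clusters (gains 0.5 and 0.5) moves the reconstruction to the midpoint and yields E_P = 0.25.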
For metric (B), since rendering an audio object into clusters that are far from it may introduce large timbre changes, the object-to-cluster distance may be used to measure timbre change. A change in timbre is expected when an audio object is represented not by a point source (a single cluster) but by a virtual (phantom) source panned across several clusters. It is well known that amplitude-panned sources may have a different timbre than point sources, due to the comb-filter interactions that can occur when one and the same signal is reproduced by two or more (virtual) loudspeakers.
The term "distance error", denoted E_D, reflects, via the distance between the object position \vec{p}_o and the cluster positions \vec{p}_c, the increase in cost incurred if an audio object is represented by a cluster that is far from the original object position:

E_D = \sum_o w_o \sum_c g_{o,c}^2 \left\| \vec{p}_c - \vec{p}_o \right\|^2 \qquad (3)
for metric (C), the object-to-cluster gain normalization error can be used to measure the energy (loudness) change before and after the clustering process.
The term "deviation", denoted E_N, relates to gain normalization, or more specifically, to the deviation from unity (one) of the sum of an object's gains over the cluster centroids:

E_N = \sum_o w_o \left( 1 - \sum_c g_{o,c} \right)^2 \qquad (4)
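By way of illustration, the distance error and the gain-normalization deviation can be sketched in NumPy; the squared-gain weighting in the distance error is an assumption for this sketch, since the corresponding equation appears only as an image in the original:

```python
import numpy as np

def distance_error(G, P_C, P_O, w):
    """Distance-error term: gain-weighted object-to-cluster distances.
    The squared-gain weighting is an assumption made for this sketch."""
    # d2[o, c] = ||p_c - p_o||^2
    d2 = np.sum((P_C[None, :, :] - P_O[:, None, :])**2, axis=2)
    return float(np.sum(w * np.sum(G**2 * d2, axis=1)))

def normalization_error(G, w):
    """Eq. (4): deviation of each object's gain sum from unity."""
    return float(np.sum(w * (1.0 - G.sum(axis=1))**2))

P_C = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
P_O = np.array([[1.0, 0.0, 0.0]])
w = np.array([1.0])
G = np.array([[0.0, 1.0]])  # fully assigned to the co-located cluster
print(distance_error(G, P_C, P_O, w))  # -> 0.0
print(normalization_error(G, w))       # -> 0.0
```

Splitting the object evenly across the two clusters keeps the normalization error at zero but incurs a nonzero distance error, illustrating the trade-off between the terms.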
for metric (D), one or more reference playback systems for that metric (e.g., mono quality on a 7.1.4 speaker playback system) may need to be specified since there are different rendering outputs for different playback systems. By comparing the difference between the rendered output of the original object and the rendered output of the cluster on a specified reference playback system, the mono quality of the clustered results can be measured.
The term "presentation error", denoted E_R, is an error defined with respect to a reference playback system, which may be binaural, 5.1, 7.1.4, 9.1.6, etc.; it measures the difference between rendering the original objects to the reference playback system and rendering the clusters to that system:

E_R = \sum_o w_o \sum_s n_s \left( \sum_c g_{o,c}\, g_{c,s} - g_{o,s} \right)^2 \qquad (5)

where the per-channel normalization n_s may take the form

n_s = \frac{1}{\sum_o w_o\, g_{o,s}^2 + a} \qquad (6)

Here g_{o,s} represents the gain for rendering the o-th object to the s-th output channel, g_{c,s} represents the gain for rendering the c-th cluster to the s-th output channel, and n_s normalizes the presentation differences so that the presentation errors on each channel are comparable. The parameter a avoids introducing too large presentation differences when the signal on the reference playback system is very small or even zero.

In one embodiment, the summation over the loudspeakers with subscript s may be performed over the loudspeakers of one particular predetermined loudspeaker layout. Alternatively, clusters and objects are presented simultaneously to a larger set of speakers covering a variety of speaker layouts. For example, if one layout is a 5-channel layout and a second layout is a two-channel layout, both clusters and objects can be rendered in parallel to the 5-channel and the two-channel layout. The error term E_R is then estimated over all 7 loudspeakers to jointly optimize the error for both loudspeaker layouts simultaneously.
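By way of illustration, the presentation error over such a stacked joint layout can be sketched in NumPy as follows; the per-channel normalization n_s is supplied by the caller, and all names are illustrative:

```python
import numpy as np

def rendering_error(G_OC, G_CS, G_OS, w, n_s):
    """Compare direct object-to-speaker rendering with object-via-cluster
    rendering on a reference layout. n_s is a per-channel normalization
    supplied by the caller (its exact form is an assumption here)."""
    diff = G_OC @ G_CS - G_OS  # (O, S) rendering difference per channel
    return float(np.sum(w[:, None] * n_s[None, :] * diff**2))

# Joint optimization over two layouts: stack a 5-channel and a 2-channel
# layout into a single 7-column speaker-gain matrix, as described above.
G_OC = np.array([[1.0, 0.0]])                                   # 1 object, 2 clusters
G_CS = np.hstack([np.full((2, 5), 0.2), np.full((2, 2), 0.5)])  # (2, 7)
G_OS = np.hstack([np.full((1, 5), 0.2), np.full((1, 2), 0.5)])  # (1, 7)
w = np.array([1.0])
n_s = np.ones(7)
print(rendering_error(G_OC, G_CS, G_OS, w, n_s))  # -> 0.0
```

Because the object is assigned entirely to a cluster whose speaker gains match its own, the rendering difference, and hence the error, is zero on all seven channels.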
For metric (E), since the clustering process is performed frame by frame, the inter-frame inconsistency of certain variables of the clustering process, such as the object-to-cluster gains, the cluster positions, and the reconstructed object positions, can be used to measure temporal consistency. In one embodiment, the inter-frame inconsistency of the reconstructed object positions may be used to measure the temporal smoothness of the clustering results.
The term "inter-frame inconsistency", denoted E_C, relates to the frame-to-frame inconsistency of a particular variable, here the reconstructed object position. Suppose \vec{p}_o^{(t)} and \vec{p}_o^{(t-1)} are the original object positions in frames t and t-1, \hat{\vec{p}}_o^{(t)} and \hat{\vec{p}}_o^{(t-1)} are the reconstructed object positions in frames t and t-1, and \tilde{\vec{p}}_o^{(t)} is the target reconstructed object position in frame t. As defined by equation (1) above, the reconstructed position \hat{\vec{p}}_o can be expressed as an amplitude-panned source.

To preserve inter-frame smoothness, the target reconstructed object position in frame t may be expressed as the combination of the reconstructed object position in frame t-1 and the offset \Delta_o = \vec{p}_o^{(t)} - \vec{p}_o^{(t-1)} of the object from frame t-1 to frame t:

\tilde{\vec{p}}_o^{(t)} = \hat{\vec{p}}_o^{(t-1)} + \Delta_o \qquad (7)

Subsequently, the cost E_C associated with the inter-frame inconsistency can be expressed as:

E_C = \sum_o w_o \left\| \hat{\vec{p}}_o^{(t)} - \tilde{\vec{p}}_o^{(t)} \right\|^2 \qquad (8)
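By way of illustration, equations (7) and (8) can be sketched in NumPy as follows (all names are illustrative):

```python
import numpy as np

def target_positions(P_hat_prev, P_O_cur, P_O_prev):
    """Eq. (7): previous reconstruction plus the object's own movement."""
    return P_hat_prev + (P_O_cur - P_O_prev)

def interframe_cost(P_hat_cur, P_hat_prev, P_O_cur, P_O_prev, w):
    """Eq. (8): weighted squared deviation from the smooth target."""
    diff = P_hat_cur - target_positions(P_hat_prev, P_O_cur, P_O_prev)
    return float(np.sum(w * np.sum(diff**2, axis=1)))

# A static object whose reconstruction also stays put incurs zero cost.
P_O_prev = P_O_cur = np.array([[0.3, 0.4, 0.0]])
P_hat_prev = P_hat_cur = np.array([[0.25, 0.4, 0.0]])
w = np.array([1.0])
print(interframe_cost(P_hat_cur, P_hat_prev, P_O_cur, P_O_prev, w))  # -> 0.0
```

Note that the target tracks the object's own movement, so a reconstruction that moves together with a moving object also incurs zero cost; only movement unexplained by the object itself is penalized.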
the above metrics may be measured independently or as an overall cost that is a combination of the above described metrics. In one embodiment, the overall cost may be a weighted sum of cost terms (a) through (E):
E=αPEPDEDNENRERCEC(9)
in another embodiment, the total cost may also be the maximum of the cost term:
E=max{αPEP,αDED,αNEN,αRER,αCEC} (10)
α thereinP,αD,αN,αR,αCWeights for cost terms (A) through (E) are expressed.
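By way of illustration, the two combination rules (9) and (10) can be sketched as follows (metric names and weight values are illustrative):

```python
def overall_cost(costs, alphas, mode="sum"):
    """Combine per-metric costs as a weighted sum, eq. (9), or as the
    maximum of the weighted terms, eq. (10)."""
    weighted = [alphas[k] * costs[k] for k in costs]
    return sum(weighted) if mode == "sum" else max(weighted)

# Illustrative values only.
costs = {"P": 0.25, "D": 0.5, "N": 0.0, "R": 1.0, "C": 0.125}
alphas = {"P": 1.0, "D": 0.5, "N": 2.0, "R": 1.0, "C": 4.0}
print(overall_cost(costs, alphas, "sum"))  # -> 2.0
print(overall_cost(costs, alphas, "max"))  # -> 1.0
```

The weighted sum trades the terms off against each other, while the maximum variant forces the worst-behaved term to dominate the optimization.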
The gains g_{o,c} and the positions \vec{p}_o, \hat{\vec{p}}_o, and \vec{p}_c can be written as matrices (equations (11) through (14)): an O \times C gain matrix G_{OC} = [g_{o,c}], an O \times 3 matrix P_O of original object positions, an O \times 3 matrix \hat{P}_O of reconstructed object positions, and a C \times 3 matrix P_C of cluster positions. The object weights can be written as a diagonal matrix:

W_O = \mathrm{diag}(w_1, \ldots, w_O) \qquad (15)

Subsequently, the different cost function terms can be written as follows:

E_P = \mathrm{tr}\{ (H^{-1} G_{OC} P_C - P_O)^T W_O (H^{-1} G_{OC} P_C - P_O) \} \qquad (16)

where H = \mathrm{diag}(G_{OC} \mathbf{1}_{C \times O}) and \mathrm{diag}() denotes the operation of taking the diagonal matrix. \mathbf{1}_C represents an all-ones vector with C \times 1 elements, i.e., a vector of length C all of whose coefficients equal +1, and \mathbf{1}_{C \times O} represents an all-ones matrix with C \times O elements.

The distance-error term E_D is written analogously (equation (17)) in terms of W_O and a diagonal matrix \Lambda_O whose diagonal elements collect the gain-weighted object-to-cluster distances of equation (3).

E_N = \mathrm{tr}\{ (G_{OC} \mathbf{1}_C - \mathbf{1}_O)^T W_O (G_{OC} \mathbf{1}_C - \mathbf{1}_O) \} \qquad (18)

E_R = \mathrm{tr}\{ (G_{OC} G_{CS} - G_{OS}) N_S (G_{OC} G_{CS} - G_{OS})^T W_O \} \qquad (19)

where N_S represents the diagonal matrix with diagonal elements n_s, G_{OS} represents the matrix whose rows contain the gains for rendering the o-th object to the reference loudspeakers, and G_{CS} represents the matrix containing the cluster-to-speaker gains.

E_C = \mathrm{tr}\{ (\hat{P}_O^{(t)} - \tilde{P}_O^{(t)})^T W_O (\hat{P}_O^{(t)} - \tilde{P}_O^{(t)}) \} \qquad (20)

With the terms defined above, the details of the determination process are given in the following description.
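As a numerical consistency check, the matrix form of E_P can be verified to agree with the element-wise form (2), under the H-normalization defined above (an illustrative sketch, not part of the disclosure):

```python
import numpy as np

rng = np.random.default_rng(1)
O, C = 5, 3
G = rng.random((O, C)) + 0.1   # positive object-to-cluster gains
P_C = rng.random((C, 3))       # cluster positions
P_O = rng.random((O, 3))       # object positions
w = rng.random(O) + 0.1        # object weights

# Element-wise form, eq. (2), with the normalized reconstruction (1).
P_hat = (G @ P_C) / G.sum(axis=1, keepdims=True)
ep_sum = float(np.sum(w * np.sum((P_hat - P_O)**2, axis=1)))

# Matrix form, eq. (16): tr{(H^-1 G P_C - P_O)^T W_O (H^-1 G P_C - P_O)},
# with H = diag of the row sums of G.
H = np.diag(G.sum(axis=1))
W_O = np.diag(w)
D = np.linalg.inv(H) @ G @ P_C - P_O
ep_tr = float(np.trace(D.T @ W_O @ D))

print(abs(ep_sum - ep_tr) < 1e-12)  # -> True
```

The two forms coincide because H^{-1} G_{OC} P_C is exactly the per-object normalized reconstruction, and the trace of D^T W_O D sums the weighted squared row norms of D.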
Returning to fig. 1, in step S104, a cluster signal to be rendered is generated based on the cluster position and the object-to-cluster gain determined in steps S102 and S103. The generated cluster signal typically has a number of clusters that is much smaller than the number of audio objects contained in the audio content or audio signal, such that the demand on computational resources for rendering the auditory scene is significantly reduced.
Fig. 2 illustrates an example flow 200 of object-based audio signal processing according to an example embodiment.
Block 210 represents the creation of the audio objects, audio beds, and metadata within the audio content to be processed according to an example embodiment. Block 220 performs the clustering process that groups the plurality of audio objects into a relatively small number of clusters. At block 230, the cluster signals are output together with the newly generated metadata for rendering by block 240, which represents the renderer of a particular audio playback system. In other words, FIG. 2 shows an overview of an ecosystem involving creation 210, clustering 220, distribution 230, and rendering 240. After clustering, the cluster signals and metadata may be distributed to multiple renderers aimed at different speaker playback settings or headphone reproduction.
It may be assumed that the audio content is represented by beds (static objects, or conventional soundtracks) and (dynamic) objects. The objects comprise audio signals and associated metadata indicating spatial rendering information as a function of time. To reduce the data rate of the many beds and objects, clustering is applied with the beds and objects as inputs, and a smaller set of objects (referred to as clusters) is generated to represent the original content in a data-efficient manner.
The clustering process typically comprises both determining a set of cluster positions and aggregating (or rendering) the objects into the clusters. These two processes are closely interdependent, since the rendering of objects into clusters depends on the cluster positions, while the overall presentation quality depends on both the cluster positions and the object-to-cluster gains. It is therefore desirable to optimize the cluster positions and the object-to-cluster gains jointly.
In one embodiment, the optimized object-to-cluster gains and cluster positions may be obtained by minimizing the cost function described above. However, since there is no closed-form solution that yields the optimal object-to-cluster gains and cluster positions together, one example approach is to use an EM-like (expectation-maximization) iterative process to determine them in turn. In the E-step, given the cluster positions P_C, the object-to-cluster gains G_OC are determined by minimizing the cost function; in the M-step, given the object-to-cluster gains G_OC, the cluster positions P_C are determined by minimizing the cost function. A stopping criterion is used to decide whether to continue or stop the iteration.
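By way of illustration, the EM-like alternation can be sketched with a deliberately simplified, unnormalized position-only cost whose E-step and M-step both reduce to least-squares problems; this toy cost stands in for the full cost function E and is not the closed-form solution of the disclosure:

```python
import numpy as np

def position_cost(G, P_C, P_O, w):
    """Simplified, unnormalized position-only cost standing in for E."""
    diff = G @ P_C - P_O
    return float(np.sum(w * np.sum(diff**2, axis=1)))

def gains_step(P_O, P_C):
    """E-step stand-in: per-object least-squares gains for fixed P_C."""
    return np.linalg.lstsq(P_C.T, P_O.T, rcond=None)[0].T

def positions_step(P_O, G, w):
    """M-step stand-in: weighted least-squares positions for fixed gains."""
    ws = np.sqrt(w)[:, None]
    return np.linalg.lstsq(ws * G, ws * P_O, rcond=None)[0]

def iterate_clustering(P_O, w, P_C, eps=1e-9, max_iter=20):
    """Alternate the two steps until the decline of the cost falls below
    eps (the role of the stopping criterion at block 223)."""
    G = gains_step(P_O, P_C)
    prev = cur = position_cost(G, P_C, P_O, w)
    for _ in range(max_iter):
        G = gains_step(P_O, P_C)
        P_C = positions_step(P_O, G, w)
        cur = position_cost(G, P_C, P_O, w)
        if prev - cur < eps:
            break
        prev = cur
    return G, P_C, cur

# Four coplanar objects are represented exactly by two clusters.
P_O = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
w = np.ones(4)
P_C0 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
G, P_C, cost = iterate_clustering(P_O, w, P_C0)
print(cost < 1e-12)  # -> True: the objects lie in the span of two clusters
```

Each step can only decrease this cost, so the alternation converges monotonically, mirroring the behavior expected from the E-step/M-step structure described above.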
Given the cluster positions P_C, the object-to-cluster gains G_{OC} that achieve the minimum of the cost function E may be obtained at block 222 in fig. 2 by solving:

\frac{\partial E}{\partial G_{OC}} = 0 \qquad (21)

where the contributions of metrics (A) through (E) to this derivative are given by equations (22a) through (22e) (reproduced only as images in the original publication). Solving the above equation yields the object-to-cluster gain matrix in closed form (equations (23) and (24), likewise shown only as images), in which the position, distance, and inconsistency metrics contribute no linear terms:

B_P = 0, \quad B_D = 0, \quad B_C = 0

while the normalization and presentation metrics contribute nonzero linear terms B_N and B_R, and the quadratic terms A_P, A_D, A_N, A_R, A_C include, for example,

A_R = w_o \left( 2\, G_{CS} N_S G_{CS}^T \right)

from the presentation metric; the remaining terms appear only as equation images in the original publication.
it follows that the object-to-cluster gain can be determined based on cluster position.
Given the object-to-cluster gains G_{OC}, a local minimum of the cost function E and the corresponding optimal cluster positions P_C may be obtained at block 221 in fig. 2 by solving:

\frac{\partial E}{\partial P_C} = 0 \qquad (25)

However, since there is no closed-form solution to the above equation, the optimal cluster positions P_C are obtained by the gradient descent method:

P_C^{(i+1)} = P_C^{(i)} - \sigma\, \frac{\partial E}{\partial P_C}\bigg|_{P_C = P_C^{(i)}} \qquad (26)

where i denotes the iteration number of the gradient descent and \sigma denotes the learning step. For metrics (A), (B), and (C), the gradient of each cost term can be derived in closed form (equations (27) through (32), shown only as images in the original publication), where \mathrm{tr}\{\} represents the matrix trace function, i.e., the sum of the diagonal elements of a matrix.
In these expressions, p_{cx} denotes the position of the c-th output cluster (c = 1, \ldots, C) along the x-axis of a three-dimensional Cartesian coordinate system, p_{cy} its position along the y-axis, and p_{cz} its position along the z-axis. For metric (D), the gradients (equations (33) and (34), shown only as images) follow via the chain rule through the rendering gains, where g_{c,s} represents the gain for rendering the c-th cluster to the s-th channel of the reference playback system, and the remaining factors are the gradients of the rendering gains with respect to the cluster positions.

For example, for a standard Atmos renderer, the gain may be calculated as a separable product:

g_{c,s}(p_{cx}, p_{cy}, p_{cz}) = f_{sx}(p_{cx})\, f_{sy}(p_{cy})\, f_{sz}(p_{cz}) \qquad (35)

where f_{sx}(), f_{sy}(), and f_{sz}() represent the gain functions of the Atmos renderer on the s-th channel with respect to the x, y, and z positions, respectively. The gradient for metric (E) follows analogously (equation (36)).
it follows that cluster positions can be determined based on object-to-cluster gains.
There are many ways to initialize the cluster positions for this iterative process. For example, random initialization or k-means-based initialization may be used to initialize the cluster positions for each processing frame. However, to avoid convergence to different local minima in adjacent frames, the cluster positions obtained for the previous frame may be used to initialize the cluster positions of the current frame. Furthermore, a hybrid method, for example one that selects the cluster positions with the smallest cost among several different initialization methods, may be applied to initialize the determination process.
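By way of illustration, the initialization strategies mentioned above (random, k-means-based, and warm-starting from the previous frame) can be sketched as follows; the plain k-means here is a minimal stand-in:

```python
import numpy as np

def init_random(P_O, C, rng):
    """Random initialization: pick C distinct object positions as centroids."""
    idx = rng.choice(len(P_O), size=C, replace=False)
    return P_O[idx].astype(float).copy()

def init_kmeans(P_O, C, rng, iters=10):
    """Minimal k-means on the object positions, standing in for the
    k-means-based initialization mentioned above."""
    P_C = init_random(P_O, C, rng)
    for _ in range(iters):
        d2 = np.sum((P_O[:, None, :] - P_C[None, :, :])**2, axis=2)
        labels = np.argmin(d2, axis=1)  # nearest centroid per object
        for c in range(C):
            members = P_O[labels == c]
            if len(members):            # guard against empty clusters
                P_C[c] = members.mean(axis=0)
    return P_C

def init_from_previous(P_C_prev):
    """Warm start from the previous frame's cluster positions, to avoid
    converging to different local minima in adjacent frames."""
    return P_C_prev.copy()

rng = np.random.default_rng(0)
P_O = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                [1.0, 1.0, 0.0], [0.9, 1.0, 0.0]])
P_C = init_kmeans(P_O, 2, rng)
print(sorted(np.round(P_C[:, 0], 2)))  # two x-centroids near 0.05 and 0.95
```

A hybrid scheme would simply run several of these initializers, evaluate the cost function for each, and keep the cheapest starting point.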
After either of the steps represented by blocks 221 and 222 is performed, the cost function is evaluated at block 223 to test whether its value is small enough to stop the iteration. The iteration is stopped when the value of the cost function is smaller than a predefined threshold, or when the rate of decline of the cost function value becomes very small. The predefined threshold may be set manually in advance by the user. In another embodiment, the steps represented by blocks 221 and 222 may be performed alternately until the value of the cost function, or its rate of change, falls below a predefined threshold. In some use cases, it may be sufficient to perform the steps represented by blocks 221 and 222 in fig. 2 only a predetermined number of times, rather than until the overall error reaches a threshold.
It is to be understood that the EM iterative method described above is merely an example embodiment, and that other rules may be applied to jointly estimate cluster position and object-to-cluster gain.
This iterative determination process ensures that the clusters are generated with improved accuracy, so that an immersive reproduction of the audio content can be achieved. At the same time, the reduced data-rate requirements resulting from efficient compression allow fidelity to be less compromised on any known playback system, such as speaker arrays and headphones.
Fig. 3 illustrates a system 300 for processing an audio signal comprising a plurality of audio objects according to an example embodiment. As shown, the system 300 includes an object position acquisition unit 301 configured to acquire an object position for each audio object; and a cluster position determination unit 302 configured to determine cluster positions for grouping the audio objects into clusters based on the object positions, the plurality of object-to-cluster gains, and the set of metrics. The metrics indicate a quality of the cluster positions and a quality of the object-to-cluster gains, each of the cluster positions being a centroid of a respective one of the clusters, and each of the object-to-cluster gains defining a ratio of the respective audio object in the corresponding cluster. The system 300 further comprises an object-to-cluster gain determination unit 303 configured to determine the object-to-cluster gains based on the object positions, the cluster positions, and the set of metrics; and a cluster signal generation unit 304 configured to generate a cluster signal to be presented based on the determined cluster positions and the object-to-cluster gains.
In an example embodiment, the system 300 may further comprise an alternation determination unit configured to alternately perform the determination of the cluster position and the determination of the object-to-cluster gain until a predetermined condition is satisfied. In a further embodiment, the predetermined condition may comprise at least one of: the value associated with the metric is less than a predefined threshold and a rate of change of the value associated with the metric is less than another predefined threshold.
In another example embodiment, the metrics may include at least one of: a position error between a position of the audio object reconstructed in the cluster signal and the object position; a distance error between the cluster position and the object position; a deviation of a sum of object-to-cluster gains from one; a presentation error between presenting the cluster signal to the one or more playback systems and presenting the audio signal to the one or more playback systems; and inter-frame inconsistency of the variable between the current time frame and the previous time frame. In a further example embodiment, the variable may comprise at least one of an object-to-cluster gain, a cluster position, and a position of the reconstructed audio object. Alternatively, the alternation determination unit may be further configured to alternatingly perform the determination of the cluster position and the determination of the object-to-cluster gain based on a weighted combination of the set of metrics.
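As a hedged illustration, a weighted combination of several of these metrics might look as follows in Python. The specific metric formulas and the default weights are assumptions for this sketch; the reconstructed object positions are taken as the gain-weighted sum of the cluster positions:

```python
import numpy as np

def clustering_cost(obj_pos, cluster_pos, gains, prev_gains=None,
                    weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of four of the metrics listed in the text:
    position error, object-to-cluster distance error, deviation of the
    gain sums from one, and inter-frame inconsistency of the gains."""
    w_pos, w_dist, w_norm, w_frame = weights
    # Position error between reconstructed and original object positions.
    recon = gains @ cluster_pos
    pos_err = ((obj_pos - recon) ** 2).sum()
    # Object-to-cluster squared distances, weighted by the squared gains.
    d2 = ((obj_pos[:, None, :] - cluster_pos[None, :, :]) ** 2).sum(-1)
    dist_err = (gains ** 2 * d2).sum()
    # Deviation of each object's gain sum from one.
    norm_err = ((gains.sum(axis=1) - 1.0) ** 2).sum()
    # Inter-frame inconsistency of the gains (zero if no previous frame).
    frame_err = 0.0 if prev_gains is None else ((gains - prev_gains) ** 2).sum()
    return (w_pos * pos_err + w_dist * dist_err
            + w_norm * norm_err + w_frame * frame_err)
```

The per-metric weights correspond to the weighted combination of the set of metrics mentioned in the text; tuning them trades off, for example, spatial accuracy against temporal smoothness.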
In yet another example embodiment, the system 300 may further include a cluster position initialization unit configured to initialize the cluster positions based on at least one of: randomly selecting the cluster positions; applying an initial clustering on the plurality of audio objects to obtain the cluster positions; and determining the cluster positions for a current time frame of the audio signal based on the cluster positions for a previous time frame of the audio signal.
For clarity, some optional components of the system 300 are not shown in fig. 3. It should be understood, however, that the features described above with reference to figs. 1-2 are all applicable to the system 300. Furthermore, the components of the system 300 may be hardware modules or software modules. For example, in some embodiments, the system 300 may be partially or completely implemented in software and/or firmware, e.g., as a computer program product embodied in a computer-readable medium. Alternatively or additionally, the system 300 may be partially or completely implemented in hardware, e.g., as an integrated circuit (IC), an application-specific integrated circuit (ASIC), a system on a chip (SOC), a field-programmable gate array (FPGA), or the like. The scope of the invention is not limited in this respect.
FIG. 4 illustrates a block diagram of an example computer system 400 suitable for implementing example embodiments disclosed herein. As shown, the computer system 400 includes a central processing unit (CPU) 401 capable of executing various processes in accordance with a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores, as needed, the data required when the CPU 401 executes the various processes. The CPU 401, the ROM 402, and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read therefrom is installed into the storage section 408 as needed.
In particular, according to example embodiments disclosed herein, the processes described above with reference to fig. 1 to 2 may be implemented as computer software programs. For example, example embodiments disclosed herein include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method 100. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411.
In general, the various example embodiments disclosed herein may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the example embodiments disclosed herein are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Also, blocks in the flow diagrams may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements understood to perform the associated functions. For example, example embodiments disclosed herein include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code configured to perform the method described above.
In the context of this disclosure, a machine-readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More detailed examples of a machine-readable storage medium include an electrical connection with one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical storage device, a magnetic storage device, or any suitable combination thereof.
Computer program code for carrying out methods of the present invention may be written in one or more programming languages. These computer program codes may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the computer or other programmable data processing apparatus, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. The program code may execute entirely on the computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed between one or more remote computers or servers.
Additionally, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking or parallel processing may be advantageous. Likewise, while the above discussion contains certain specific implementation details, these should not be construed as limiting the scope of any invention or claims, but rather as describing particular embodiments that may be directed to particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Various modifications, adaptations, and other embodiments of the present invention will become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings. Any and all modifications will still fall within the scope of the non-limiting and exemplary embodiments of this invention. Moreover, the foregoing description and drawings provide instructive benefits, and other example embodiments set forth herein will occur to those skilled in the art to which such embodiments pertain.
Accordingly, example embodiments disclosed herein may be embodied in any of the forms described herein. For example, the following enumerated example embodiments (EEEs) describe some of the structure, features, and functionality of some aspects of the present invention.
EEE 1. A method of processing object-based audio data, comprising:
determining a cost function based on a plurality of metrics for combining a first plurality of audio objects into a second plurality of audio objects; and
combining the first plurality of audio objects into the second plurality of audio objects by jointly optimizing spatial positions and rendering gains of the second plurality of audio objects to minimize the cost function.
EEE 2. The method according to EEE 1, wherein the plurality of metrics comprises at least one of:
Spatial representation
Timbre preservation
Loudness preservation
Mono quality
Temporal smoothness
EEE 3. The method according to EEE 2, wherein the spatial representation can be measured by the position error of the reconstructed objects.
EEE 4. The method according to EEE 2, wherein the timbre preservation can be measured by the object-to-cluster distance.
EEE 5. The method according to EEE 2, wherein the loudness preservation can be measured by the object-to-cluster gain normalization error.
EEE 6. The method according to EEE 2, wherein the mono quality can be measured by a rendering error on one or more predefined reference playback systems.
EEE 7. The method according to EEE 2, wherein the temporal smoothness can be measured by an inter-frame inconsistency of at least one variable in the clustering result.
EEE 8. The method according to EEE 7, wherein the variable may be an object-to-cluster gain, a cluster position, or a reconstructed object position.
EEE 9. The method according to EEE 1, wherein the cost function may be a combination of cost terms based on the plurality of metrics.
EEE 10. The method according to EEE 9, wherein different weights are applied to the cost terms of the plurality of metrics.
EEE 11. The method according to EEE 10, wherein the different weights are determined in response to human input.
EEE 12. The method according to EEE 11, wherein an EM-like iterative optimization method can be used to minimize the cost function.
EEE 13. The method according to any of the preceding EEEs, wherein one or more reference speaker settings are determined by human input.
EEE 14. The method according to any of the preceding EEEs, wherein the reference renderer may be any one of a speaker renderer or a headphone renderer.

Claims (13)

1. A method of processing an audio signal comprising a plurality of audio objects, comprising:
obtaining an object position for each of the audio objects;
determining cluster positions for grouping the audio objects into clusters based on the object positions and a set of metrics indicative of a quality of the cluster positions and a quality of the object-to-cluster gains, given a plurality of object-to-cluster gains, each of the cluster positions being a centroid of a respective one of the clusters, and the plurality of object-to-cluster gains being indicative for each of the audio objects for determining a reconstructed object position of the audio object from the cluster positions of the clusters;
determining the plurality of object-to-cluster gains based on the set of object positions and metrics given the cluster position, wherein the steps of determining cluster positions and determining object-to-cluster gains are interdependent and are part of an iterative process until a predetermined condition is satisfied; and
generating a cluster signal based on the determined cluster position and the object-to-cluster gain;
wherein the metric comprises at least one of:
a position error between a position of an audio object reconstructed in the cluster signal and the object position;
a distance error between the cluster location and the object location;
a deviation of a sum of the object-to-cluster gains from a value of 1;
a presentation error between presenting the cluster signal to one or more playback systems and presenting the audio signal to the one or more playback systems; and
an inter-frame inconsistency of the variable between a current time frame and a previous time frame;
wherein the variable comprises at least one of the object-to-cluster gain, the cluster position, and the position of the reconstructed audio object.
2. The method of claim 1, further comprising:
alternately performing the determination of the cluster position and the determination of the object-to-cluster gain until the predetermined condition is satisfied.
3. The method of claim 2, wherein the predetermined condition comprises at least one of:
a value associated with the metric being less than a predefined threshold, and
a rate of change of the value associated with the metric being less than another predefined threshold.
4. The method of claim 2, wherein the alternately performing the determination of the cluster location and the determination of the object-to-cluster gain is based on a weighted combination of the set of metrics.
5. The method of any of claims 1 to 3, further comprising:
initializing the cluster location based on at least one of:
randomly selecting the cluster location;
applying an initial clustering on the plurality of audio objects to obtain the cluster position; and
determining the cluster position for a current time frame of the audio signal based on the cluster position for a previous time frame of the audio signal.
6. The method of claim 1, wherein
A large object-to-cluster gain for an audio object relative to a cluster indicates that the audio object is in the vicinity of the cluster, and vice versa;
an object-to-cluster gain for the audio object relative to a cluster having a cluster position represents a gain for rendering the audio object to the cluster position of the cluster; and/or
The plurality of object-to-cluster gains comprises an object-to-cluster gain for each audio object of the plurality of audio objects relative to each cluster of the clusters.
7. A method of processing an audio signal comprising a plurality of audio objects, comprising:
obtaining an object position for each of the audio objects;
determining cluster positions for grouping the audio objects into clusters based on the object positions and a set of metrics indicative of a quality of the cluster positions and a quality of the object-to-cluster gains, given a plurality of object-to-cluster gains, each of the cluster positions being a centroid of a respective one of the clusters, and the plurality of object-to-cluster gains being indicative for each of the audio objects for determining a reconstructed object position of the audio object from the cluster positions of the clusters;
determining the plurality of object-to-cluster gains based on the object locations and the set of metrics given the cluster locations, wherein the steps of determining cluster locations and determining object-to-cluster gains are interdependent and are part of an iterative process until a predetermined condition is satisfied; and
generating a cluster signal based on the determined cluster position and the object-to-cluster gain;
wherein
p̂_c is a vector representing the cluster position of the c-th cluster;
g_{o,c} is the object-to-cluster gain of the o-th object relative to the c-th cluster; and
p̃_o is a vector representing the reconstructed object position of the o-th object, where
p̃_o = Σ_c g_{o,c} p̂_c.
8. A system for processing an audio signal comprising a plurality of audio objects, comprising:
an object position acquisition unit configured to acquire an object position for each of the audio objects;
a cluster position determination unit configured to: determining cluster positions for grouping the audio objects into clusters based on the object positions and a set of metrics indicative of a quality of the cluster positions and a quality of the object-to-cluster gains, given a plurality of object-to-cluster gains, each of the cluster positions being a centroid of a respective one of the clusters, and the plurality of object-to-cluster gains being indicative for each of the audio objects for determining a reconstructed object position of the audio object from the cluster positions of the clusters;
an object-to-cluster gain determination unit configured to determine the plurality of object-to-cluster gains based on the object locations and the set of metrics given the cluster locations, wherein the steps of determining cluster locations and determining object-to-cluster gains are interdependent and are part of an iterative process until a predetermined condition is satisfied; and
a cluster signal generation unit configured to generate a cluster signal based on the determined cluster position and the object-to-cluster gain;
wherein the metric comprises at least one of:
a position error between a position of an audio object reconstructed in the cluster signal and the object position;
a distance error between the cluster location and the object location;
a deviation of a sum of the object-to-cluster gains from a value of 1;
a presentation error between presenting the cluster signal to one or more playback systems and presenting the audio signal to the one or more playback systems; and
an inter-frame inconsistency of the variable between a current time frame and a previous time frame;
wherein the variable comprises at least one of the object-to-cluster gain, the cluster position, and the position of the reconstructed audio object.
9. The system of claim 8, further comprising:
an alternation determination unit configured to alternately perform the determination of the cluster position and the determination of the object-to-cluster gain until the predetermined condition is satisfied.
10. The system of claim 9, wherein the predetermined condition comprises at least one of:
a value associated with the metric being less than a predefined threshold, and
a rate of change of the value associated with the metric being less than another predefined threshold.
11. The system of claim 9, wherein the alternation determination unit is further configured to alternately perform the determination of the cluster position and the determination of the object-to-cluster gain based on a weighted combination of the set of metrics.
12. The system of any of claims 8 to 10, further comprising:
a cluster position initialization unit configured to initialize the cluster position based on at least one of:
randomly selecting the cluster location;
applying an initial clustering on the plurality of audio objects to obtain the cluster position; and
determining the cluster position for a current time frame of the audio signal based on the cluster position for a previous time frame of the audio signal.
13. A computer-readable medium storing a computer program executable by a processor to implement the steps of the method according to any one of claims 1 to 6.
CN201510484949.8A 2015-08-07 2015-08-07 Processing object-based audio signals Active CN106385660B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201510484949.8A CN106385660B (en) 2015-08-07 2015-08-07 Processing object-based audio signals
PCT/US2016/045512 WO2017027308A1 (en) 2015-08-07 2016-08-04 Processing object-based audio signals
US15/749,750 US10277997B2 (en) 2015-08-07 2016-08-04 Processing object-based audio signals
EP16751763.0A EP3332557B1 (en) 2015-08-07 2016-08-04 Processing object-based audio signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510484949.8A CN106385660B (en) 2015-08-07 2015-08-07 Processing object-based audio signals

Publications (2)

Publication Number Publication Date
CN106385660A CN106385660A (en) 2017-02-08
CN106385660B true CN106385660B (en) 2020-10-16

Family

ID=57916386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510484949.8A Active CN106385660B (en) 2015-08-07 2015-08-07 Processing object-based audio signals

Country Status (1)

Country Link
CN (1) CN106385660B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102506167B1 (en) * 2017-04-25 2023-03-07 소니그룹주식회사 Signal processing device and method, and program
CN110166927B (en) * 2019-05-13 2020-05-12 武汉大学 Virtual sound image reconstruction method based on positioning correction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361405A (en) * 2006-01-03 2009-02-04 Slh音箱公司 Method and system for equalizing a loudspeaker in a room
CN103593430A (en) * 2013-11-11 2014-02-19 胡宝清 Clustering method based on mobile object spatiotemporal information trajectory subsections
WO2015017037A1 (en) * 2013-07-30 2015-02-05 Dolby International Ab Panning of audio objects to arbitrary speaker layouts
WO2015105748A1 (en) * 2014-01-09 2015-07-16 Dolby Laboratories Licensing Corporation Spatial error metrics of audio content

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361405A (en) * 2006-01-03 2009-02-04 Slh音箱公司 Method and system for equalizing a loudspeaker in a room
WO2015017037A1 (en) * 2013-07-30 2015-02-05 Dolby International Ab Panning of audio objects to arbitrary speaker layouts
CN103593430A (en) * 2013-11-11 2014-02-19 胡宝清 Clustering method based on mobile object spatiotemporal information trajectory subsections
CN103593430B (en) * 2013-11-11 2017-03-22 胡宝清 Clustering method based on mobile object spatiotemporal information trajectory subsections
WO2015105748A1 (en) * 2014-01-09 2015-07-16 Dolby Laboratories Licensing Corporation Spatial error metrics of audio content

Also Published As

Publication number Publication date
CN106385660A (en) 2017-02-08

Similar Documents

Publication Publication Date Title
US11736890B2 (en) Method, apparatus or systems for processing audio objects
US11470437B2 (en) Processing object-based audio signals
EP3332557B1 (en) Processing object-based audio signals
JP6330034B2 (en) Adaptive audio content generation
CN112262585A (en) Ambient stereo depth extraction
JP7362826B2 (en) Metadata preserving audio object clustering
US10362426B2 (en) Upmixing of audio signals
US10278000B2 (en) Audio object clustering with single channel quality preservation
CN106385660B (en) Processing object-based audio signals
CN117837173A (en) Signal processing method and device for audio rendering and electronic equipment
US10779106B2 (en) Audio object clustering based on renderer-aware perceptual difference
WO2018017394A1 (en) Audio object clustering based on renderer-aware perceptual difference
RU2773512C2 (en) Clustering audio objects with preserving metadata

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant