CN117223057A - Dynamic range adjustment of spatial audio objects - Google Patents


Info

Publication number
CN117223057A
Authority
CN
China
Prior art keywords
presentation
audio
gain
signal level
time period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280031384.6A
Other languages
Chinese (zh)
Inventor
D·J·布瑞巴特
B·G·克罗克特
R·M·福瑞德瑞驰
J·R·格拉斯高
D·C·琼斯
E·W·耶尔甘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp filed Critical Dolby Laboratories Licensing Corp
Publication of CN117223057A


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/008 Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03G CONTROL OF AMPLIFICATION
    • H03G 9/00 Combinations of two or more types of control, e.g. gain control and tone control
    • H03G 9/005 Combinations of two or more types of control, e.g. gain control and tone control, of digital or coded signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems


Abstract

The present disclosure relates to a method and an audio processing system for performing dynamic range adjustment of spatial audio objects. The method comprises obtaining (step S1) a plurality of spatial audio objects (10), obtaining (step S2) at least one rendered audio presentation of the spatial audio objects (10), and determining (step S3) signal level data associated with each of a set of presentation audio channels. The method further comprises obtaining (step S31) a threshold value and for each time period selecting (step S4) a selected rendered audio channel associated with the highest or lowest signal level, determining (step S5) a gain based on the threshold value and a representation of the signal level of the selected audio channel, and applying (step S6) the gain for each time period to the corresponding time period of the spatial audio object.

Description

Dynamic range adjustment of spatial audio objects
Cross Reference to Related Applications
The present application claims priority from U.S. provisional application No. 63/194,359, filed on 28 May 2021, the entire contents of which are incorporated herein by reference.
Field of the application
The present application relates to a method for performing dynamic range adjustment of a spatial audio object, and an audio processing system employing the method.
Background
In the field of audio mastering, a mastering engineer typically receives a rendered audio presentation and performs, for example, equalization or other forms of audio processing to make it suitable for playback on a target playback system, such as headphones or a home cinema audio system. For example, if the audio presentation is a high quality stereo signal recorded in a professional recording studio, the mastering engineer may need to modify the dynamic range or equalization of the high quality stereo signal to obtain a mastered stereo signal that is more suitable for low bit rate encoding and/or playback on a simple stereo device (e.g., earbuds).
During mastering, and in particular during music mastering, different forms of peak limiters are used to ensure that the rendered audio signal does not exceed a peak threshold. Furthermore, the use of peak limiters is an effective tool to change the dynamic range or other properties of the audio signal of the rendered presentation, which will affect how the end user perceives the mastered presentation.
In a similar manner, audio compressors are used in the mastering process to achieve upward and/or downward compression of the rendered presentation audio signals. For example, a downward compressor applies attenuation to audio signals whose level exceeds a predetermined threshold, where the applied attenuation increases, e.g. linearly, as the signal level exceeds the threshold. The compressor thus generally ensures that higher signal levels incur more aggressive attenuation, and vice versa for an expander.
With the introduction of object-based audio content represented by multiple audio objects, the same object-based audio content may be rendered into a multitude of different presentations, such as a stereo presentation or a multi-channel presentation, such as a 5.1 or 7.1 presentation. While this enables flexibility in rendering the same audio content to different presentations while providing an enhanced spatial audio experience, this flexibility presents problems for audio mastering. Since the presentation to which object-based audio is to be rendered is not predetermined, there is no single presentation to which the peak limiter or compressor of the mastering process can be applied.
Disclosure of Invention
One disadvantage of presently proposed methods for mastering object-based audio content is that the process is typically not lossless and may introduce undesirable audio artifacts in presentations other than the single presentation that has been mastered. In addition, presently proposed methods for mastering object-based audio content do not allow a mastering engineer to listen to the result of the mastering process substantially in real time; moreover, the mastering engineer can only work on one predetermined presentation of the object-based audio content at a time. For example, if a mastering engineer were to create a mastered stereo presentation and a mastered 5.1 presentation of the same spatial audio content, the engineer would need to perform two separate mastering processes, one for each of the two different presentations.
These drawbacks of the prior art for audio mastering result in a cumbersome and repetitive workflow when mastering object-based audio content, while the resulting mastered object-based audio content may still contain undesirable audio artifacts in presentations other than the selected few presentation formats analyzed by the mastering engineer.
It is therefore an object of the present disclosure to provide an improved method and audio processing system for performing dynamic range adjustment of spatial audio objects.
According to a first aspect of the present invention, a method for performing dynamic range adjustment of a spatial audio object is provided. The method includes obtaining a plurality of spatial audio objects, obtaining a threshold, and obtaining at least one rendered audio presentation of the spatial audio objects, wherein the at least one rendered audio presentation includes at least one rendered audio channel forming a set of rendered audio channels. The method also includes determining signal level data associated with each of the set of presentation audio channels, wherein the signal level data represents signal levels for a plurality of time periods of the presentation audio channels, and for each time period, selecting a selected presentation audio channel that is a presentation audio channel of the set of presentation audio channels that is associated with a highest signal level or a lowest signal level of the time period compared to other presentation audio channels of the set of presentation audio channels. For the selected presentation channel, the method further includes determining a gain based on the threshold and the representation of the signal level of the selected audio channel, and applying the gain for each time period to a corresponding time period of each spatial audio object to form a dynamic range adjusted spatial audio object.
Gain refers to a modification of the signal amplitude and/or power level. It should be appreciated that the modification may involve an increase or decrease in signal amplitude and/or power level. That is, the term "gain" encompasses both amplification gain (meaning an increase in amplitude and/or power) and attenuation (meaning a decrease in amplitude and/or power). To emphasize this point, the broad term "gain" will be referred to in some cases as "attenuation and/or gain" or "attenuation/gain".
That is, the method involves determining, for each time period, the highest/lowest signal level across all presentation channels in the set of presentation channels, and determining the attenuation/gain based on the highest/lowest signal level and the threshold for each time period. The determined attenuation/gain is applied to the corresponding time period of each of the plurality of spatial audio objects to form dynamic range adjusted spatial audio objects, which in turn may be rendered into any presentation format.
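By way of illustration, the following Python sketch implements this core loop for the peak-limiting case under simplifying assumptions (static rendering gains, per-period peak levels, hypothetical array shapes); it is a minimal sketch, not the implementation described in this disclosure.

```python
import numpy as np

def dynamic_range_adjust(objects, render_gains, threshold, frame=1024):
    """Minimal sketch: render, find the per-period peak over all
    presentation channels, derive a limiting gain, apply it to all objects.

    objects:      (num_objects, num_samples) object audio signals
    render_gains: (num_objects, num_channels) static rendering gains (assumed)
    """
    channels = render_gains.T @ objects          # rendered presentation channels
    adjusted = objects.copy()
    for start in range(0, channels.shape[1], frame):
        seg = channels[:, start:start + frame]
        level = np.abs(seg).max()                # highest signal level this period
        gain = min(1.0, threshold / level) if level > 0 else 1.0
        adjusted[:, start:start + frame] *= gain # same gain for every object
    return adjusted
```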
Determining the attenuation/gain may include determining the attenuation/gain so as to implement at least one of: a peak limiter, a bottom limiter (the opposite of a peak limiter), an upward compressor, a downward compressor, an upward expander, a downward expander, and smoothed versions thereof. In some implementations, the threshold is obtained together with a ratio indicating the amount of attenuation/gain to be applied for signal levels above/below the threshold. Furthermore, the attenuation/gain may be based on additional signal levels in addition to the highest/lowest signal level.
For example, the attenuation/gain may be based on a combination, e.g. a weighted average, of the signal levels of all presentation channels in each time period, or of the two, three, four or more highest/lowest presentation audio channels in each time period. In such implementations, the step of selecting a presentation channel is replaced with the following step: for each time period, an average signal level over all presentation channels in the set of presentation channels is calculated, wherein the attenuation/gain is based on the average signal level and the obtained threshold.
The present disclosure is based, at least in part, on the following understanding: by selecting the highest/lowest rendering channel and determining the attenuation/gain based on the signal level of the selected rendering channel, dynamic range adjusted spatial audio objects may be created that will include dynamic range adjustments for any rendering format into which they are rendered. In addition, the above-described method facilitates an efficient workflow for a mastering engineer handling spatial audio objects, because the adjusted spatial audio objects can be rendered into any number of presentation formats while performing dynamic range adjustments, thereby allowing the mastering engineer to listen to the adjustments and easily switch presentation formats during mastering.
In some implementations, at least two rendered presentations are obtained, where each rendered audio presentation includes at least one presentation audio channel. Thus, the step of selecting a presentation channel may occur across two or more differently presented presentation audio channels. For example, the attenuation/gain may also be based on a representation of the signal level of a second selected rendering channel, where the second selected rendering channel is a different rendering than the selected audio channel. As described above, more than one signal level may be combined, wherein a combination of two or more signal levels is used to determine the attenuation gain.
A significantly different method of enabling mastering of object-based audio content is disclosed in WO2021007246, which involves rendering the audio content as a single presentation and allowing a mastering engineer or mastering process to perform audio processing on the single presentation to form a master presentation. By comparing the master presentation with the original presentation, differences between the master presentation and the original presentation may be extracted, wherein object-based audio content is mastered based on the determined differences.
Drawings
The present invention will be described in more detail with reference to the accompanying drawings, which show a presently preferred embodiment of the invention.
Fig. 1 is a block diagram illustrating an audio processing system for performing dynamic range adjustment of spatial audio objects according to some implementations.
Fig. 2 is a flow chart illustrating a method for performing dynamic range adjustment of spatial audio objects according to some implementations.
FIG. 3 is a block diagram illustrating an audio processing system for performing dynamic range adjustment of spatial audio objects having three renderers, each rendering the spatial audio objects into a different rendered presentation, according to some implementations.
Fig. 4 is a block diagram illustrating an audio processing system for performing dynamic range adjustment of spatial audio objects in different sub-band representations extracted by an analysis filter bank, according to some implementations.
Fig. 5 is a block diagram illustrating an audio processing system for performing dynamic range adjustment of spatial audio objects with fast and slow gains calculated in side chains, according to some implementations.
Fig. 6 is a block diagram illustrating user manipulation of output renderer parameters and/or side-chain parameters to modify dynamic range adjustment implemented by an audio processing system according to some implementations.
Detailed Description
The systems and methods disclosed in this disclosure may be implemented as software, firmware, hardware, or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division of physical units; rather, one physical component may have multiple functions, and one task may be cooperatively performed by several physical components.
The computer hardware may be, for example, a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Furthermore, the term computer hardware shall be taken to include any collection of computer hardware that individually or jointly executes instructions to perform any one or more of the concepts discussed herein.
Some or all of the components may be implemented by one or more processors that accept computer readable (also referred to as machine readable) code containing a set of instructions that, when executed by the one or more processors, perform at least one of the methods described herein. Any processor capable of executing (sequentially or otherwise) a set of instructions specifying an operation to be taken is included. Thus, one example is a typical processing system (i.e., computer hardware) that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may also include a memory subsystem including a hard disk drive, SSD, RAM, and/or ROM. A bus subsystem may be included for communicating between the components. The software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.
One or more processors may operate as stand-alone devices or may be connected (e.g., networked) to other processor(s). Such networks may be established over a variety of different network protocols and may be the internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
The software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, various forms of physical (non-transitory) storage media, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those skilled in the art, communication media (transitory) typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.
An audio processing system for dynamic range adjustment according to some implementations will be discussed with reference to fig. 1 and 2.
The plurality of spatial audio objects 10 comprises a plurality of audio signals associated with a (dynamic) spatial position. The spatial position may be represented using metadata associated with the plurality of audio signals, wherein the metadata indicates, for example, how the audio objects (audio signals) move in three-dimensional space. The collection of spatial audio objects 10 is referred to as an object-based audio asset. The object-based audio asset includes, for example, 2, 10, 20, or more spatial audio objects, such as 50 or 100 spatial audio objects, having time-varying locations indicated by associated spatial metadata.
In step S1, the spatial audio objects 10 are obtained and provided to a side chain 30 of the audio processing system, which comprises at least one renderer 31, a signal level analyzer 32 and a gain calculator 33. In step S2, the renderer 31 renders the audio objects 10 into a predetermined audio presentation comprising at least one presentation audio channel, forming a set of presentation audio channels. The predetermined audio presentation may be set, for example, by a mastering engineer or by a preset audio presentation setting of the renderer 31. In another example, the predetermined audio presentation may be set based on the type of audio content represented by the spatial audio objects 10 (such as music, speech, or a movie soundtrack).
For example, the renderer 31 renders the spatial audio objects as at least one presentation selected from the group consisting of: a mono presentation (one channel), a stereo presentation (two channels), a binaural presentation (two channels), a 5.1 presentation (six channels), a 7.1 presentation (eight channels), a 5.1.2 presentation (eight channels), a 5.1.4 presentation (ten channels), a 7.1.2 presentation (ten channels), a 7.1.4 presentation (twelve channels), a 9.1.2 presentation (twelve channels), a 9.1.4 presentation (fourteen channels), a 9.1.6 presentation (sixteen channels), and a multi-channel presentation having at least three height levels (e.g., a 22.2 presentation having twenty four channels and three height levels above, at, and below the ear level). It should be noted that these presentations are merely exemplary, and that renderer 31 may render the spatial audio objects as one or more arbitrary presentations having any number of presentation channels.
In some implementations, each presentation comprises at least two presentation audio channels, meaning that the renderer 31 is configured to render the spatial audio objects as a presentation selected from the above-mentioned group excluding a mono presentation alternative (one channel).
The presentation audio channel(s) and audio signal of each spatial audio object 10 are represented by a sequence of time periods. The time period may be a single sample, a frame, a group of two or more frames, or a predetermined time portion of an audio channel. Further, the time periods may partially overlap such that the time periods are, for example, 10 millisecond frames with 30% overlap.
The renderer 31 receives spatial audio objects x_i[n], with audio object index i and time period index n, and calculates presentation channels s_j,k[n], with presentation index j and speaker feed index k, based on the metadata M_i[n] for object index i. Each presentation includes at least one presentation audio channel intended to be played using a speaker with an associated speaker feed index k. For example, for a stereo presentation k = 1, 2, and the first presentation audio channel (left stereo channel) is associated with the speaker feed having index k = 1 while the second presentation audio channel (right stereo channel) is associated with the speaker feed having index k = 2. In some implementations, only one presentation is used, and the index j may be omitted since there is only one presentation with k speaker feeds (presentation channels). The renderer 31 converts the (possibly time-varying) metadata M_i[n] into a (possibly time-varying) rendering gain vector g_i,k[n] for each object index i and speaker feed index k, and calculates the presentation channels according to

s_j,k[n] = Σ_i x_i[n] g_i,k[n]    (1)
where the mapping from the metadata M_i[n] to the rendering gain vectors g_i,k[n] generally depends on the desired output presentation format. In general, the renderer 31 performs the mapping of the spatial audio objects 10 (i.e., x_i[n]) to the presentation channels s_j,k[n] in a frequency-varying manner. For example, when rendering the spatial audio objects 10 into a binaural presentation format having two presentation channels, the mapping of the spatial audio objects 10 to each respective binaural channel will be frequency dependent, e.g. taking into account frequency dependent Head Related Transfer Functions (HRTFs). In another example, the audio presentation is intended to be played using speakers with different properties, which means that the renderer 31 may emphasize some frequencies for some speaker feeds (presentation channels). It is contemplated that, for presentations intended to be played on e.g. low-performance audio devices, the high-frequency and/or low-frequency content of the spatial audio objects 10 may be suppressed. Furthermore, it is contemplated that, e.g. for a 5.1 presentation, the low-frequency content of the spatial audio objects 10 may be rendered to the LFE channel, whereas the high frequencies are emphasized for the center, left and/or right channels. However, in some simple cases, the renderer 31 performs the rendering in a frequency-invariant manner.
In many cases, although not all cases, the number of spatial audio objects 10 is greater than the number of speaker feeds k.
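As a sketch, equation (1) can be written as a single einsum over per-sample rendering gains; the shapes and the random placeholder data are assumptions for illustration only:

```python
import numpy as np

num_objects, num_channels, num_samples = 16, 6, 48000
x = np.random.randn(num_objects, num_samples)              # object signals x_i[n]
g = np.random.rand(num_objects, num_channels, num_samples) # gains g_i,k[n] from metadata M_i[n]

# s_k[n] = sum_i x_i[n] * g_i,k[n]  (equation (1), for a single presentation j)
s = np.einsum('in,ikn->kn', x, g)
```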
In step S3, the rendered presentation audio channels are provided to the signal level analyzer 32, and the signal level analyzer 32 first determines signal level data associated with each of the set of rendered audio channels. The signal level data is indicative of at least one representation or measure of the signal level of each time period of each presentation channel, wherein the signal level data is for example at least one of: RMS representation of signal level/power for a time period, amplitude/power for a time period, maximum amplitude/power for a time period, and average amplitude/power for a time period. The signal level data may be determined using any suitable method and in the simple case where each presented audio signal is represented as a time domain waveform sample, the signal level data is simply the amplitude (signal) level of each sample. In another example, where the presentation audio channel is represented by a series of (possibly overlapping) frequency domain frames, the signal level may be determined as a function of the spectral energy of each frame.
In addition, the signal level analyzer 32 uses the signal level data to determine the maximum or minimum signal level max[n] or min[n] occurring in the set of presentation audio signals for each time period. Alternatively, the signal level analyzer 32 determines an average signal level avg[n] over at least two presentation channels (e.g. all presentation channels), where the average signal level avg[n] may be a weighted average. It should be appreciated that although determining the signal level data first and then determining the maximum, minimum or average signal level max[n], min[n], avg[n] from the signal level data is described as two sub-steps, the maximum, minimum or average signal level max[n], min[n], avg[n] may be determined directly from the presentation audio channels in a single step.
In step S4, a presentation audio channel is selected for each time period from the set of presentation audio channels. For example, the presentation channel associated with the maximum signal level max[n] or the minimum signal level min[n] is selected by the signal level analyzer 32. Alternatively, step S4 may comprise determining, by the signal level analyzer 32, an average signal level avg[n] of at least two presentation audio channels. For example, using the average signal level avg[n] may result in the dynamic range adjusted spatial audio objects being compressed or expanded less aggressively (while possibly allowing some presentation channels to remain above a target high signal level or below a target low signal level). Using the maximum signal level max[n] or the minimum signal level min[n] effectively ensures that no channel remains above the target high signal level or below the target low signal level, although the compression or expansion is then more aggressive and may introduce artifacts compared to using the average signal level avg[n].
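A sketch of steps S3 and S4, assuming RMS as the signal level representation and non-overlapping periods (the frame length and array shapes are illustrative choices):

```python
import numpy as np

def level_analysis(channels, frame=480):
    """Per-period RMS level per presentation channel.
    channels: (num_channels, num_samples) -> (num_channels, num_periods)."""
    k, n = channels.shape
    periods = n // frame
    seg = channels[:, :periods * frame].reshape(k, periods, frame)
    return np.sqrt((seg ** 2).mean(axis=-1))

levels = level_analysis(np.random.randn(6, 48000))  # e.g. a 5.1 presentation
max_level = levels.max(axis=0)    # max[n]: loudest channel per period
min_level = levels.min(axis=0)    # min[n]: quietest channel per period
avg_level = levels.mean(axis=0)   # avg[n] (could also be a weighted average)
```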
In step S5, the attenuation/gain calculator 33 determines attenuation or gain based on the signal level of the selected presentation signal (or the average signal level of two or more presentation signals), and outputs information indicating the determined attenuation or gain to the attenuation/gain application unit 22.
In some embodiments, step S5 involves the gain calculator 33 comparing the signal level (e.g., max[n], min[n], or avg[n]) obtained from the signal level analyzer 32 with the obtained threshold, and calculating a gain that reduces the peak max[n] to the threshold, or increases the minimum signal level min[n] to the threshold. That is, the attenuation/gain calculator 33 may be configured to calculate a gain or attenuation for performing at least one of upward peak limiting and downward peak limiting to adjust the dynamic range of the spatial audio objects 10.
In another embodiment, step S5 involves the gain calculator 33 comparing the min[n] or avg[n] signal level obtained in step S4 with the obtained threshold; if the min[n] or avg[n] signal level is below the threshold, the gain calculator 33 indicates that the time period should be attenuated (e.g., completely muted). For example, such a gain calculator may be used to implement a downward expansion, such as completely muting any time period having an associated signal level below the threshold.
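Both gain rules of step S5 can be sketched as follows, assuming the per-period levels computed above; the epsilon guard and the floor_gain parameter are illustrative additions, not part of this disclosure:

```python
import numpy as np

def limiter_gain(max_level, threshold, eps=1e-12):
    """Attenuate periods whose peak level exceeds the threshold (peak limiting)."""
    return np.minimum(1.0, threshold / np.maximum(max_level, eps))

def downward_expander_gain(min_level, threshold, floor_gain=0.0):
    """Attenuate (here: mute) periods whose level falls below the threshold."""
    return np.where(min_level < threshold, floor_gain, 1.0)
```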
In step S6, the attenuation/gain application unit 22 applies the attenuation/gain to the corresponding time period of each spatial audio object 10 to form dynamic range adjusted spatial audio objects x′_i[n]. The attenuation/gain application unit 22, together with the optional delay unit 21, forms a main processing chain 20 that processes the spatial audio objects (e.g. applies gain or attenuation) in a manner controlled by the side chain 30.
In some embodiments, the threshold obtained at S31 is accompanied by an adjustment ratio coefficient indicating the attenuation/gain to be applied for signal levels above/below the threshold. Thus, the attenuation/gain calculated by the gain calculator 33 may act as a compressor or expander, with an adjustment ratio of e.g. 1:2, 1:3, 1:4, or generally 1:x (where x ∈ (1, ∞)). It should be understood that an adjustment ratio of 1:∞ corresponds to a peak limiter or a bottom limiter. For example, step S31 includes obtaining an adjustment ratio coefficient, and step S5 includes determining, with the attenuation/gain calculator 33, a threshold difference, which is the difference between a peak threshold and the signal level representation of the selected audio channel, and determining a limiting attenuation/gain from the threshold difference weighted with the adjustment ratio coefficient. The threshold and/or adjustment ratio may be based on a desired input/output curve, e.g., created by a user.
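In the dB domain, the ratio-weighted threshold difference can be sketched as below for a downward compressor; the exact gain curve used in practice may differ:

```python
import numpy as np

def compressor_gain_db(level_db, threshold_db, ratio):
    """Downward compression: above the threshold the output level rises
    1 dB per `ratio` dB of input; ratio -> inf degenerates to a limiter."""
    over = np.maximum(level_db - threshold_db, 0.0)  # threshold difference
    return -over * (1.0 - 1.0 / ratio)               # attenuation in dB

# 4:1 compression, threshold -12 dB: a -6 dB input comes out at -10.5 dB
gains = 10.0 ** (compressor_gain_db(np.array([-20.0, -6.0, 0.0]), -12.0, 4.0) / 20.0)
```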
The dynamic range adjusted spatial audio objects x′_i[n] created by the attenuation/gain applicator 22 applying the attenuation/gain may be archived, encoded, distributed, or rendered for direct listening. For example, the dynamic range adjusted spatial audio objects x′_i[n] may be provided to a storage unit 50a or sent to at least one presentation renderer 50b, such as a headphone speaker renderer (stereo renderer) or a 7.1.4 speaker renderer. Any other type of presentation renderer may also be used and is within the scope of the present disclosure.
It should be noted that although the spatial audio objects have been rendered by the renderer 31 into a predetermined nominal presentation, the spatial audio objects 10 may be rendered into a number of different presentations suitable for different speaker or headphone setups. Although the dynamic range adjusted spatial audio objects x′_i[n] are obtained by analysing a selected small number of rendered presentations, e.g. one rendered presentation, the dynamic range adjustment remains effective even when the dynamic range adjusted spatial audio objects x′_i[n] are rendered into a presentation other than the selected few presentations used in the analysis.
For example, the side chain 30 renders the spatial audio objects into a 5.1.2 presentation comprising five ear-level speaker feeds, one Low Frequency Effects (LFE) signal, and two overhead speaker feeds, which is operated on by the signal level analyzer 32 and the gain calculator 33. The resulting time-varying attenuation/gain is applied in the attenuation/gain applicator 22 to the corresponding time periods of the spatial audio objects 10 to obtain dynamic range adjusted spatial audio objects x′_i[n]. The dynamic range adjusted spatial audio objects x′_i[n] may then be stored in the memory 50a or rendered by the presentation renderer 50b into any presentation (including a 5.1.2 presentation), such as a 2.0 presentation or a 7.1.4 presentation, all of which will reflect the dynamic range adjustment.
In some implementations, the audio processing system further includes a delay unit 21 configured to form a delayed version of the spatial audio objects 10. The delay introduced by the delay unit 21 may correspond to the delay introduced by the renderer 31, the signal level analyzer 32 and/or the gain calculator 33 of the side chain 30. The delay introduced by the renderer 31 may vary greatly depending on the presentation format output by the renderer. For a time domain renderer the delay may be very short, e.g. zero or tens of samples, whereas a transform-based renderer (e.g. for rendering a binaural audio signal for headphones) may have a longer delay, ranging from hundreds to thousands of samples, e.g. from 500 to 2000 samples.
Fig. 3 illustrates an audio processing system for performing dynamic range adjustment of spatial audio objects 10 according to some implementations. As shown, the side chain 30 of the audio processing system comprises at least two renderers, e.g. three renderers 31a, 31b, 31c, wherein each renderer 31a, 31b, 31c is configured to obtain the plurality of spatial audio objects 10 and render the spatial audio objects into a respective rendered presentation, each rendered presentation comprising at least one presentation audio channel forming a set of presentation audio channels. Thus, the signal level analyzer 32 performs the signal level analysis on more than one presentation. For example, when determining the max[n], min[n], or avg[n] signal level, the signal level analyzer 32 determines max[n], min[n], or avg[n] over all presentation channels in a set of presentation channels comprising channels from two or more rendered presentations.
In some implementations, the signal level analyzer 32 determines max[n], min[n] or avg[n] over all presentation channels in a subset comprising at least two of the set of presentation channels. For example, the signal level analyzer 32 may select the maximum signal level max[n] or the minimum signal level min[n] in each presentation and determine an average of the selected maximum signal levels max[n] or minimum signal levels min[n].
For example, renderer A 31a renders the spatial audio objects 10 into a stereo presentation (s_A,k, where k = 1, 2), renderer B 31b renders the spatial audio objects 10 into a 5.1 presentation (s_B,k, where k = 1, 2, ..., 6), and renderer C 31c renders the spatial audio objects 10 into a 7.1.4 presentation (s_C,k, where k = 1, 2, ..., 12). In this example, the signal level analyzer 32 performs the analysis (e.g., determines max[n], min[n], or avg[n]) over 2 + 6 + 12 = 20 channels from three different rendered presentations.
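A sketch of pooling channels from several rendered presentations before taking max[n], with the channel counts of this example and random per-period levels standing in for the analysis results:

```python
import numpy as np

periods = 1000
levels_stereo = np.random.rand(2, periods)   # s_A: stereo
levels_5_1    = np.random.rand(6, periods)   # s_B: 5.1
levels_7_1_4  = np.random.rand(12, periods)  # s_C: 7.1.4

# One set of presentation channels spanning all three presentations (20 channels);
# max[n] is then taken over every channel of every presentation.
all_levels = np.concatenate([levels_stereo, levels_5_1, levels_7_1_4], axis=0)
max_level = all_levels.max(axis=0)
```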
Although the embodiment depicted in fig. 3 has three renderers 31a, 31b, 31c, any number of renderers, such as two renderers or at least four renderers, may be used instead of three. Further, although the renderers 31a, 31b, 31c are depicted as separate renderers, two or more rendered audio presentations may be obtained by a single renderer configured to render the spatial audio objects 10 into two or more presentations.
The attenuation/gain calculator 33 determines an attenuation/gain for each time period and supplies the determined attenuation/gain to the main chain 20 to be applied to the corresponding time period of the spatial audio object 10.
In some embodiments, the same threshold is used for each of the at least two presentations s_A,k, s_B,k, s_C,k. In other implementations, separate thresholds are obtained for each of the at least two presentations, where the attenuation/gain is based on the threshold and selected presentation audio channel of each presentation. Thus, the threshold may be set globally for all presentations, separately for each presentation, or for each subset of presentations. For example, one subset may include presentations intended to be played using headphones or earbuds, while another subset includes presentations intended to be played using the speakers of a surround system.
For example, the gain calculator 33 calculates the attenuation/gain based on the threshold and selected presentation audio channel of a first presentation in combination with the threshold and selected presentation audio channel of a second presentation. Combining the selected presentation audio channels and thresholds of at least two presentations may, for example, comprise calculating an average (or weighted average) of the attenuation/gain calculated for each presentation. For example, when calculating the attenuation for achieving downward compression, the gain calculator 33 compares the signal level of the selected audio channel with a first threshold and determines a first attenuation A_1 required for compressing the first presentation. Similarly, the gain calculator 33 determines a second attenuation A_2 required for compressing the second presentation, whereupon the gain calculator 33 combines the first and second attenuations A_1, A_2 (e.g., as an average or weighted average) into a combined attenuation that is applied by the attenuation/gain applicator 22.
The threshold for each presentation may, for example, be derived from a single threshold by taking into account how the downmix of the spatial audio objects is obtained in each presentation.
In some implementations (not shown), each renderer 31a, 31b, 31c is associated with a separate signal level analyzer 32 and/or a separate gain calculator 33. For example, each renderer 31a, 31b, 31c is associated with a separate signal level analyzer 32 that outputs signal levels min [ n ], max [ n ], avg [ n ] to a common gain calculator 33. Further, it is contemplated that each renderer 31a, 31b, 31c is associated with a separate signal level analyzer 32 and a separate gain calculator 33, whereby the gains of the separate gain calculators 33 are combined (e.g., by averaging, weighted averaging, minimum selection, maximum selection) such that the combined gain is provided to the attenuation/gain applicator 22.
Fig. 4 illustrates an audio processing system for performing dynamic range adjustment of spatial audio objects 10 according to some implementations. In the side chain 30, the spatial audio objects 10 are provided to at least one renderer 31 to form one or several rendered audio presentations. Each rendered audio presentation is provided to an analysis filter bank 41b in the side chain 30, which extracts at least two subband representations of each rendered audio presentation. In the depicted embodiment, analysis filter bank 41b extracts three subband representations of each rendered presentation output by at least one renderer 31, but two or at least four subband representations may be used in a similar manner. For each sub-band representation, a separate signal level analyzer 32a, 32b, 32c and gain calculator 33a, 33b, 33c are provided to determine the respective attenuation/gain to be applied to the corresponding time period and sub-band representation of the spatial audio object 10. To this end, an analysis filter bank 41a is used to extract the corresponding subband representation of the spatial audio object 10.
In the main chain 20, separate attenuation/gain applicators 22a, 22b, 22c (one attenuation/gain applicator per subband representation) obtain the subband representations of the spatial audio objects and the gains calculated by the gain calculators 33a, 33b, 33c to form subband representations of the dynamic range adjusted spatial audio objects. Finally, a synthesis filter bank 42 is used to combine the subband representations of the dynamic range adjusted spatial audio objects into a single set of dynamic range adjusted spatial audio objects, which is stored or provided to any presentation renderer.
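A rough sketch of this multiband structure; the Butterworth band split is an assumption (the disclosure does not specify a filter bank design, and this split is not perfectly reconstructing), and one gain per band stands in for the per-band side chains:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_bands(x, fs, edges=(200.0, 2000.0)):
    """Crude 3-band analysis filter bank (illustrative only)."""
    low  = sosfilt(butter(4, edges[0], 'lowpass',  fs=fs, output='sos'), x)
    mid  = sosfilt(butter(4, edges,    'bandpass', fs=fs, output='sos'), x)
    high = sosfilt(butter(4, edges[1], 'highpass', fs=fs, output='sos'), x)
    return [low, mid, high]

fs = 48000
objects = np.random.randn(8, fs)            # 8 spatial audio objects, 1 second
bands = split_bands(objects, fs)            # per-band object signals
band_gains = [0.8, 1.0, 0.5]                # gains from per-band side chains
adjusted = sum(g * b for g, b in zip(band_gains, bands))  # synthesis by summation
```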
The signal level analyzer 32a, 32b, 32c and the gain calculator 33a, 33b, 33c of each subband representation may be identical to the signal level analyzer 32 and the gain calculator 33 described in other parts of the present application. That is, the step of selecting the highest/lowest rendering channel or determining the average signal for each time period is performed in parallel for each subband representation. Similarly, attenuation/gain is determined for each subband representation and applied by the respective attenuation/gain applicator 22a, 22b, 22 c.
Furthermore, the same threshold is used for each sub-band representation, or alternatively, a different threshold is obtained for each sub-band representation. In addition, the side-chain parameters and output renderer parameters described in connection with fig. 6 may be the same in all sub-band representations or may be defined separately for each sub-band representation.
It should be appreciated that although the multiple renderers of fig. 3 and the multiple frequency bands of fig. 4 are each described as separate audio processing systems, they may form part of the same system. For example, an audio processing system comprising two or more renderers 31 is contemplated, in which at least two signal level analyzers 32a, 32b, 32c operate on different subband representations of each presentation. In addition, it should be appreciated that the main chain 20 may include one or more delay units to introduce a delay compensating for any delay introduced by the side chain 30.
Fig. 5 depicts a variation of the audio processing system of fig. 1. The side chain 130 in fig. 5 includes the calculation and application of slow and/or fast gains. The slow gain varies relatively slowly over time, while the fast gain varies more quickly. Calculating and applying both a fast and a slow gain has proven to be an effective method for eliminating digital "overflow", which refers to, for example, signal levels above the maximum sample value that the digital system can represent.
For both the slow and the fast gain, the renderer(s) 131 receive the spatial audio objects 10 and render them into at least one audio presentation. The at least one rendered audio presentation is provided to a signal level analyzer, such as a min/max analyzer 132, which extracts the minimum or maximum signal level for each time period over all presentation audio channels. Alternatively, the min/max analyzer 132 may be replaced with an average signal analyzer that extracts the average signal level over all presentation channels, or the average signal level of the highest/lowest presentation channel in each rendered presentation.
In the following examples, the min/max analyzer 132 is assumed to be a peak analyzer configured to determine peak signal values p[n] over the presentation audio channels, which enables the audio processing system to perform peak limiting and/or downward compression of the spatial audio objects. However, the examples apply analogously to a min/max analyzer 132 configured to determine an average signal level over two or more presentation channels. Additionally or alternatively, the min/max analyzer 132 may be configured to determine the presentation channel associated with the lowest signal level min[n], which enables the audio processing system to perform, for example, upward compression (e.g., bottom limiting) or downward expansion, such as muting a time period when the minimum or average signal level is below a threshold level.
The peak analyzer determines the peak signal value p[n] for each time period, for example as the maximum over all presentation audio channels:

p[n] = max_j,k |s_j,k[n]|    (2)

To calculate the slow gain g_s[n], the peak signal value p[n] of each time period is provided to a control signal extractor 133, which is configured to extract a control signal c[n] for each time period given the peak signal value p[n] and the threshold T. In one implementation, the control signal extractor 133 calculates the control signal, for example, as:

c[n] = max(0, p[n]/T - 1)    (3)
This means that if no presentation channel exceeds the threshold T, the control signal c[n] will be zero. The slow gain calculator 135 uses the control signal c[n] to calculate a slow gain g_s[n] to be applied to the spatial audio objects 10 by the slow gain applicator 122a.
Optionally, the control signal extractor 133 is followed by a start/release processor 134 for modifying the control signal c[n] to maintain a predetermined attenuation/gain adjustment rate. The start/release processor 134 obtains an adjustment rate parameter indicating the maximum rate of change (i.e., derivative) of the applied attenuation/gain between two adjacent time periods and creates a modified control signal c′[n] such that the resulting attenuation/gain changes at most at the maximum rate of change indicated by the adjustment rate parameter.
In some embodiments, the adjustment rate parameter comprises at least a first and a second adjustment rate parameter, wherein the first adjustment rate parameter indicates a start time constant t_a and the second adjustment rate parameter indicates a release time constant t_r. From the start and release time constants t_a, t_r, a start coefficient α and a release coefficient β can be obtained, for example as

α = e^(-1/(t_a f_s)),   β = e^(-1/(t_r f_s))    (4), (5)

where f_s is the sampling rate of the rendered audio presentation and/or the spatial audio objects 10. Subsequently, the modified control signal c′[n] is calculated by the start/release processor 134, for example as:

c′[n] = α c′[n-1] + (1 - α) c[n]  if c[n] > c′[n-1],  and  c′[n] = β c′[n-1] + (1 - β) c[n]  otherwise    (6)
The slow gain calculator 135 now uses c′[n] from the start/release processor 134 to calculate the slow gain g_s[n], for example as

g_s[n] = 1/(1 + c′[n])    (7)
or, alternatively, with c′[n] replaced by c[n] if the optional start/release process at 134 is omitted. Further, it is noted that although the extraction of the control signal c[n] is convenient for describing the slow gain, it is not necessary to extract the control signal explicitly. From equation (3) it can be seen that there is a direct link between the peak level p[n] and the control signal c[n], which means that c[n] can always be replaced with a function depending on p[n].
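Gathering equations (2) to (7), the slow-gain side chain can be sketched as below; as noted, the reconstructed formulas, and hence this code, are assumptions based on the surrounding text:

```python
import numpy as np

def slow_gain(peaks, threshold, t_attack, t_release, fs):
    """Slow gain: control signal from peak levels, one-pole start/release
    smoothing, then g_s[n] = 1 / (1 + c'[n])."""
    alpha = np.exp(-1.0 / (t_attack * fs))        # start coefficient (eq. 4)
    beta  = np.exp(-1.0 / (t_release * fs))       # release coefficient (eq. 5)
    c = np.maximum(0.0, peaks / threshold - 1.0)  # control signal c[n] (eq. 3)
    c_mod = np.zeros_like(c)
    for n in range(1, len(c)):                    # start/release smoothing (eq. 6)
        k = alpha if c[n] > c_mod[n - 1] else beta
        c_mod[n] = k * c_mod[n - 1] + (1.0 - k) * c[n]
    return 1.0 / (1.0 + c_mod)                    # slow gain g_s[n] (eq. 7)
```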
The slow gain g_s[n] is provided to the slow gain applicator 122a, which applies the slow gain to the corresponding time period of the spatial audio objects 10. In some implementations, the slow gain calculator 135 obtains an adjustment control parameter ρ indicating to what extent the slow gain g_s[n] is to be applied. For example, the adjustment control parameter ρ lies in the interval 0 ≤ ρ ≤ 1 and can be fixed or set by the user (e.g., a mastering engineer). The slow gain calculator 135 calculates a partial slow gain g′_s[n] based on the control signal c[n] or c′[n] and the adjustment control parameter ρ, and provides the partial slow gain g′_s[n] to the slow gain applicator 122a of the main chain 120, which applies the partial slow gain g′_s[n] to the spatial audio objects 10. For example, the partial slow gain g′_s[n] is calculated as

g′_s[n] = 1/(1 + ρ c′[n])    (8)

or, alternatively, the partial slow gain g′_s[n] is calculated as:

g′_s[n] = (1 - ρ) + ρ g_s[n]    (9)

where c′[n] is replaced with c[n] if the start/release process at 134 is omitted.
In another embodiment, not shown, the start/release processor 134 operates on a slow gain g_s[n] or g′_s[n] that has been extracted without start/release processing. That is, instead of performing the start/release process on the control signal c[n], the start/release processor 134 is configured to perform the start/release process directly on the slow gain g_s[n] or g′_s[n].
The slow gain g_s[n] or the partial slow gain g′_s[n] is provided to the slow gain applicator 122a, which applies the slow gain g_s[n] or the partial slow gain g′_s[n] to each corresponding time period (and subband representation) of the spatial audio objects to form the dynamic range adjusted spatial audio objects x′_i[n].
In some embodiments, the calculation and application of the slow gain g_s[n] is accompanied by the calculation and application of a fast gain g_f[n]. Alternatively, only one of the fast gain g_f[n] and the slow gain g_s[n] is calculated and applied to each time period of the spatial audio objects. The fast gain g_f[n] is described in more detail below.
Once the slow gain g_s[n] (or the modified slow gain g′_s[n]) has been calculated by the slow gain calculator 135, the slow gain g_s[n] is provided, together with the threshold T and the peak signal level p[n], to a modified min/max calculator 136. The modified min/max calculator 136 calculates a modified peak level p′[n], for example by setting

p′[n] = p[n] g_s[n]    (10)

or by using g′_s[n] instead of g_s[n].
The modified peak level p′[n] is further processed by a look-ahead smoother 137, which computes a smoothed modified peak level p″[n], for example by convolving the modified peak level p′[n] with a smoothing kernel w[m]. Ideally, the elements of the smoothing kernel w[m] satisfy the unity-sum constraint:

1 = Σ_m w[m]    (11)
For example, w[m] = [0.25, 0.25, 0.25, 0.25]. The fast gain g_f[n] is then calculated from the smoothed modified peak level, for example as

g_f[n] = min(1, T / p″[n])    (12)
The fast gain g_f[n] is thereby provided to a fast gain applicator 122b, which applies the fast gain g_f[n] to the spatial audio objects already processed with the slow gain g_s[n] by the slow gain applicator 122a.
In some embodiments, the modified peak level p′[n] is stored in a first circular peak buffer b_1 of length M:

b_1[n % M] = p′[n]    (13)

where % denotes the integer modulo operator. A second circular buffer b_2 of length M stores the maximum peak level observed in the first circular peak buffer. Thus, the second circular peak buffer b_2 is obtained as

b_2[n % M] = max_m b_1[m]    (14)
The look-ahead smoother 137 may be configured to obtain the smoothed modified peak level p″[n] by convolving the smoothing kernel with the second circular buffer.
That is, the smoothed modified peak level p″[n] is obtained as

p″[n] = Σ_m w[m] b_2[(n - m) % M]    (15)
And is provided to a fast gain calculator 138, the fast gain calculator 138 calculating a fast gain g according to the above-mentioned 12 f [n]And will gain g fast f [n]To the fast gain applicator 122b.
The look-ahead amount and/or the lengths of the circular buffers b_1, b_2 may be set by the user as side-chain parameters. Similarly, the length, look-ahead amount and/or individual element values of the smoothing kernel w[m] may be set by the user as side-chain parameters to obtain the desired dynamic range adjusted spatial audio objects x′_i[n].
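A sketch of the fast-gain path of equations (11) to (15); a sliding-window maximum stands in for the two circular buffers, and the reconstructed formulas above are assumptions:

```python
import numpy as np

def fast_gain(p_mod, threshold, M=64, w=None):
    """Fast gain from the modified peak level p'[n]: running maximum over the
    last M periods (the b_1/b_2 buffers), smoothing with kernel w, then
    g_f[n] = min(1, T / p''[n])."""
    if w is None:
        w = np.full(4, 0.25)                  # unity-sum smoothing kernel (eq. 11)
    padded = np.concatenate([np.full(M - 1, p_mod[0]), p_mod])
    run_max = np.array([padded[i:i + M].max() for i in range(len(p_mod))])
    p_smooth = np.convolve(run_max, w, mode='full')[:len(p_mod)]     # eq. (15)
    return np.minimum(1.0, threshold / np.maximum(p_smooth, 1e-12))  # eq. (12)
```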
Also depicted in fig. 5 are two delay units 121a, 121b of the main chain 120, configured to introduce respective delays into the spatial audio objects 10 such that the fast gain g_f[n] and the slow gain g_s[n] are applied to the corresponding time periods. An initial delay of K time periods (e.g., K samples) is applied to the spatial audio objects 10 by the first delay unit 121a to compensate for any rendering delay or look-ahead introduced by the renderer(s) 131, the min/max analyzer 132, the control signal extractor 133, the start/release processor 134 and the slow gain calculator 135. Similarly, the second delay unit 121b applies a second delay of M time periods (e.g., M samples) to compensate for any look-ahead or delay introduced by the modified min/max calculator 136, the look-ahead smoother 137 and the fast gain calculator 138. The delays K and M introduced by the delay units 121a, 121b are typically in the range of tens to thousands of time periods (samples). For example, depending on the type of presentation(s) output by the renderer(s) 131 as described above, the delay K introduced by the first delay unit 121a is between tens and thousands of time periods (samples). The delay M introduced by the second delay unit 121b is typically about 1 to 5 milliseconds, mainly due to the amount of look-ahead in the look-ahead smoother 137. For example, for a 1 millisecond look-ahead with a 32 kHz sampled audio channel, the delay M is 32 time periods (samples), while for a 5 millisecond look-ahead with a 192 kHz sampled audio channel, the delay M is approximately 1000 time periods (samples).
In one particular implementation, the renderer(s) 131 are Object Audio Renderers (OAR) employing lightweight preprocessing, and a delay of K = 512 time periods (samples) is used together with a fast gain delay of M = 64. If the lightweight preprocessing is replaced with spatial coding, the delay K may be increased to, e.g., 1536; however, it is contemplated that for different and/or future preprocessing schemes and OAR rendering techniques the delay K may be reduced below 1536, even to near or at a delay of zero time periods (samples). Thus, the dynamic range adjusted spatial audio objects x′_i[n] can be obtained as

x′_i[n] = x_i[n-M-K] g_f[n-K] g_s[n-M-K]    (16)

or, alternatively, with g_s[n-M-K] replaced by g′_s[n-M-K].
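A sketch of equation (16), assuming per-sample gains; clipping out-of-range indices (repeating the first sample) is an illustrative boundary choice:

```python
import numpy as np

def apply_gains(objects, g_f, g_s, K=512, M=64):
    """x'_i[n] = x_i[n-M-K] * g_f[n-K] * g_s[n-M-K] (equation (16))."""
    n = np.arange(objects.shape[1])
    xd = np.take(objects, n - M - K, axis=1, mode='clip')  # delayed objects
    gf = np.take(g_f, n - K, mode='clip')                  # fast gain, delayed by K
    gs = np.take(g_s, n - M - K, mode='clip')              # slow gain, delayed by M+K
    return xd * gf * gs
```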
Fig. 6 shows a user 70, such as a mastering or mixing engineer, mastering the spatial audio objects 10 using the audio processing system described above. The delay unit(s) 21 and the attenuation/gain applicator 22 form the main chain 20, which involves applying one or more of the fast gain g_f[n] and the slow gain g_s[n] in one or more subband representations, as described above. Similarly, the side chain 30 is any one of the different side chain implementations described above.
When mastering the spatial audio objects 10, the user 70 may set or adjust side-chain parameters 72, which include one or more of: a threshold T (which may be a single value, or set for each subband representation or each rendered presentation in the side chain), an adjustment rate (maximum rate of change, or start/release times t_a, t_r), the adjustment control parameter ρ, the number of renderers in the side chain 30, the types of renderers in the side chain 30, the number and/or frequency ranges (cut-off frequencies, bandwidths) of the subband representations in the side chain 30, and the look-ahead amount, e.g., in the look-ahead smoother 137. Although the main chain 20 operates with some delay introduced by the delay unit(s) 21, any changes made by the user 70 to the side-chain parameters 72 will introduce corresponding changes in the dynamic range adjusted spatial audio objects x′_i[n] output by the main chain 20. The dynamic range adjusted spatial audio objects x′_i[n] are rendered by the output renderer 60 into one or more selected audio presentations (e.g., a stereo presentation and/or a 5.1 presentation) to which the user 70 listens. Thus, the user 70 can adjust the side-chain parameters 72 and quickly hear the result of the adjustment, which facilitates obtaining the desired result (i.e., the mastered spatial audio objects). In some implementations, the output renderer 60 renders the dynamic range adjusted spatial audio objects x′_i[n] into two or more presentations in parallel, allowing the user 70 to quickly switch between different rendered presentations while adjusting the side-chain parameters 72. To this end, the user may adjust output renderer parameters that affect the number and type of output renderers (and which presentation is currently provided to the user 70 for listening).
The renderer(s) in the side chain 30 and their respective output presentations may be set based on different criteria highlighted below.
The renderer(s) in the side chain 30 and their output presentation format(s) may be set by input of the user 70.
The renderer(s) in the side chain 30 and their output presentation format(s) may be selected to encompass one or more presentations that are expected to be the most common presentations for consumption of the content of the spatial audio objects 10. For example, if the content is music, the renderer(s) in the side chain 30 are configured to render a stereo presentation, and if the content is the soundtrack of a movie, the renderer(s) in the side chain 30 are configured to render a stereo presentation and a 5.1 presentation.
The renderer(s) in the side chain 30 and their output presentation format(s) may be selected to represent the worst case in terms of digital overflow risk. For example, the presentation format(s) with the highest peak level is selected among two or more alternative presentation formats.
The renderer(s) in the side chain 30 and their output presentation format(s) may be selected to represent all or substantially all of the possible renderers and presentation format(s) to be used in content consumption. The dynamic range adjusted spatial audio objects x′_i[n] then ensure that no presentation of the spatial audio objects overflows at all.
The renderer(s) in the side chain 30 and their output presentation format(s) may be selected based on the sound characteristics that the rendering imparts to the dynamic range adjusted spatial audio objects x′_i[n] output by the main chain 20, which may be apparent from the presentation output by the output renderer 60. The sound characteristics include at least one of: perceived impact, sharpness, loudness, harmonic distortion or saturation, intermodulation distortion, transient squashing or enhancement, or dynamics enhancement. For example, the user 70 cycles through the various presentation formats in the side chain 30 to determine which presentation format provides the best basis for analyzing the modification of sound characteristics introduced by the attenuation/gain of the side chain 30.
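A minimal sketch of the worst-case criterion, choosing per time block the highest peak level over all rendered presentations, might look as follows; the function name, block structure, and peak measure are assumptions of this sketch.

```python
import numpy as np

def worst_case_peak_per_block(presentations, block_len):
    """For each time block, return the highest peak level over all rendered
    presentations and all of their channels (the digital-overflow worst case).

    presentations : list of (num_channels, num_samples) arrays, one per
                    rendered presentation (e.g. stereo, 5.1)
    """
    n = min(p.shape[1] for p in presentations)
    num_blocks = n // block_len
    peaks = np.zeros(num_blocks)
    for b in range(num_blocks):
        s = slice(b * block_len, (b + 1) * block_len)
        peaks[b] = max(np.max(np.abs(p[:, s])) for p in presentations)
    return peaks
```

Driving the gain calculation from these worst-case peaks ensures that no alternative presentation overflows, at the cost of potentially over-attenuating presentations whose own peaks are lower.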
Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present disclosure, discussions utilizing terms such as "processing," "computing," "calculating," "determining," "analyzing," or the like, refer to the action and/or processes of a computer hardware or computing system, or similar electronic computing device, that manipulates and/or transforms data represented as physical quantities (e.g., electronic quantities) into other data similarly represented as physical quantities.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention. Furthermore, while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are within the scope of the invention and form different embodiments, as will be appreciated by those of skill in the art. For example, in the appended claims, any of the claimed embodiments may be used in any combination.
Furthermore, some embodiments are described herein as a method or combination of elements of a method that may be implemented by a processor of a computer system or by other means of performing a function. Thus, a processor having the necessary instructions for performing such a method or element of a method forms a means for performing the method or element of a method. Note that when the method includes a plurality of elements (e.g., a plurality of steps), the order of the elements is not implied unless specifically stated. Furthermore, the elements of the apparatus embodiments described herein are examples of means for performing the functions performed by the elements for the purpose of carrying out the invention. In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Thus, while particular embodiments of the present invention have been described, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as falling within the scope of the invention. For example, the determination and application of the fast gain g_f[n] and the slow gain g_s[n] described in connection with fig. 5, and the different alternatives thereof, may be performed in parallel for two or more subband representations (as described above in connection with fig. 4) and/or across rendered audio channels from two or more rendered presentations (as described above in connection with fig. 3). In addition, the min/max analyzer 132 of fig. 5 may also be included in the signal level analyzers 32, 32a, 32b, 32c of figs. 1, 3, and 4. Similarly, the control signal extractor 133, the attack/release processor 134, and the slow gain calculator 135 of fig. 5 may also be included in the attenuation/gain calculators 33, 33a, 33b, 33c of figs. 1, 3, and 4.
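As a hedged sketch of the parallel subband variant mentioned above (the disclosure does not specify the filterbank or per-band gain rule; the Butterworth crossover and simple limiting gain here are placeholder choices):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def per_band_limited(x, fs, crossover_hz=1000.0, threshold=0.5):
    """Split a channel into low/high bands, derive a simple limiting gain
    per band, and recombine. Illustrative only."""
    sos_lo = butter(4, crossover_hz, btype="low", fs=fs, output="sos")
    sos_hi = butter(4, crossover_hz, btype="high", fs=fs, output="sos")
    out = np.zeros_like(x)
    for sos in (sos_lo, sos_hi):
        band = sosfilt(sos, x)
        peak = np.max(np.abs(band))
        gain = min(1.0, threshold / peak) if peak > 0 else 1.0  # per-band gain
        out += band * gain
    return out
```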
Various features and aspects may be appreciated from the following enumerated example embodiments ("EEEs"):
EEE1. A method for dynamically changing the level of one or more object-based audio signals of an object-based input audio asset, wherein the method comprises: receiving an object-based input audio asset; rendering the object-based input audio asset into one or more presentations using one or more audio renderers; determining one or more metrics of signal levels of the one or more presentations; calculating a gain or attenuation in response to the one or more signal level metrics; and applying the calculated gain or attenuation to at least one of the one or more object-based audio signals to produce an object-based output audio asset.
EEE2. The method of EEE1, wherein rendering the object-based input audio asset into one or more presentations includes generating one or more speaker or headphone presentations.
EEE3. The method of EEE1 or 2, wherein determining one or more metrics of the signal level comprises detecting a peak signal level or an average signal level.
EEE4. The method of any one of EEEs 1-3, wherein the attenuation is based on a control signal determined from one or more measured signal levels.
EEE5. The method of any of EEEs 1-4, wherein the calculated gain or attenuation is configured to reduce a peak level in one or more rendered presentations.
EEE6. The method of any of EEEs 1-5, wherein the calculated gain or attenuation is based on a desired input-output curve.
EEE7. The method of any of EEEs 1-6, further comprising modifying in real time one or more parameters used for rendering the object-based input audio asset, for determining the one or more metrics of signal level, and/or for calculating the gain or attenuation, while listening in real time to the object-based output audio asset.
EEE8. The method of EEE7, when dependent on EEE4, further comprising modifying one or more parameters for calculating the control signal.
EEE9. The method of any of EEEs 1-7, wherein rendering the object-based input audio asset into one or more presentations using one or more audio renderers comprises converting the object-based input audio asset to one or more presentations in a frequency-invariant manner.
EEE10. The method of EEE9, wherein the conversion is applied in two or more frequency bands of the object-based input audio asset.
EEE11. The method of any of EEEs 1-10, wherein calculating the gain or attenuation in response to the one or more signal level metrics is based on at least one control parameter comprising at least one of an attack time constant, a release time constant, a maximum amplitude, a threshold, or a proportion of the gain or attenuation to be applied.
EEE12. The method of any of EEEs 1-11, wherein calculating the gain or attenuation in response to the one or more signal level metrics comprises calculating a fast gain and a slow gain.
EEE13. The method of EEE12, wherein calculating the fast gain and/or the slow gain is based on at least one control parameter comprising at least one of an attack time constant, a release time constant, a maximum amplitude, a threshold, or a proportion of the gain or attenuation to be applied.
EEE14. The method of any of EEEs 1-13, wherein the one or more audio renderers and one or more respective output presentation formats of the one or more audio renderers are configured to be selected based on criteria including at least one of: (a) end-user input, (b) end-user preference, (c) the likelihood of an audience consuming one or more presentations, (d) the worst case of expected peak levels over two or more alternatives, (e) running multiple ones of the one or more audio renderers and/or the one or more corresponding output presentation formats in parallel to ensure that none of the corresponding output presentations has a peak level above a threshold, or (f) end-user selection among multiple options to obtain a particular sound characteristic.
EEE15. The method of EEE14, wherein the multiple options include at least one of: a specific perceived impact, sharpness, loudness, harmonic distortion or saturation, intermodulation distortion, transient squashing, or dynamics enhancement.
EEE16. A system for dynamically changing the level of one or more object-based audio signals of an object-based input audio asset, wherein the system comprises: one or more renderers configured to receive an object-based input audio asset and render the object-based input audio asset into one or more presentations; a peak analyzer configured to determine one or more metrics of signal levels of the one or more presentations; and a gain analyzer configured to calculate a gain or attenuation in response to the one or more signal level metrics; wherein the calculated gain or attenuation is applied to at least one of the one or more object-based audio signals to produce an object-based output audio asset.
EEE17. The system of EEE16, further comprising a delay unit configured to compensate for one or more delays introduced by the one or more renderers.
EEE18. The system of EEE17, wherein the one or more renderers comprise at least two renderers operating in parallel.
EEE19. The system of EEE18, wherein the peak analyzer is further configured to calculate a control signal derived from the outputs of the at least two renderers operating in parallel.
EEE20. The system of EEE19, wherein the gain analyzer is configured to calculate the gain or attenuation in response to the one or more signal level metrics based on the calculated control signal.
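As a non-normative illustration of EEEs 11-13, a compressor-style gain computation with separate fast and slow smoothing paths might be sketched as follows; the one-pole smoothing, the parameter values, and the split into two identical paths with different time constants are generic dynamics-processing assumptions, not taken from the disclosure.

```python
import numpy as np

def fast_slow_gains(level_db, threshold_db, ratio, fs,
                    attack_ms_fast=1.0, release_ms_fast=20.0,
                    attack_ms_slow=50.0, release_ms_slow=500.0):
    """Compute fast and slow gain curves (in dB) from a per-sample level measure."""
    # static over-threshold attenuation, scaled by the proportion 'ratio'
    target_db = np.minimum(0.0, (threshold_db - level_db) * ratio)

    def smooth(target, attack_ms, release_ms):
        a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
        a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
        g, prev = np.zeros_like(target), 0.0
        for n, t in enumerate(target):
            a = a_att if t < prev else a_rel  # attenuate fast, recover slowly
            prev = a * prev + (1.0 - a) * t
            g[n] = prev
        return g

    g_fast = smooth(target_db, attack_ms_fast, release_ms_fast)
    g_slow = smooth(target_db, attack_ms_slow, release_ms_slow)
    return g_fast, g_slow
```

The returned curves are in dB; the corresponding linear gains are 10**(g/20).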

Claims (20)

1. A method for performing dynamic range adjustment of a spatial audio object (10), the method comprising:
obtaining (step S1) a plurality of spatial audio objects (10);
obtaining (step S2) at least one rendered audio presentation of the spatial audio object (10), the at least one rendered audio presentation comprising at least one rendered audio channel forming a set of rendered audio channels;
determining (step S3) signal level data associated with each presentation audio channel of the set of presentation audio channels, wherein the signal level data represents signal levels of a plurality of time periods of the presentation audio channel;
obtaining (step S31) a threshold value;
for each time period:
selecting (step S4) a selected presentation audio channel, wherein the selected presentation audio channel is a presentation audio channel of the set of presentation audio channels that is associated with a highest signal level or a lowest signal level of the time period compared to other presentation audio channels of the set of presentation audio channels, and
determining (step S5) a gain based on the threshold and a signal level representation of the selected presentation audio channel; and
applying (step S6) the gain for each time period to the corresponding time period of each spatial audio object to form dynamic range adjusted spatial audio objects.
2. The method of claim 1, further comprising:
obtaining an adjustment proportionality coefficient; and wherein determining the gain for each time period comprises:
determining a threshold difference, the threshold difference being the difference between the threshold and the signal level representation of the selected presentation audio channel; and
determining the gain based on the threshold difference and the adjustment proportionality coefficient.
3. The method of claim 1, wherein the gain attenuates the signal level of the selected presentation channel to the threshold, or wherein the gain amplifies the signal level of the selected presentation channel to the threshold.
4. A method according to claim 3, further comprising:
obtaining an adjustment control parameter, wherein the adjustment control parameter indicates a scaling factor of the gain; and
the scaling factor is applied to the gain.
5. The method of any of the preceding claims, wherein the signal level data for each time period comprises a signal level representation of a plurality of frequency bands of the presentation audio channel, the method further comprising:
selecting, for each time period and frequency band, a presentation audio channel of the set of presentation audio channels;
determining a gain for each time period and frequency band, the gain for each frequency band being based on the threshold and the signal level representation of the selected presentation audio channel for that time period and frequency band; and
applying the gain for each frequency band and time period to the corresponding time period and frequency band of each spatial audio object to form dynamic range adjusted spatial audio objects.
6. The method of any of the preceding claims, wherein each rendered audio presentation comprises at least two presentation audio channels.
7. The method of any of the preceding claims, wherein at least two rendered presentations are obtained, wherein each rendered audio presentation comprises at least one presentation audio channel.
8. The method of claim 7, wherein the gain is further based on a signal level representation of a second selected presentation audio channel, wherein the second selected presentation audio channel belongs to a second rendered presentation different from the rendered presentation of the selected presentation audio channel.
9. The method of claim 8, further comprising:
obtaining a second threshold for each of the at least two rendered presentations;
wherein the gain is further based on a combination of:
the signal level representation of the selected presentation audio channel and the threshold, and
the signal level representation of the second selected presentation audio channel and the second threshold.
10. The method of any of the preceding claims, further comprising:
obtaining an adjustment rate parameter indicating a maximum rate of change of the gain between two adjacent time periods, and
wherein the gain is further based on the adjustment rate parameter such that the gain varies at most at the maximum rate of change indicated by the adjustment rate parameter.
11. The method of claim 10, wherein the adjustment rate parameter is at least a first adjustment rate parameter and a second adjustment rate parameter,
wherein the first adjustment rate parameter indicates an attack time constant,
wherein the second adjustment rate parameter indicates a release time constant, and
wherein the gain is further based on the attack time constant and the release time constant such that the gain varies at most at the maximum rates of change indicated by the attack time constant and the release time constant, respectively.
12. The method of any of the preceding claims, further comprising:
determining a modified signal level representation for each time period, wherein the modified signal level representation is based on the signal level representation of the selected presentation audio channel with the gain applied;
determining a smoothed modified signal level representation for each time period by convolving the modified signal level representation for each time period with a smoothing kernel;
calculating a smoothing gain based on the smoothed modified signal level representation for each time period; and
applying the smoothing gain for each time period to the corresponding time period of each dynamic range adjusted spatial audio object to form enhanced dynamic range adjusted spatial audio objects.
13. The method of claim 12, further comprising:
storing the modified signal level representation of the successive time periods in a first circular buffer of length M;
storing the maximum or minimum modified signal level representation of the first circular buffer in a second circular buffer of length M;
wherein determining the smoothed modified signal level representation for each time period includes convolving the second circular buffer with a smoothing kernel.
14. The method of any of the preceding claims, wherein the signal level representation of each time period of each presentation audio channel is selected from the group comprising:
an RMS representation of the signal level of the time period,
an amplitude of the time period,
a maximum amplitude of the time period,
an average amplitude of the time period, and
a minimum amplitude of the time period.
15. The method of any of the preceding claims, wherein the at least one rendered presentation is selected from the group comprising:
a mono presentation,
a stereo presentation,
a binaural presentation,
a 5.1 presentation,
a 7.1 presentation,
a 5.1.2 presentation,
a 5.1.4 presentation,
a 7.1.2 presentation,
a 7.1.4 presentation,
a 9.1.2 presentation,
a 9.1.4 presentation,
a 9.1.6 presentation, and
a multi-channel presentation with at least three height levels, such as a 22.2 presentation.
16. An audio processing system for dynamic range adjustment, comprising:
at least one renderer (31, 31a, 31b, 31c) configured to obtain a plurality of spatial audio objects (10) and render the spatial audio objects into a rendered presentation comprising at least one presentation audio channel, the presentation audio channels forming a set of presentation audio channels;
a signal level analysis unit (32, 32a, 32b, 32 c) configured to determine signal level data associated with each presentation audio channel of the set of presentation audio channels, wherein the signal level data represents signal levels of a plurality of time periods of the presentation audio channels; and
a gain calculator (33, 33a, 33b, 33c) configured to:
obtaining a threshold value;
selecting a presentation audio channel, wherein the selected presentation audio channel is a presentation audio channel of the set of presentation audio channels that is associated with a highest signal level or a lowest signal level of the time period compared to other presentation audio channels of the set of presentation audio channels, and
determining a gain for each time period, the gain being based on the threshold and a signal level representation of the selected presentation audio channel; and
a gain applicator (22, 22a, 22b, 22c) configured to apply the gain for each time period to a corresponding time period of each spatial audio object to form dynamic range adjusted spatial audio objects.
17. The audio processing system of claim 16, further comprising:
a delay unit (21) configured to obtain the plurality of spatial audio objects (10) and to generate delayed spatial audio objects corresponding to the spatial audio objects, wherein the delay introduced by the delay unit corresponds to the delay introduced by the at least one renderer (31, 31a, 31b, 31c); and
wherein the gain applicator (22, 22a, 22b, 22 c) is configured to apply the gain for each time period to a corresponding time period of each delayed spatial audio object to form a dynamic range adjusted spatial audio object.
18. The audio processing system of claim 16 or 17, wherein each rendered presentation comprises at least two presentation audio channels.
19. The audio processing system of any one of claims 16 to 18, comprising at least two renderers (31, 31a, 31b, 31c), wherein each renderer (31, 31a, 31b, 31c) is configured to obtain the plurality of spatial audio objects (10) and render the spatial audio objects into a respective rendered presentation, each rendered presentation comprising at least one presentation audio channel forming the set of presentation audio channels.
20. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to perform the steps of the method according to any one of claims 1 to 15.
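To make the claimed method concrete, a minimal end-to-end sketch of claim 1 is given below; the block-based peak measure, the placeholder stereo downmix renderer, and all names are illustrative assumptions rather than the disclosed implementation.

```python
import numpy as np

def dynamic_range_adjust(objects, render, threshold, block_len):
    """Render (S2), analyze per-block levels (S3), take the loudest
    presentation channel (S4), derive a gain toward the threshold (S5),
    and apply it to the objects (S6)."""
    channels = render(objects)          # rendered audio presentation
    n = channels.shape[1]
    out = objects.copy()
    for start in range(0, n - block_len + 1, block_len):
        s = slice(start, start + block_len)
        levels = np.max(np.abs(channels[:, s]), axis=1)  # per-channel peak level
        peak = levels.max()                              # loudest channel this block
        gain = min(1.0, threshold / peak) if peak > 0 else 1.0
        out[:, s] *= gain
    return out

# usage with a placeholder equal-gain stereo downmix as the renderer
objs = np.random.randn(4, 48000) * 0.5
stereo = lambda x: np.vstack([x.sum(axis=0) * 0.707, x.sum(axis=0) * 0.707])
adjusted = dynamic_range_adjust(objs, stereo, threshold=0.89, block_len=512)  # 0.89 ~ -1 dBFS
```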
CN202280031384.6A 2021-05-28 2022-03-24 Dynamic range adjustment of spatial audio objects Pending CN117223057A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202163194359P 2021-05-28 2021-05-28
US63/194,359 2021-05-28
PCT/US2022/021696 WO2022250772A1 (en) 2021-05-28 2022-03-24 Dynamic range adjustment of spatial audio objects

Publications (1)

Publication Number Publication Date
CN117223057A 2023-12-12

Family

ID=81308360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280031384.6A Pending CN117223057A (en) 2021-05-28 2022-03-24 Dynamic range adjustment of spatial audio objects

Country Status (7)

Country Link
US (1) US20240163529A1 (en)
EP (1) EP4348643A1 (en)
JP (1) JP2024520005A (en)
KR (1) KR20240014462A (en)
CN (1) CN117223057A (en)
BR (1) BR112023021544A2 (en)
WO (1) WO2022250772A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BR112013033386B1 (en) * 2011-07-01 2021-05-04 Dolby Laboratories Licensing Corporation system and method for adaptive audio signal generation, encoding, and rendering
EP3048609A4 (en) * 2013-09-19 2017-05-03 Sony Corporation Encoding device and method, decoding device and method, and program
US20220295207A1 (en) 2019-07-09 2022-09-15 Dolby Laboratories Licensing Corporation Presentation independent mastering of audio content

Also Published As

Publication number Publication date
EP4348643A1 (en) 2024-04-10
WO2022250772A1 (en) 2022-12-01
KR20240014462A (en) 2024-02-01
JP2024520005A (en) 2024-05-21
BR112023021544A2 (en) 2023-12-19
US20240163529A1 (en) 2024-05-16

Legal Events

Date Code Title Description
PB01 Publication