WO2017207465A1 - A method converting multichannel audio content into object-based audio content and a method for processing audio content having a spatial position - Google Patents

Publication number
WO2017207465A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
spatial position
channels
audio object
value
Prior art date
Application number
PCT/EP2017/062848
Other languages
English (en)
French (fr)
Inventor
Giulio Cengarle
Antonio MATEOS SOLÉ
Original Assignee
Dolby International Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby International Ab filed Critical Dolby International Ab
Priority to CN201780033796.2A priority Critical patent/CN109219847B/zh
Priority to US16/303,415 priority patent/US10863297B2/en
Priority to CN202310838307.8A priority patent/CN116709161A/zh
Priority to EP17726613.7A priority patent/EP3465678B1/en
Publication of WO2017207465A1 publication Critical patent/WO2017207465A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Definitions

  • This disclosure falls into the field of object-based audio content, and more specifically relates to the conversion of multichannel audio content into object-based audio content.
  • This disclosure further relates to a method for processing a time frame of audio content having a spatial position.
  • audio content in a multichannel format (stereo, 5.1, 7.1, etc.) is created by mixing different audio signals in a studio, or generated by recording acoustic signals simultaneously in a real environment.
  • the mixed audio signal or content may include a number of different sources.
  • Source separation is the task of identifying information about each of the sources in order to reconstruct the audio content, for example as a mono signal and metadata including spatial information, spectral information, and the like.
  • By providing tools for transforming legacy audio content, i.e. 5.1 or 7.1 content, to object-based audio content, more movie titles may take advantage of the new ways of rendering audio.
  • Such tools extract audio objects from the legacy audio content by applying source separation to the legacy audio content.
  • figure 1a shows a first example of object extraction from a multichannel audio signal with channels in a first configuration, and rendering of the extracted audio object back to a multichannel audio signal with channels in the first configuration
  • figure 1b shows a second example of object extraction from a multichannel audio signal with channels in a first configuration, and rendering of the extracted audio object back to a multichannel audio signal with channels in the first
  • figure 2 shows a device for converting a time frame of a multichannel audio signal into output audio content comprising audio objects, metadata comprising a spatial position for each audio object, and bed channels, according to embodiments of the disclosure
  • FIGS 3a-b show by way of example an embodiment of the risk estimation stage of the device of figure 2,
  • figure 3c shows a function used by the risk estimation stage of figure 3, for determining a fraction of an extracted object to include in the output audio object content
  • figure 4 shows by way of example an embodiment of the risk estimation stage of the device of figure 2
  • figure 5 shows by way of example an embodiment of an artistic preservation stage of the device of any one of figures 2-4,
  • figure 6 shows by way of example an embodiment of an artistic preservation stage of the device of any one of figures 2-4,
  • FIGS. 7-10 show a method for spreading objects positioned on screen to map them to an arch encompassing the screen, according to embodiments of the disclosure
  • FIGS. 11-13 show a method for boosting subtle audio objects and bed channels which are positioned out of screen
  • figures 14-15 show a method for increasing the z-coordinate of audio objects positioned in the rear part of a room
  • figure 16 shows a method for converting a time frame of a multichannel audio signal into output audio content comprising audio objects according to embodiments of the disclosure
  • figure 17 shows by way of example a coordinate system used in the present disclosure
  • figure 18 shows by way of example a device for processing a time frame of an audio object, according to embodiments of the present disclosure.
  • example embodiments propose methods for converting a time frame of a multichannel audio signal into output audio content comprising audio objects, devices implementing the methods, and computer program products adapted to carry out the methods.
  • the proposed methods, devices and computer program products may generally have the same features and advantages.
  • f) upon determining that the risk does not exceed the threshold include the audio object and metadata comprising the spatial position of the audio object in the output audio content (e.g., output audio object content).
  • the method may further comprise, upon determining that the risk exceeds the threshold, rendering at least a fraction (e.g., non-zero fraction) of the audio object to the bed channels.
  • the method may further comprise, upon determining that the risk exceeds the threshold, processing the audio object and the metadata comprising the spatial position of the audio object to preserve artistic intention (e.g., by providing said audio object and said metadata to an artistic preservation stage).
  • the multichannel audio signal may be configured as a 5.1-channel set-up or a 7.1-channel set-up, which means that each channel has a predetermined position pertaining to a loudspeaker setup for this configuration.
  • the predetermined position is defined in a predetermined coordinate system, i.e. a 3d coordinate system having an x component, a y component and a z component.
  • a bed channel generally means an audio signal which corresponds to a fixed position in the three-dimensional space (predetermined coordinate system), always equal to the position of one of the output speakers of the corresponding canonical loudspeaker setup.
  • a bed channel may therefore be associated with a label which merely indicates the predetermined position of the corresponding output speaker in a canonical loudspeaker layout.
  • the extraction of objects may be realized e.g. by the Joint Object Source Separation (JOSS) algorithm developed by Dolby Laboratories, Inc.
  • such extraction may comprise performing an analysis on the audio content (e.g., using JOSS)
  • PCA Principal Component Analysis
  • the inventors have realized that when transforming legacy audio content, i.e. channel-based audio content, to audio content comprising audio objects, which later may be rendered back to a legacy loudspeaker setup, i.e. a 5.1-channel set-up or a 7.1-channel set-up, the audio object, or the audio content of the audio object, may be rendered in different channels compared to what was initially intended by the mixer of the multichannel audio signal. This is thus a clear violation of what was intended by the mixer, and may in many cases lead to a worse listening experience.
  • Such estimation is advantageously done based on the estimated spatial position of the audio object, since specific areas or positions in the three-dimensional space often mean an increased (or decreased) risk of faulty rendering.
  • estimating a risk should, in the context of present specification, be understood that this could result in for example a binary value (0 for no risk, 1 for risk) or a value on a continuous scale (e.g., from 0-1 or from 0-10 etc.).
  • the step of "determining whether the risk exceeds a threshold" may mean that it is checked if the risk is 0 or 1, and if it is 1, the risk exceeds the threshold.
  • the threshold may be any value in the continuous scale depending on the implementation.
  • the number of audio objects to extract may be user defined, or predefined, and may be 1, 2, 3 or any other number.
  • the step of estimating a risk comprises the step of: comparing the spatial position of the audio object to a predetermined area.
  • the risk is determined to exceed the threshold if the spatial position is within the predetermined area.
  • an audio object positioned in an area along or near a wall i.e., an outer bounds in the three-dimensional space of the predetermined coordinate system
  • areas along or near a wall which comprise more than two predetermined positions for channels in the multichannel audio signal may be such a predetermined area.
  • the predetermined area may include the predetermined positions of at least some of the plurality of channels in the first configuration.
  • every audio object with its spatial position within this predetermined area may be labeled as a risky audio object for faulty rendering, and thus not directly included, with its corresponding metadata, as is in the output audio content.
  • the first configuration corresponds to a 5.1-channel set-up or a 7.1-channel set-up, wherein the predetermined area includes the predetermined positions of a front left channel, a front right channel, and a center channel in the first configuration.
  • An area close to the screen may thus be an example of a risky area.
  • an audio object positioned on top of the center channel may originate by 50% from the front left channel and by 50% from the front right channel in the multichannel audio signal, or by 50% from the center channel, by 25% from the front left channel and by 25% from the front right channel in the multichannel audio signal etc.
  • the predetermined positions of the front left, front right and center channels share a common value of a given coordinate (e.g., y-coordinate value) in the predefined coordinate system, wherein the predetermined area includes positions having a coordinate value of the given coordinate (e.g., y-coordinate value) up to a threshold distance away from said common value of the given coordinate (e.g., y-coordinate).
  • the front left, front right and center channels could share another common coordinate value, such as an x-coordinate value or a z-coordinate value, in case the predetermined coordinate system is e.g. rotated or similar.
  • the predetermined area may thus stretch a bit away from the screen area.
  • the predetermined area may stretch a bit away from the common plane in the three-dimensional space on which the front left, front right and center channels will be rendered in a 5.1-channel loudspeaker setup or a 7.1-channel loudspeaker setup.
  • audio objects with spatial positions within this predetermined area may be handled differently based on how far away from the common plane their positions lie.
  • audio objects outside the predetermined area will in any case be included as is in the output audio content along with their respective metadata comprising the spatial position of the respective audio object.
  • the predetermined area comprises a first sub area
  • the method further comprises the step of:
  • the fraction value may be smaller than one if the risk is determined to exceed the threshold (e.g., in case the spatial position is within the predetermined area). Further, the fraction value may be zero if the spatial position is within the first sub area.
  • the method further comprises:
  • the determination of the fraction value is only made in case the risk is determined to exceed the threshold (e.g., in case the spatial position is within the predetermined area). According to other embodiments, in case the spatial position is not within the predetermined area, the fraction value will be 1.
  • the fraction value is determined to be 0 if the spatial position is in the first sub area, is determined to be 1 if the spatial position is not in the predetermined area, and is determined to be between 0 and 1 if the spatial position is in the predetermined area but not in the first sub area.
  • the first sub area may for example correspond to the common plane in the three-dimensional space on which the front left, front right and center channels will be rendered in a 5.1-channel loudspeaker setup or a 7.1-channel loudspeaker setup.
  • This means that audio objects extracted in the screen will be muted (not included in the output audio object content), objects far from the screen will be unchanged (included as is in the output audio object content), and objects in the transition zone will be attenuated according to the value of the fraction value or according to a value depending on the fraction value, such as the square root of the fraction value.
  • the latter may be used to follow a different normalization scheme, e.g. preserving energy sum of object/channel fractions instead of preserving amplitude sum of
  • the remainder of the audio object, i.e., the audio object multiplied by 1 minus the fraction value, may be rendered to the bed channels.
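
The fraction-value logic described above can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the linear ramp across the transition zone, the boundary values `sub_area_end` and `area_end`, the normalized y coordinate (0 at the screen plane), and the square-root gain for the bed remainder in the energy-preserving case are all assumptions made for the example.

```python
import math


def fraction_value(y, sub_area_end=0.0, area_end=0.25):
    """Fraction of an extracted object that is kept as an object.

    y is the front-to-back coordinate, assumed normalized with 0 at the
    screen plane. sub_area_end and area_end are hypothetical boundaries of
    the first sub area and of the predetermined area.
    """
    if y <= sub_area_end:      # inside the first sub area: object is muted
        return 0.0
    if y >= area_end:          # outside the predetermined area: kept as is
        return 1.0
    # Transition zone: linear ramp (an assumption; other shapes are possible).
    return (y - sub_area_end) / (area_end - sub_area_end)


def split_object(samples, fraction, preserve_energy=False):
    """Split object samples into an object part and a bed remainder.

    With preserve_energy=False the two gains sum to 1 (amplitude sum kept).
    With preserve_energy=True the squared gains sum to 1 (energy sum kept);
    using sqrt(1 - fraction) for the remainder is an assumption.
    """
    if preserve_energy:
        g_obj, g_bed = math.sqrt(fraction), math.sqrt(1.0 - fraction)
    else:
        g_obj, g_bed = fraction, 1.0 - fraction
    return [g_obj * s for s in samples], [g_bed * s for s in samples]
```

With these hypothetical boundaries, an object on the screen plane is sent entirely to the bed channels, an object at y = 0.125 is split evenly, and an object at y = 0.25 or further back is left untouched.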
  • it may be included in the output audio content together with metadata (e.g., metadata comprising the spatial position of the audio object) and additional metadata (described below).
  • the step of extracting at least one audio object from the multichannel audio signal comprises, for each extracted audio object, computing a first set of energy levels, each energy level corresponding to a specific channel of the plurality of channels of the multichannel audio signal and relating to (e.g., indicating) an energy level of audio content of the audio object that was extracted from the specific channel, wherein the step of estimating a risk comprises the steps of:
  • the extracted audio object in its original format (e.g., 5.1/7.1) in the multichannel audio signal is compared with a rendered version in the original layout (e.g., 5.1/7.1). If the two versions are similar, allow object extraction as intended; otherwise, handle the audio object differently to reduce the risk of faulty rendering of the audio object.
  • This is a flexible and exact way of determining whether an audio object will be faultily rendered, and it is applicable to all configurations of the multichannel audio signal and all spatial positions of the extracted audio object.
  • each energy level of the first set of energy levels may be compared to the corresponding energy level among the second set of energy levels.
  • the threshold may for example be 1.
  • the difference between the value of the squared panning parameter (energy level) of the L-channel (0.8) and that of the C-channel (0.4) in this case means that the audio content of the extracted audio object that was extracted from the L-channel had twice the energy level of the audio content of the audio object that was extracted from the C-channel.
  • the step of calculating a difference between the first set of energy levels and the second set of energy levels comprises: using the first set of energy levels, rendering the audio object to a third plurality of channels in the first configuration; for each pair of corresponding channels of the third and second plurality of channels, measuring a Root-Mean-Square (RMS) value of each of the pair of channels and determining an absolute difference between the two RMS values; and calculating a sum of the absolute differences for all pairs of corresponding channels of the third and second plurality of channels, wherein the step of determining whether the risk exceeds a threshold comprises comparing the sum to the threshold.
  • the threshold may for example be 1.
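
The RMS-based comparison described above can be sketched as follows. The sketch assumes that both renderings are already available as dictionaries mapping a channel label to a list of samples; the renderers that produce them, the channel labels and the function names are hypothetical and not taken from the disclosure.

```python
import math


def rms(samples):
    """Root-Mean-Square value of one channel of one time frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_difference(original_channels, rendered_channels):
    """Sum over channels of |RMS(original) - RMS(rendered)|.

    original_channels: the object rendered with its extracted energy levels
    (the "third plurality of channels").
    rendered_channels: the object rendered from its spatial position
    (the "second plurality of channels").
    Both are assumed to use the same channel labels.
    """
    return sum(
        abs(rms(original_channels[name]) - rms(rendered_channels[name]))
        for name in original_channels
    )


def risk_exceeds_threshold(original_channels, rendered_channels, threshold=1.0):
    """True if the two renderings differ by more than the threshold."""
    return rms_difference(original_channels, rendered_channels) > threshold
```

A suitable threshold depends on how the signals are scaled; the text gives 1 as an example value.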
  • the step of extracting at least one audio object from the multichannel audio signal comprises, for each extracted audio object, computing a first set of energy levels, each energy level corresponding to a specific channel of the plurality of channels of the multichannel audio signal and relating to (e.g., indicating) an energy level of audio content of the audio object that was extracted from the specific channel, the method further comprising the step of: upon determining that the risk exceeds the threshold, using the first set of energy levels for rendering the audio object to the output bed channels.
  • the present embodiment specifies an example of how to handle audio objects that are determined to be in the danger-zone for being faulty rendered.
  • the audio content of the audio object can be included in the output audio content in a similar way as it was received in the multichannel audio signal.
  • the content can be kept as a channel-based signal in the same format as in the input signal, and sent to the output bed channels. All that is needed is to apply the panning parameters (e.g., energy levels) to the extracted object, obtain the multichannel version of the object, and add it to the output bed channels. This is a simple way of making sure that the audio content of the audio object will be rendered as intended by the mixer of the multichannel audio signal.
  • the method further comprises the steps of multiplying the audio object with 1 minus the fraction value to achieve a second fraction of the audio object, and using the first set of energy levels for rendering the second fraction of the audio object to the output bed channels.
  • the audio content of the fraction of the audio object not included in the output audio content as described above is instead included in the output bed channels.
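
A minimal sketch of sending the remainder of an object back to the bed channels is given here. It assumes that each energy level is a squared panning gain, as the earlier L-channel/C-channel example suggests, so the amplitude gain per channel is its square root; the function name and the dictionary layout are hypothetical.

```python
import math


def add_to_bed_channels(bed_channels, object_samples, energy_levels, fraction=0.0):
    """Mix the part of an object that is not kept as an object into the beds.

    bed_channels: dict channel label -> list of samples (modified in place).
    energy_levels: dict channel label -> squared panning gain of the object
    in that channel (the "first set of energy levels").
    fraction: fraction value of the object kept as an object; the remainder
    (1 - fraction) is what goes to the beds.
    """
    remainder = 1.0 - fraction
    for name, energy in energy_levels.items():
        gain = math.sqrt(energy) * remainder  # amplitude gain for this channel
        bed = bed_channels[name]
        for i, sample in enumerate(object_samples):
            bed[i] += gain * sample           # add to any existing bed content
    return bed_channels
```

Because the original per-channel gains are reused, the bed contribution lands in the same channels the mixer used, whatever the estimated position of the object.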
  • the method further comprises the step of, upon determining that the risk exceeds the threshold, including in the output audio content: the audio object, metadata comprising the spatial position of the audio object and additional metadata, wherein the additional metadata is configured so that it can be used at a rendering stage to ensure that the audio object is rendered in channels in the first configuration with predetermined positions corresponding to the
  • the method further comprises the steps of: including in the output audio content: the audio object, metadata comprising the spatial position of the audio object and additional metadata, wherein the additional metadata indicates at least one from the list of:
  • If an audio object is determined to be in the danger zone of being faulty rendered, it can be included as a special audio object in the output audio content, with additional metadata.
  • the additional metadata can then be used by a renderer to render the audio object in the channels initially intended by the mixer of the multichannel audio signal.
  • the additional metadata can comprise the panning parameters, or energy levels, each energy level corresponding to a specific channel of the plurality of channels of the multichannel audio signal and relating to (e.g., indicating) an energy level of audio content of the audio object that was extracted from the specific channel.
  • the additional metadata is included in the output audio content only upon determining that the risk exceeds the threshold.
  • the additional metadata comprises a zone mask, e.g. data pertaining to at least one channel of the plurality of channels which is not included in the specific subset of the plurality of channels from which the object was extracted.
  • the additional metadata may comprise a divergence parameter, which e.g. may define how large part of an audio object positioned near or on the predetermined position of the center channel in the first configuration that should be rendered in the center channel, and thus implicitly how large part that should be rendered in the left and right channel.
  • the step of extracting at least one audio object from the multichannel audio signal comprises, for each extracted audio object, computing the first set of energy levels, each energy level corresponding to a specific channel of the plurality of channels of the multichannel audio signal and relating to (e.g., indicating) an energy level of audio content of the audio object that was extracted from the specific channel.
  • upon determining that the risk exceeds the threshold, the method further comprises the steps of:
  • step c)-f) as described above on each further audio object of the at least one further audio object.
  • Each further audio object may then be handled as described in any of the embodiments above.
  • the methods described above may be performed iteratively on the remaining multichannel audio signal when a first audio object has been extracted, to extract further audio objects and check if those should be included in the output audio content as is, or if they should be handled differently.
  • an iteration comprises extracting a plurality of audio objects (for example 1, 2, 3, or 4) from the multichannel audio signal. It should be understood that in these cases, the methods described above are performed on each of the extracted audio objects.
  • an energy level of the obtained time frame of the difference multichannel audio signal is less than a second threshold energy level.
  • any of the methods above may be performed iteratively until one of these stop criteria is met. This may reduce the risk of extracting an audio object with a small energy level which may not improve the listening experience since a person will not perceive the audio content as a distinct object when playing e.g. the movie.
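
The iterative extraction with stop criteria can be sketched as a loop. This is an illustration only: the `extract` callback stands in for the source separation step and is hypothetical, as are the parameter names and default thresholds. The stop criterion on the energy of the extracted object is inferred from the remark about objects with a small energy level; the criterion on the residual (difference) signal is the one stated in the text.

```python
def energy(channels):
    """Total energy of a dict channel label -> list of samples."""
    return sum(s * s for samples in channels.values() for s in samples)


def extract_iteratively(frame, extract, max_objects=4,
                        min_object_energy=1e-4, min_residual_energy=1e-4):
    """Extract objects until a stop criterion is met.

    extract(residual) is a hypothetical callback returning
    (object_samples, contribution), where contribution maps each channel
    label to the samples the object occupied in that channel.
    """
    residual = {name: list(samples) for name, samples in frame.items()}
    objects = []
    while len(objects) < max_objects and energy(residual) >= min_residual_energy:
        obj, contribution = extract(residual)
        if energy({"object": obj}) < min_object_energy:
            break  # too quiet to be perceived as a distinct object
        # Difference signal: remove the object's content from every channel.
        residual = {
            name: [r - c for r, c in zip(residual[name], contribution[name])]
            for name in residual
        }
        objects.append(obj)
    return objects, residual
```

Whatever remains in `residual` when the loop ends corresponds to content that would be passed to the bed channels.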
  • individual audio objects or sources are extracted from the direct signal (multichannel audio signal).
  • the contents that are not suitable to be extracted as objects are left in the residual signal which is then passed to the bed channels as well.
  • the bed channels are often in a similar configuration as the first configuration, e.g. a 7.1 configuration or similar, wherein new content added to the channels is combined with any already existing content of the bed channels.
  • a computer program product comprising a computer-readable storage medium with instructions adapted to carry out the method of the first aspect, when executed by a device having
  • a receiving stage arranged for receiving (e.g., configured to receive) the multichannel audio signal
  • an object extraction stage arranged for extracting (e.g., configured to
  • a spatial position estimating stage arranged for estimating (e.g., configured to estimate) a spatial position of the audio object
  • a risk estimating stage arranged for, based on the spatial position of the audio object, estimating (e.g., configured to estimate) a risk that a rendered version of the audio object in channels in the first configuration will be rendered in channels with predetermined positions differing from the predetermined positions of the specific subset of the plurality of channels from which the object was extracted, and determining whether the risk exceeds a threshold,
  • a converting stage arranged for, in response to the risk estimating stage determining that the risk does not exceed the threshold, including (e.g., configured to include) the audio object and metadata comprising the spatial position of the audio object in the output audio object content.
  • example embodiments propose methods for processing a time frame of audio content having a spatial position, devices
  • a method for processing a time frame of audio content having a spatial position comprising the steps of:
  • the coordinate system in this embodiment is normalized for ease of explanation, and thus encompasses any suitable coordinate system and ranges of the component of the coordinate system.
  • the inventors have realized that it would be advantageous to provide high-level controls to the mixer, controlling intuitive, high-level parameters that can vary over time and can either be controlled manually or pre-set, or inferred automatically based on the characteristics of the content of the audio objects.
  • Adjustment of the spatial position and/or the energy level of the audio content is advantageous in that the result of such
  • a single parameter may control the extent of the adjustment, which can be compared with turning a knob on a mixer board. Consequently, if the control value is zero, no adjustment is made. If the control value is at its maximum value (e.g., 1 in case of a normalized control value, but any other range of control values may be possible, such as 0-10), full adjustment of the property/properties of the audio content based on the distance value is made.
  • the control value may thus be user defined according to some embodiments. However, the control value may also be automatically generated by analyzing the audio content. For example, certain adjustments may only be suitable for music content, and not for dialogue content.
  • a dialogue detection stage and a music detection stage may be adapted to set the control value, increasing the adjustments (increased control value) when music and no dialogue are detected, and setting the control value to 0 when dialogue is detected which will lead to no adjustments as described above.
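
The role of the control value can be sketched as follows. The detectors themselves are outside the scope of the example and are represented by boolean flags; the value used when neither music nor dialogue is detected, and the linear interpolation between the original and the fully adjusted property, are assumptions made for illustration.

```python
def control_from_detectors(dialogue_detected, music_detected,
                           music_value=1.0, default_value=0.0):
    """Derive a control value from (hypothetical) detector outputs.

    Dialogue always disables the adjustment; music without dialogue enables
    it. default_value for "neither detected" is an assumption.
    """
    if dialogue_detected:
        return 0.0
    if music_detected:
        return music_value
    return default_value


def apply_control(original, fully_adjusted, control, max_control=1.0):
    """Blend between no adjustment (control 0) and full adjustment (max).

    Linear interpolation is an assumption; the text only fixes the two
    end points.
    """
    t = min(max(control / max_control, 0.0), 1.0)
    return original + t * (fully_adjusted - original)
```

For a control range of 0-10, passing `max_control=10.0` gives the same behavior as the normalized case.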
  • the embodiments for processing a time frame of audio content need not be applied to all audio objects and/or channels in e.g. an input audio content.
  • only a subset of the audio objects is subjected to the methods described herein.
  • audio objects relating to dialog are not subjected, but instead kept as is.
  • only (a subset of) audio objects in the input audio content are subjected, while any channels-based audio content (e.g., bed channels) are left as is.
  • the properties of the audio content are determined to be adjusted if the distance value does not exceed a threshold value, wherein upon determining that properties of the audio content should be adjusted, the spatial position is adjusted at least based on the distance value and on the x-value of the spatial position.
  • the spatial position of audio content can be adjusted based on if it is near the screen, and based on where in the room it is positioned in an x-direction.
  • This embodiment may for example be used for achieving a spread out effect of audio objects near a specific area such as the screen which for example may have the effect that other sounds on screen (dialogue, effects, etc.) are more intelligible because spatial masking is reduced.
  • the step of adjusting the spatial position comprises adjusting the z value of the spatial position based on the x-value of the spatial position and adjusting the y value of the spatial position based on the x value of the spatial position.
  • audio objects and/or bed channels on screen may be mapped to an arc encompassing the screen from front left channel and front right channel.
  • the control value may control the amount of spread. If the control value is set to zero, the function doesn't affect the content. The effect is thus achieved by modifying audio content position (e.g., spatial position of an audio object or canonical position of a channel).
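
One possible way to spread near-screen content onto an arc is sketched here. The text states only that y and z are adjusted based on x for content close to the screen and that the control value sets the amount of spread; the sine-shaped arc, the fade with distance from the screen, the coordinate ranges (0 to 1) and every parameter name and default value are assumptions for the example.

```python
import math


def spread_to_arc(x, y, z, control, y_threshold=0.1,
                  max_depth=0.2, max_height=0.5):
    """Move content near the screen onto an arc spanning left to right.

    Coordinates are assumed normalized to [0, 1]: x left to right, y front
    to back (0 = screen), z bottom to top. Content with y above y_threshold
    is returned unchanged, as is everything when control is 0.
    """
    if control <= 0.0 or y > y_threshold:
        return x, y, z
    proximity = 1.0 - y / y_threshold   # 1 on the screen, 0 at the threshold
    bulge = math.sin(math.pi * x)       # 0 at front left/right, 1 at center
    amount = control * proximity * bulge
    new_y = min(1.0, y + amount * max_depth)
    new_z = min(1.0, z + amount * max_height)
    return x, new_y, new_z
```

With this shape, content at the front left and front right positions stays where it is, while content at the center of the screen is moved the most.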
  • the properties of the audio content are determined to be adjusted only if the distance value exceeds a threshold value, wherein upon determining that properties of the audio content should be adjusted, the energy level is adjusted at least based on the distance value and on the z-value of the spatial position.
  • audio objects positioned away from a certain area e.g. the screen
  • the control value may control the amount of boost permitted.
  • the method comprises the step of, prior to the step of determining whether properties of the audio content should be adjusted, determining a current energy level of the time frame of the audio content, wherein the energy level of the audio content is adjusted also based on the current energy level. For example, subtle audio objects may be boosted more than audio objects that are not subtle, which according to some embodiments should not be boosted at all. For this reason, according to some embodiments, the properties of the audio content are determined to be adjusted only if the current energy level does not exceed a threshold energy level.
  • the method comprises receiving an energy adjustment parameter pertaining to a previous time frame of the audio content, wherein the energy level is adjusted also based on the energy adjustment parameter. Consequently, the boost applied is adaptive to the boost previously applied, to achieve a smoother boosting of the audio content.
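
A sketch of an adaptive boost for subtle off-screen content follows. It combines the ingredients named in the text (distance value, z-value, current energy level, energy adjustment parameter of the previous frame, control value), but the way they are combined is an assumption: the decibel scale, taking the larger of the normalized distance and z, the one-pole smoothing, and all parameter names and defaults are illustrative.

```python
def boost_gain_db(y, z, level_db, prev_gain_db, control,
                  y_threshold=0.1, level_threshold_db=-30.0,
                  max_boost_db=6.0, smoothing=0.8):
    """Gain (in dB) to apply to the current time frame.

    y, z: normalized position (y = 0 at the screen, z = 0 at floor level).
    level_db: current energy level of the frame.
    prev_gain_db: gain applied to the previous frame.
    control: 0 disables the boost, 1 allows the full max_boost_db.
    """
    target_db = 0.0
    if y > y_threshold and level_db <= level_threshold_db:
        # How far "off screen" the content is, from distance and height.
        off_screen = max((y - y_threshold) / (1.0 - y_threshold), z)
        target_db = control * max_boost_db * min(1.0, off_screen)
    # Smooth towards the target so the boost does not jump between frames.
    return smoothing * prev_gain_db + (1.0 - smoothing) * target_db
```

Calling the function once per time frame and feeding the result back as `prev_gain_db` makes the boost ramp up and down gradually.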
  • the properties of the audio content are determined to be adjusted only if the distance value exceeds a threshold value, wherein the z value of the spatial position is adjusted based on the distance value.
  • audio object/channels further from the predefined area e.g., the screen
  • the present embodiment may lift audio objects towards the ceiling when they were panned (as an example of being positioned) on the walls in the rear part of the room (as an example of the three-dimensional space).
  • the z value is adjusted to a first value for a first distance value, and to a second value lower than the first value for a second distance value being lower than the first distance value. Accordingly, audio objects/channels further back in the room may be pushed closer to the ceiling compared to objects/channels closer to the screen.
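
The lift towards the ceiling can be sketched as a function of the distance value. The text requires only that no adjustment is made up to a threshold distance and that larger distances give larger z values; the linear ramp, the normalized coordinates and the parameter names are assumptions.

```python
def lift_rear(y, z, control, y_threshold=0.5, max_lift=1.0):
    """Raise the z value of content in the rear part of the room.

    y: distance value (0 = screen, 1 = rear wall, assumed normalized).
    z: current height (0 = floor level, 1 = ceiling).
    control: 0 disables the lift, 1 applies the full max_lift at the rear wall.
    """
    if control <= 0.0 or y <= y_threshold:
        return z
    ramp = (y - y_threshold) / (1.0 - y_threshold)  # 0 at threshold, 1 at rear
    return min(1.0, z + control * max_lift * ramp)
```

With the default values, content halfway between the threshold and the rear wall is lifted to half the room height, and content on the rear wall reaches the ceiling.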
  • a computer program product comprising a computer-readable storage medium with instructions adapted to carry out the method according to the second aspect when executed by a device having processing capability.
  • a device for processing a time frame of an audio content comprising a processor arranged (e.g., configured) to:
  • the step of determining a distance value comprises using the y component of the spatial position as the distance value
  • the processor is arranged to receive a control value and adjust at least one of the spatial position and an energy level of the audio content at least based on the distance value and the control value.
  • the format of output audio content is exemplified as Dolby Atmos content.
  • this is just an example and any other object-based sound format may be used.
  • the x component indicates the dimension that extends from left to right
  • the y component indicates the dimension that extends from front to back
  • the z component indicates the dimension that extends from bottom to top.
  • This coordinate system is shown in figure 17.
  • any 3D coordinate system is covered by the present disclosure.
  • Legacy-to-Atmos is a content creation tool that takes 5.1 or 7.1 content (which could be a full mix, or parts of it, e.g., stems) and turns this legacy content into Atmos content, consisting of audio objects (audio + metadata) and bed channels.
  • objects are extracted from the original mix by applying source separation to the direct component of the signal. Source separation is exemplified above, and will not be discussed further in this disclosure.
  • Source separation is just an example and any other method for converting legacy content to an object-based sound format may be used.
  • the spatial position metadata (e.g., in the form of x, y) of extracted objects 112, 114 is estimated from the channel levels, as shown in figures 1a-b.
  • the circles 102-110 represent the channels of a 5.1 audio signal (which is an example of a multichannel audio signal which comprises a plurality of channels in a first configuration, e.g., a 5.1 channel configuration), and their darkness represents the audio level of each channel.
  • for the audio object 112 in figure 1a, most of the audio content can be found in the front left channel (L) 102, some of the audio content can be found in the center channel (C) 104 and a little audio content can be found in the rear left channel 108. All channels in such a configuration have a predetermined position pertaining to a loudspeaker setup and defined in a predetermined coordinate system, e.g., as shown in figure 17.
  • Figures 1a-b each show a time frame of a multichannel audio signal for a specific audio object. It should be noted that figures 1a-b show the simplified case where only one audio object is included in the multichannel audio signal, for ease of description.
  • LTA will extract an audio object 112, 114 from the time frame of the
  • the audio objects 112, 114 are extracted from a specific subset of the plurality of channels, e.g. the subset of the front left channel 102, the center channel 104 and the rear left channel 108 for figure 1a, and the front left channel 102 and the front right channel (R) in figure 1b.
  • a spatial position for each audio object 112, 114 is estimated and shown by the squares 112, 114 in figures 1a-b.
  • the result obtained for the rendered audio object 112 is identical (or very similar) to the originally received time frame of the multichannel audio signal.
  • the audio object 114, which was originally intended to be located in the centre by phantom imaging (i.e., by using only the front left channel 102 and the front right channel 106), is now fully rendered to the center channel 104, irrespective of the initial artistic intention of the mixer, who avoided activating the centre speaker. This is an example of violating the original artistic intention, potentially leading to a significantly degraded listening experience.
  • the audio objects which are at risk of being faultily rendered should be handled differently to reduce the risk of such a violation.
  • only audio objects not at risk (or with a risk below a certain threshold) of being faultily rendered should be included in the output audio object content in the normal way, i.e. as audio content and metadata comprising the spatial position of the audio object.
  • a device and method for converting a time frame of a multichannel audio signal into output audio content comprising audio objects, metadata comprising a spatial position for each audio object, and bed channels, will now be described by way of example in conjunction with figures 2 and 16.
  • An audio stream 202 (i.e., the multichannel audio signal), is received S1602 by the device 200 at a receiving stage (not shown) of the device.
  • the device 200 further comprises an object extraction stage 204 arranged for extracting S1604 at least one audio object 206 from the time frame of the multichannel audio signal.
  • the number of extracted objects at this stage may be user defined, or predefined, and may be any number between one and an arbitrary number (n). In an example embodiment, three audio objects are extracted at this stage. However, for ease of explanation, in the below description, only one audio object is extracted at this stage.
  • panning parameters 208 e.g., a set 208 of energy levels, each energy level corresponding to a specific channel of the plurality of channels of the multichannel audio signal 202 and relating to (e.g., indicating) an energy level of audio content of the audio object 206 that was extracted from the specific channel
  • panning parameters 208 are also computed. Since each channel in the multichannel audio signal has a predetermined position in space, panning
  • Both the audio object and the panning parameters are sent to a spatial position estimating stage 203 arranged for estimating S1606 a spatial position of the audio object. This estimation S1606 is thus done using the panning parameters, and a spatial position (x, y) 207 is output from the spatial position estimating stage 203 along with the audio object 206 and the panning parameters 208.
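The estimation of a spatial position from the panning parameters can be illustrated with a minimal sketch. It assumes an energy-weighted centroid of the channel positions, which is only one possible estimator and is not stated in this disclosure; the name `estimate_position` and the coordinates in `CHANNEL_POSITIONS` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical canonical (x, y) positions for a 5.1 layout (LFE omitted).
# x runs from left to right and y from front to back, both in [0, 1].
CHANNEL_POSITIONS: Dict[str, Tuple[float, float]] = {
    "L": (0.0, 0.0), "C": (0.5, 0.0), "R": (1.0, 0.0),
    "Ls": (0.0, 1.0), "Rs": (1.0, 1.0),
}


def estimate_position(energy_levels: Dict[str, float]) -> Tuple[float, float]:
    """Return the energy-weighted centroid of the channel positions.

    This centroid is an assumed estimator chosen for illustration.
    """
    total = sum(energy_levels.values())
    if total <= 0.0:
        raise ValueError("the object carries no energy in any channel")
    x = sum(e * CHANNEL_POSITIONS[ch][0] for ch, e in energy_levels.items()) / total
    y = sum(e * CHANNEL_POSITIONS[ch][1] for ch, e in energy_levels.items()) / total
    return x, y
```

With equal energy in L and R only, this sketch returns (0.5, 0.0), i.e. the position of the center channel, which is the phantom-image situation in which a later rendering may activate a channel that the original mix left silent.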
  • a risk estimating stage 210 is arranged for estimating S1608 a risk that a rendered version of the audio object 206 in channels in the first configuration will be rendered in channels with predetermined positions differing from the predetermined positions of the specific subset of the plurality of channels from which the object was extracted.
  • the risk estimation stage 210 is arranged to detect when artistic intention is at stake, i.e. by determining S1610 whether the risk exceeds a threshold.
  • the algorithms used in the risk estimation stage 210 will be further described below in conjunction with figures 3a, 3b and 4.
  • the audio object 206 and metadata are included in the output audio content (e.g., the output audio object content).
  • the audio object 206 and the spatial position 207 are sent to a converting stage 216 which is arranged for including the audio object 206 and metadata comprising the spatial position 207 of the audio object in the output audio object content 222 which is part of the output audio content 218.
  • Any metadata (e.g., metadata comprising the spatial position 207 of the audio object) may be added to the output audio object content, for example in any of the following forms:
  • a separate file e.g. a text file with the same name of the audio object file
  • the panning parameters 208 and the audio object 206 are sent to an artistic preservation stage 212.
  • the functionality and algorithms of the artistic preservation stage 212 are described below in conjunction with figures 5 and 6.
  • a first example embodiment of a risk estimation stage 210 is shown in figure
  • This embodiment is based on computing the position of an extracted object, and determining how much of it should be extracted, and how much should be preserved.
  • a smaller figure 3b is interspersed showing, by way of example, an extracted audio object 206 on a 5.1 layout (coordinates according to figure 17).
  • a predetermined area 302 is shown in the layout of figure 3b.
  • the risk is determined to not exceed the threshold and consequently, the audio object 206 and metadata comprising the spatial position 207 of the audio object are included as is in the output audio object content 222 which is part of the output audio content 218.
  • the predetermined area 302 may according to embodiments include the predetermined positions of at least some of the plurality of channels in the first configuration.
  • the first configuration corresponds to a 5.1-channel set-up and the predetermined area 302 includes the predetermined positions of the L, C and R channels in the first configuration.
  • a 7.1 layout is equally possible.
  • the predetermined positions of the L, C and R channels share a common y-coordinate value (e.g., 0) in the predefined coordinate system.
  • the predetermined area includes positions having a y-coordinate value up to a threshold distance a away from said common y-coordinate.
  • if the spatial position is determined to be outside the predetermined area 302, i.e. further away from the common y-coordinate (i.e., 0 in this example) than the threshold distance a, the risk is determined to not exceed the threshold.
  • the predetermined area comprises a first sub area 304.
  • a fraction value is determined by the risk estimation stage 210. The fraction value corresponds to a fraction of the audio object to be included in the output audio content and is based on a distance between the spatial position 207 and the first sub area 304, wherein the value is a number between zero and one.
  • Other suitable functions and values of a are equally possible.
  • the extracted audio object 206 is multiplied by the fraction to extract. This way, objects in the first sub area (e.g., in the screen) will be muted, audio objects far from the first sub area will be unchanged, and audio objects 206 in the transition zone (in the predetermined area 302 but not in the first sub area 304) will be attenuated according to the value of the function.
  • the fraction of the audio object (or the full audio object) 314 and metadata comprising the spatial position 207 of the audio object 206 are sent to the converting stage 216 which is arranged for including the fraction of the audio object (or the full audio object) 314 and metadata comprising the spatial position 207 of the audio object in the output audio object content 222 which is part of the output audio content 218.
  • the extracted audio object is multiplied by 1 minus the fraction value (e.g., 1 - f(y)) and the resulting fraction of the audio object 308 is sent to the artistic preservation stage 212 which is exemplified below in conjunction with figures 5-6.
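The split of an extracted object into a fraction f(y) that is kept as an object and a fraction 1 - f(y) that is preserved can be sketched as follows. The linear ramp, the default value of `a` and the names `fraction_to_extract` and `split_object` are assumptions made for illustration; the disclosure only requires a value between zero and one that depends on the distance to the first sub area.

```python
from typing import List, Tuple


def fraction_to_extract(y: float, sub_area_limit: float = 0.0, a: float = 0.2) -> float:
    """Assumed linear ramp f(y).

    Returns 0 inside the first sub area (y <= sub_area_limit), 1 outside the
    predetermined area (y >= a), and a linear transition in between.
    """
    if a <= sub_area_limit:
        raise ValueError("a must be larger than sub_area_limit")
    if y <= sub_area_limit:
        return 0.0
    if y >= a:
        return 1.0
    return (y - sub_area_limit) / (a - sub_area_limit)


def split_object(samples: List[float], y: float) -> Tuple[List[float], List[float]]:
    """Split one time frame into the extracted part and the preserved part."""
    f = fraction_to_extract(y)
    extracted = [f * s for s in samples]          # goes to the converting stage
    preserved = [(1.0 - f) * s for s in samples]  # goes to artistic preservation
    return extracted, preserved
```

Because the two parts are scaled by f and 1 - f, their sample-wise sum reproduces the original frame, so no content is lost regardless of where the object is located.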
  • Another embodiment of the risk estimation stage 210 is shown in figure 4. This embodiment is based on comparing the extracted object in its original configuration (e.g., 5.1/7.1 layout) with a rendered version in the same configuration (e.g., 5.1/7.1), according to the below.
  • the panning parameters 208 are needed.
  • the extracting of an audio object (see figure 2, the object extraction stage or source separation stage 204) from the multichannel audio signal comprises computing a first set of energy levels, where each energy level corresponds to a specific channel of the plurality of channels of the multichannel audio signal and relates to (e.g., indicating) an energy level of audio content of the audio object that was extracted from the specific channel.
  • the panning parameters 208 are thus received by the risk estimation stage 210 along with the extracted audio object 206 and the estimated spatial position 207.
  • the spatial position of the audio object is used for rendering the audio object to a second plurality of channels in the first configuration and computing a second set of energy levels based on the rendered object, each energy level corresponding to a specific channel of the second plurality of channels in the first configuration and relating to (e.g., indicating) an energy level of audio content of the audio object that was rendered to the specific channel of the second plurality of channels.
  • the two sets of energy levels are then compared and a difference is calculated, for example using the absolute difference of each pair of corresponding energy levels. Based on this difference, the risk is estimated.
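A minimal sketch of this comparison is given below, assuming that the per-channel absolute differences are summed into a single risk value that is then compared with a threshold. The function names and the example numbers are hypothetical.

```python
from typing import Sequence


def estimate_risk(extracted_energy: Sequence[float],
                  rendered_energy: Sequence[float]) -> float:
    """Sum of absolute differences between corresponding channel energy levels."""
    if len(extracted_energy) != len(rendered_energy):
        raise ValueError("both sets must cover the same channels")
    return sum(abs(a - b) for a, b in zip(extracted_energy, rendered_energy))


def risk_exceeds_threshold(risk: float, threshold: float) -> bool:
    """True when the object should be routed to artistic preservation."""
    return risk > threshold


# Channel order (L, C, R, Ls, Rs): a phantom centre extracted from L and R only,
# but rendered entirely to C, differs in three channels.
phantom_risk = estimate_risk([0.5, 0.0, 0.5, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0])
```

In this example the risk evaluates to 2.0, whereas an object that is rendered back to exactly the channels it was extracted from yields 0.0.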
  • Figure 4 shows a further embodiment based on comparing the extracted object in its original configuration (e.g., 5.1/7.1 layout) with a rendered version in the same configuration (e.g., 5.1/7.1).
  • the step of calculating a difference between the first set of energy levels and the second set of energy levels comprises using the first set of energy levels 208, rendering the audio object using a renderer 402 to a third plurality of channels 406 in the first configuration.
  • this embodiment comprises rendering the audio object 206 using a renderer 402 to a second plurality of channels 408 in the first configuration.
  • the audio object 206 and metadata are included into the output audio content (e.g., output audio object content).
  • the audio object 206 and metadata are sent to the converting stage 216 as described above.
  • the audio object 206 and the set of energy levels 208 are sent to the artistic preservation stage 212. Embodiments of such a stage 212 will now be described in conjunction with figures 5-6.
  • if the extracted object is detected as violating an artistic intention (exceeding the threshold), its content in the original multichannel format (e.g., 5.1/7.1) is kept as a residual signal and added to the output bed channels.
  • This embodiment is shown in figure 5.
  • the panning parameters, or the set of energy levels computed when extracting the audio object from the multichannel audio signal, are needed. For this reason, the panning parameters 208 and the audio object are both sent to the artistic preservation stage 212.
  • the panning parameters 208 are applied to the extracted object 206 to obtain the multichannel version 502 of the object to preserve.
  • the multichannel version 502 is then added to the output bed channels 224 in the converting stage 216.
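The preservation path of figure 5 can be sketched as follows, under the assumption that the panning parameters are available as linear per-channel gains (if they are stored as energy levels, a conversion to gains would be needed first). The name `preserve_in_beds` and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def preserve_in_beds(obj: List[float],
                     gains: Dict[str, float],
                     beds: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Pan the object back with its original gains and mix it into the beds.

    Channels without a gain entry receive nothing from the object.
    """
    out: Dict[str, List[float]] = {}
    for channel, bed in beds.items():
        if len(bed) != len(obj):
            raise ValueError("bed and object frames must have equal length")
        g = gains.get(channel, 0.0)
        out[channel] = [b + g * s for b, s in zip(bed, obj)]
    return out


beds = {"L": [0.0, 0.0], "C": [1.0, 1.0], "R": [0.0, 0.0]}
mixed = preserve_in_beds([2.0, 4.0], {"L": 0.5, "R": 0.5}, beds)
```

Here the center bed is left untouched because the object was never panned to it, which is the property the preservation stage is meant to guarantee.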
  • a second fraction of the audio object is received by the artistic preservation stage 212 along with the panning parameters 208 of the audio object.
  • the second fraction is achieved by multiplying the audio object with 1 minus the fraction value f(y) (figure 3c) and using the first set of energy levels 208 for rendering the second fraction of the audio object to the bed channels via a multichannel version 502 of the second fraction of the object, as described above.
  • Figure 6 shows another example of the artistic preservation stage 212.
  • This embodiment is based on computing additional metadata to accompany object extraction in cases where artistic intention may be violated by normal object extraction. If the extracted object is detected as violating an artistic intention (as described above), it can be stored as a special audio object along with additional metadata (e.g., its panning parameters that describe how it was panned in the original 5.1/7.1 layout) and included in the output audio object content 222 which is part of the output audio content 218.
  • This method also applies to the partially preserved object (second fraction) resulting from the embodiment of figure 3a-c.
  • the additional metadata is computed using the panning parameters 208 and can be used to preserve the original artistic intention, e.g. by one of the following methods at the rendering stage:
  • the additional metadata can be used at the rendering stage to ensure that the audio object is rendered in channels in the first configuration with
  • the artistic preservation stage 212 computes additional metadata 602, which is sent to the converting stage 216 and added to the output audio content 218 along with the audio object and the metadata comprising the spatial position 207 of the audio object 206.
  • the additional metadata 602 indicates at least one from the list of:
  • the additional metadata 602 may indicate the panning parameters (set of energy levels) 208 computed when extracting the audio object 206.
  • if the extracted object were detected as violating an artistic intention, using either of the embodiments of figure 5 or 6 to preserve the artistic intention would neutralise the object extraction itself.
  • the extracted object might be left without signal by applying the embodiment of figures 3a-c if the fraction to be extracted is zero.
  • a third multichannel audio signal i.e., a difference signal
  • the stop criterion may be at least one stop criterion from the following list of stop criteria:
  • an energy level of an extracted further object is less than a first threshold energy level
  • a total number of extracted objects exceeds a threshold number, e.g. 1, 3 or
  • an energy level of the obtained time frame of the difference multichannel audio signal is less than a second threshold energy level.
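The three stop criteria listed above can be combined into a single check, as in the following sketch. It assumes that meeting any one criterion is sufficient to stop; the function name and all threshold parameters are hypothetical and would be chosen by the implementer.

```python
def should_stop(object_energy: float,
                num_objects: int,
                difference_energy: float,
                *,
                min_object_energy: float,
                max_objects: int,
                min_difference_energy: float) -> bool:
    """Return True when any of the three stop criteria is met."""
    return (
        object_energy < min_object_energy              # further object too weak
        or num_objects > max_objects                   # too many objects extracted
        or difference_energy < min_difference_energy   # little left to extract
    )
```

A caller would evaluate this check after each further extraction from the difference signal and stop extracting as soon as it returns True.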
  • the disclosure will now turn to methods, devices and computer program products for modifying e.g. the output of LTA (processing a time frame of an audio object) in order to enable artistic control over the final mix.
  • All methods relate to processing a time frame of audio content having a spatial position.
  • the audio content is exemplified as an audio object, but it should be noted that the methods described below also apply to audio channels, based on their canonical positions. Also, for simplicity of description, the time frame of an audio object is sometimes referred to as "the audio object".
  • Legacy-to-Atmos is a content creation tool that takes 5.1 or 7.1 content (which could be a full mix, or parts of it, e.g., stems) and turns it into Atmos content, consisting of objects (audio + metadata) and bed channels.
  • Such process is typically blind, based on a small set of predefined parameters that provide a very small degree of aesthetical control over the result. It is thus desirable to enable a processing chain that modifies the output of LTA in order to enable artistic control over the final mix.
  • the direct manipulation of each individual object extracted by LTA is, in many cases, not viable (objects too unstable and/or with too much leakage from others, or simply too time-consuming).
  • Screen Spread: spreading of objects in a specific region (e.g., near the screen). According to some embodiments, the screen spread effect is only applied to music content, and not to dialogue content.
  • Height boost: increasing the level of subtle elements positioned away from critical regions (e.g., objects away from the screen and the horizontal plane).
  • Ceiling attraction: repositioning of elements, e.g. increasing their height as a function of their distance from the screen.
  • Each method is for processing a time frame of an audio object.
  • a device 1800 implementing the method is shown in figure 18.
  • the device comprises a processor arranged to receive the time frame of the audio object 1810, and to determine a spatial position of the time frame of the audio object 1810 in a position estimation stage 1802. Such a determination may, for example, be made using received metadata that comprises the spatial position of the audio object and is received together with the time frame of the audio object 1810.
  • the time frame of the audio object 1810 and the spatial position 1812 of the audio object are then sent to an adjustment determination stage 1804.
  • the processor determines whether properties of the audio object should be adjusted. According to some embodiments, such determination can also be made based on a control value 1822 received by the adjustment determination stage 1804. For example, if the control value 1822 is 0 (i.e., no adjustment to be made), the value can be used to exit the adjustment determination stage 1804 and send the time frame of the audio object 1810 as is to an audio content production stage 1808. In other words, in case it is determined that properties should not be adjusted, the time frame of the audio object 1810 is sent as is to an audio content production stage 1808 to be included in the output audio content 1820.
  • the time frame of the audio object 1810 and the spatial position 1812 of the audio object are sent to a distance calculation stage 1804 which is arranged to determine a distance value 1814 by comparing the spatial position 1812 of the audio object to a predetermined area.
  • the distance value is determined using the y component of the spatial position as the distance value.
  • the distance value 1814, the spatial position 1812 and the time frame of the audio object 1810 are sent to a properties adjustment stage 1806, which also receives a control value 1822. Based on at least the distance value 1814 and the control value 1822, at least one of the spatial position and an energy level of the audio object is adjusted. In case the spatial position is adjusted, the adjusted spatial position 1816 is sent to the audio content production stage 1808 to be included in the output audio content 1820 along with the (optionally adjusted) time frame of the audio object 1810.
  • Figures 7-10 describe a method for spreading sound to the proscenium speakers (Lw, Rw), and optionally even using the first line of ceiling speakers to create an arch around the screen.
  • the properties of the audio object are determined to be adjusted if the distance value does not exceed a threshold value, i.e. the spatial position is close to the screen.
  • This can be controlled using the function 802 (yControl(y)) shown in figure 8, which has a value of 1 near the screen and decays to zero away from the screen, where reference 804 represents the threshold value as described above.
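A control function of this kind can be sketched as below. The exact shape of the curve in figure 8 is not reproduced here; the sketch assumes a flat region of value 1 followed by a linear decay that reaches zero at the threshold, and the parameters `threshold` and `plateau` are hypothetical.

```python
def y_control(y: float, threshold: float = 0.3, plateau: float = 0.1) -> float:
    """Assumed piecewise-linear yControl(y).

    Returns 1 for positions at or in front of `plateau`, 0 at or beyond
    `threshold`, and a linear decay in between.
    """
    if not 0.0 <= plateau < threshold:
        raise ValueError("require 0 <= plateau < threshold")
    if y <= plateau:
        return 1.0
    if y >= threshold:
        return 0.0
    return (threshold - y) / (threshold - plateau)
```

The returned value can be used as a weight on the position adjustment, so that objects at the screen receive the full spread while objects beyond the threshold are left unchanged.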
  • the spatial position is adjusted at least based on the distance value and on the x-value of the spatial position.
  • the z value of the spatial position of the object may be adjusted based on the x-value of the spatial position, e.g. as described in figure 10 where two transfer functions 1002, 1004 between the x-value of the spatial position and their respective effect on the z-value of the spatial position of the audio object are shown.
  • the y value of the spatial position may be adjusted based on the x value of the spatial position as described in figure 9.
  • the method described in figures 7-10 includes:
  • bed channels do not have associated position metadata; in order to apply the processing to L, C, R channels, in the current implementation they may be turned into static objects located at their canonical positions. As such, also the spatial position of bed channels can be modified according to this embodiment.
  • Figures 11-13 show a method for processing a time frame of an audio object according to another embodiment.
  • the effect of LTA vs. the original 5.1/7.1 multichannel audio signal (legacy signal) is subtle. This is due to the fact that the perception of sound in 3D seems to call for enhanced immersion, i.e. a boost of subtle out-of-screen and ceiling sounds. For this reason, it may be advantageous to have a method to boost subtle (soft) audio objects and bed channels when they are out of the screen. Bed channels may be turned into static objects as described above. According to some embodiments, the boost may increase proportionally to the z coordinate, so objects on the ceiling and Lc/Rc bed channels are boosted more, while objects on the horizontal plane are not boosted.
  • the properties of the audio object are determined to be adjusted only if the distance value exceeds a threshold value, wherein upon determining that properties of the audio object should be adjusted, the total energy level is adjusted at least based on the distance value and on the z-value of the spatial position.
  • Figure 12 shows a transfer function between a y-coordinate (of the time frame) of the audio object, and a max boost of the energy level (e.g., RMS).
  • the threshold value could be 0 or 0.01 or 0.1 or any other suitable value.
  • Figure 13 shows a transfer function between a z-coordinate (of the time frame) of the audio object, and a max boost of the energy level. The energy level is thus adjusted based on the distance value and on the z-value of the spatial position.
  • Figure 11 shows by way of example how boosting of low energy audio objects may be achieved.
  • Figure 11, left, shows boosting the low level parts.
  • a max boost limit 1104 allows us to obtain the desirable curve of figure 11, right.
  • first, an energy level of the time frame of the audio object needs to be determined, e.g. the RMS of the audio content of the audio object.
  • the energy level is adjusted also based on this energy level, but only if the energy level does not exceed a threshold energy level 1102.
  • the boost is adapted to a boost at previous frames for this audio object, to achieve a smooth boosting of the audio object.
  • the method may comprise receiving an energy adjustment parameter pertaining to a previous time frame of the audio object, wherein the energy level is adjusted also based on the energy adjustment parameter.
  • the algorithm for adjusting the energy level of the audio object may be as follows:
  • adaptive_boost = alpha_attack * boost + (1 - alpha_attack) * previous_boost;
  • adaptive_boost = alpha_release * boost + (1 - alpha_release) * previous_boost;
  • alpha_attack and alpha_release are different time constants depending on whether the level of the previous audio frame was softer or louder than the current one
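The two update rules above amount to one-pole smoothing of the boost across frames. The sketch below assumes that the attack coefficient is used when the target boost rises above the previous boost and the release coefficient otherwise; this selection rule and the default coefficient values are assumptions, since the text only states that the choice depends on whether the previous frame was softer or louder.

```python
def smooth_boost(boost: float,
                 previous_boost: float,
                 alpha_attack: float = 0.5,
                 alpha_release: float = 0.25) -> float:
    """Blend the target boost with the boost applied to the previous frame."""
    # Assumed rule: a rising boost uses the attack constant, a falling one the release.
    alpha = alpha_attack if boost > previous_boost else alpha_release
    return alpha * boost + (1.0 - alpha) * previous_boost


def smooth_sequence(boosts, initial_boost: float = 1.0):
    """Apply the smoothing frame by frame, feeding each result forward."""
    result = []
    previous = initial_boost
    for boost in boosts:
        previous = smooth_boost(boost, previous)
        result.append(previous)
    return result
```

With these defaults, a target boost of 2.0 following a previous boost of 1.0 is smoothed to 1.5, so the applied gain approaches the target over several frames instead of jumping.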
  • a user control "boost amount" in the range [0, 1] is converted to the max boost limit 1104 and the threshold energy level 1102 so that a value of 0 has no effect, while a value of 1 achieves maximum effect.
  • - Boost has to depend on loudness and position.
  • Figures 14-15 show other embodiments of methods for processing a time frame of an audio object.
  • Extracted objects are located in the room according to their spatial position (x,y) inferred from the 5.1/7.1 audio, and the z coordinate may be a function of the spatial position (x,y) such that as the object moves inside the room, the z-value increases.
  • Figures 14-15 describe a method for pushing objects to the ceiling when they were panned on the walls in the rear part of the room.
  • the proposed method consists of modifying the canonical 5.1/7.1 speaker positions by pushing the surround speakers (Lrs, Rrs) inside the room, so that audio objects located on the walls will naturally gain elevation.
  • the z value of the spatial position may then be adjusted based on the distance value. For example, the further back in the room the spatial position is, the larger the z value will be.
  • the z value is adjusted to a first value for a first distance value, and to a second value lower than the first value for a second distance value being lower than the first distance value.
  • the object position (x,y) is computed from the gains of the 5.1/7.1 speakers and their canonical positions, essentially by inverting the panning law. If the surround speakers are moved from their canonical positions towards the centre of the room, inverting the panning law warps the object trajectories, essentially bending them inside the room and thereby causing the z coordinate to grow.
  • Figure 14 illustrates the concept where the Lrs and the Rrs speakers 1404, 1406 are moved towards the center of the room, which means that also the position of the audio object 1402 is moved. How much the speakers are moved into the room may depend on the parameter "remap amount" in the range [0, 1], where a value of 0 produces no change in the usually obtained object position, while a value of 1 reaches the full effect.
  • the input to this algorithm is the position of the object (x, y, z) and the amount of remapping (i.e., the control value).
  • the output is a new object position where (x, y) are preserved and z is adjusted.
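The input/output behaviour described above can be illustrated with a simplified sketch. It does not reproduce the panning-law inversion with displaced surround speakers; instead it substitutes a simple lift that grows linearly with y behind an assumed threshold and is scaled by the remap amount. The name `ceiling_attraction` and the parameter `y_threshold` are hypothetical.

```python
from typing import Tuple

Position = Tuple[float, float, float]


def ceiling_attraction(position: Position,
                       remap_amount: float,
                       y_threshold: float = 0.5) -> Position:
    """Keep (x, y) and raise z for objects located behind y_threshold.

    Simplified stand-in for the remapping: the lift grows linearly with y
    and is scaled by remap_amount (the control value) in [0, 1].
    """
    if not 0.0 <= remap_amount <= 1.0:
        raise ValueError("remap_amount must lie in [0, 1]")
    if not 0.0 <= y_threshold < 1.0:
        raise ValueError("y_threshold must lie in [0, 1)")
    x, y, z = position
    if y <= y_threshold:
        return position  # properties are not adjusted near the screen
    lift = remap_amount * (y - y_threshold) / (1.0 - y_threshold)
    # Never lower an object that is already elevated, and stay below the ceiling.
    return (x, y, min(1.0, max(z, lift)))
```

A remap amount of 0 leaves every position unchanged, while a remap amount of 1 moves an object on the rear wall (y = 1) to the ceiling (z = 1) and an object halfway between the threshold and the rear wall to z = 0.5.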
  • the above effect can be applied to the channels (e.g., bed channels) by turning them into static objects at canonical positions.
  • the present disclosure also relates to a method for storing, archiving, rendering or streaming content produced with the above methods.
  • the method is based on the observation that the final Atmos content, when authored via LTA and the post-processing described above, can be re-obtained from the information contained only in:
  • This method has several advantages. When storing/archiving in this way, space (computer memory) is saved. When streaming/broadcasting, only a tiny amount of bandwidth needs to be added to the standard 5.1/7.1 content, as long as the receivers are able to run LTA on the 5.1/7.1 content using the additional parameters. Also, in workflows for language dubbing, the 5.1/7.1 stems are always distributed anyway. So if the LTA version is supposed to be dubbed, all that worldwide studios need to share, besides what they currently do, is the small file containing the LTA parameters as described above.
  • the set of parameters to be stored includes all those described in this disclosure, as well as all others needed to fully determine the LTA process, including, for example, those disclosed in the above disclosure aimed at preserving artistic decisions made during creation of the original 5.1/7.1.
  • the division of tasks between functional units or stages referred to in the above description does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
  • Certain components or all components may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit.
  • Such software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media).
  • computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
  • communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • EEEs enumerated example embodiments
  • EEE 1. A method for converting a time frame of a multichannel audio signal into output audio content comprising audio objects, metadata comprising a spatial position for each audio object, and bed channels, wherein the multichannel audio signal comprises a plurality of channels in a first configuration, each channel in the first configuration having a predetermined position pertaining to a loudspeaker setup and defined in a predetermined coordinate system, the method comprising the steps of:
  • EEE 2 upon determining that the risk does not exceed the threshold, include the audio object and metadata comprising the spatial position of the audio object in the output audio object content.
  • EEE 3 The method of EEE 2, wherein the predetermined area includes the predetermined positions of at least some of the plurality of channels in the first configuration.
  • EEE 4. The method of EEE 3, wherein the first configuration corresponds to a 5.1-channel set-up or a 7.1-channel set-up, wherein the predetermined area includes the predetermined positions of a front left channel, a front right channel, and a center channel in the first configuration.
  • EEE 5. The method of EEE 4, wherein the predetermined positions of the front left, front right and center channels share a common y-coordinate value in the predefined coordinate system, wherein the predetermined area includes positions having a y-coordinate value up to a threshold distance away from said common y-coordinate value.
  • EEE 6 The method of any one of EEEs 2-5, wherein the predetermined area comprises a first sub area, the method further comprises the step of: determining a fraction value corresponding to a fraction of the audio object to be included in the output audio object content based on a distance between the spatial position and the first sub area, wherein the value is a number between zero and one,
  • the method further comprises:
  • EEE 7 The method of EEE 1, wherein the step of extracting at least one audio object from the multichannel audio signal comprises, for each extracted audio object, computing a first set of energy levels, each energy level corresponding to a specific channel of the plurality of channels of the multichannel audio signal and relating to an energy level of audio content of the audio object that was extracted from the specific channel,
  • step of estimating a risk comprises the steps of:
  • EEE 8 The method of EEE 7, wherein the step of calculating a difference between the first set of energy levels and the second set of energy levels comprises: using the first set of energy levels, rendering the audio object to a third plurality of channels in the first configuration,
  • step of determining whether the risk exceeds a threshold comprises comparing the sum to the threshold.
  • step of extracting at least one audio object from the multichannel audio signal comprises, for each extracted audio object, computing a first set of energy levels, each energy level corresponding to a specific channel of the plurality of channels of the multichannel audio signal and relating to an energy level of audio content of the audio object that was extracted from the specific channel, the method further comprising the step of: upon determining that the risk exceeds the threshold, using the first set of energy levels for rendering the audio object to the output bed channels.
  • EEE 10 The method of EEE 9 when dependent on EEE 6, further comprising the steps of:
  • EEE 11 The method of any one of EEEs 1-8, further comprising the step of: including in the output audio object content: the audio object, metadata comprising the spatial position of the audio object and additional metadata, wherein the additional metadata indicates at least one from the list of:
  • EEE 12 The method of EEE 11, wherein the step of extracting at least one audio object from the multichannel audio signal comprises, for each extracted audio object, computing a first set of energy levels, each energy level corresponding to a specific channel of the plurality of channels of the multichannel audio signal and relating to an energy level of audio content of the audio object that was extracted from the specific channel, wherein the additional metadata comprises the first set of energy levels.
  • EEE 13 The method according to any one of EEEs 1-12, wherein the step of extracting at least one audio object from the multichannel audio signal comprises, for each extracted audio object, computing the first set of energy levels, each energy level corresponding to a specific channel of the plurality of channels of the
  • the method further comprises the steps of:
  • step c)-f) on each further audio object of the at least one further audio object.
  • EEE 14 The method of EEE 13, wherein the method of any one of EEEs 2-12 is performed on each further audio object of the at least one further audio object.
  • EEE 15 The method of any one of EEEs 13-14, wherein at least one yet further audio object is extracted as described in EEE 13, until at least one stop criterion of the following list of stop criteria is met:
  • an energy level of an extracted further audio object is less than a first threshold energy level,
  • a total number of extracted audio objects exceeds a threshold number, and
  • an energy level of the obtained time frame of the difference multichannel audio signal is less than a second threshold energy level.
  • EEE 16 A computer program product comprising a computer-readable storage medium with instructions adapted to carry out the method of any one of EEEs 1-15 when executed by a device having processing capability.
  • EEE 17. A device for converting a time frame of a multichannel audio signal into output audio content comprising audio objects, metadata comprising a spatial position for each audio object, and bed channels, wherein the multichannel audio signal comprises a plurality of channels in a first configuration, each channel in the first configuration having a predetermined position pertaining to a loudspeaker setup and defined in a predetermined coordinate system, the device comprising:
  • a receiving stage arranged for receiving the multichannel audio signal
  • an object extraction stage arranged for extracting an audio object from the time frame of the multichannel audio signal, wherein the audio object is extracted from a specific subset of the plurality of channels
  • a spatial position estimating stage arranged for estimating a spatial position of the audio object
  • a risk estimating stage arranged for, based on the spatial position of the audio object, estimating a risk that a rendered version of the audio object in channels in the first configuration will be rendered in channels with predetermined positions differing from the predetermined positions of the specific subset of the plurality of channels from which the object was extracted, and determining whether the risk exceeds a threshold
  • a method for processing a time frame of audio content having a spatial position comprising the steps of:
  • EEE 19 The method of EEE 18, wherein the properties of the audio content are determined to be adjusted if the distance value does not exceed a threshold value, wherein upon determining that properties of the audio content should be adjusted, the spatial position is adjusted at least based on the distance value and on the x-value of the spatial position.
  • EEE 20 The method of EEE 19, wherein the step of adjusting the spatial position comprises adjusting the z value of the spatial position based on the x-value of the spatial position and adjusting the y value of the spatial position based on the x value of the spatial position.
  • EEE 21 The method of EEE 18, wherein the properties of the audio content are determined to be adjusted only if the distance value exceeds a threshold value, wherein upon determining that properties of the audio content should be adjusted, the energy level is adjusted at least based on the distance value and on the z-value of the spatial position.
  • EEE 22 The method of EEE 21, further comprising the step of, prior to the step of determining whether properties of the audio content should be adjusted, determining a current energy level of the time frame of the audio content, wherein the energy level is adjusted also based on the current energy level.
  • EEE 23 The method of EEE 22, wherein the properties of the audio content are determined to be adjusted only if the current energy level does not exceed a threshold energy level.
  • EEE 24 The method of any one of EEEs 21-23, further comprising receiving an energy adjustment parameter pertaining to a previous time frame of the audio content, wherein the energy level is adjusted also based on the energy adjustment parameter.
  • EEE 25 The method of EEE 18, wherein the properties of the audio content are determined to be adjusted only if the distance value exceeds a threshold value, wherein the z value of the spatial position is adjusted based on the distance value.
  • EEE 26 The method of EEE 25, wherein the z value is adjusted to a first value for a first distance value, and to a second value lower than the first value for a second distance value being lower than the first distance value.
  • EEE 27 A computer program product comprising a computer-readable storage medium with instructions adapted to carry out the method of any one of EEEs 18-26 when executed by a device having processing capability.
  • a device for processing a time frame of an audio content comprising a processor arranged to:
  • the processor upon determining that properties of the audio content should be adjusted, is arranged to receive a control value and adjust at least one of the spatial position and an energy level of the audio content at least based on the distance value and the control value.
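
As an illustration only, two of the simpler checks recited in the embodiments above can be sketched in Python: the test of EEE 5 for whether a spatial position lies within a threshold distance of the common y-coordinate of the front channels, and the stop criteria of EEE 15 for the iterative extraction of further audio objects. The names `Position`, `in_front_area` and `should_stop`, all parameter names, and the numeric values in the usage lines are hypothetical and do not appear in the application. Treating the y-threshold as inclusive is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Spatial position of an audio object in the predetermined coordinate system."""
    x: float
    y: float
    z: float


def in_front_area(pos: Position, common_y: float, y_threshold: float) -> bool:
    """Sketch of the area test of EEE 5.

    Returns True when the y-coordinate of `pos` is at most `y_threshold`
    away from the y-coordinate shared by the front left, front right and
    center channels. The inclusive comparison is an assumption.
    """
    return abs(pos.y - common_y) <= y_threshold


def should_stop(object_energy: float,
                num_extracted: int,
                residual_energy: float,
                min_object_energy: float,
                max_objects: int,
                min_residual_energy: float) -> bool:
    """Sketch of the stop criteria of EEE 15.

    Extraction of further audio objects stops as soon as at least one
    criterion is met:
      - the last extracted object is weaker than a first threshold energy,
      - the number of extracted objects exceeds a threshold number,
      - the difference (residual) multichannel signal of the time frame is
        weaker than a second threshold energy.
    """
    return (object_energy < min_object_energy
            or num_extracted > max_objects
            or residual_energy < min_residual_energy)


if __name__ == "__main__":
    # Hypothetical values: front channels at y = 0.0, area 0.1 deep.
    print(in_front_area(Position(0.5, 0.05, 0.0), common_y=0.0, y_threshold=0.1))
    # Hypothetical thresholds for one iteration of the extraction loop.
    print(should_stop(object_energy=0.01, num_extracted=2, residual_energy=1.0,
                      min_object_energy=0.05, max_objects=8,
                      min_residual_energy=0.1))
```

Both functions are pure predicates, so a caller could evaluate them once per time frame and per extracted object without keeping any state between frames.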

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)
PCT/EP2017/062848 2016-06-01 2017-05-29 A method converting multichannel audio content into object-based audio content and a method for processing audio content having a spatial position WO2017207465A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201780033796.2A CN109219847B (zh) 2016-06-01 2017-05-29 将多声道音频内容转换成基于对象的音频内容的方法及用于处理具有空间位置的音频内容的方法
US16/303,415 US10863297B2 (en) 2016-06-01 2017-05-29 Method converting multichannel audio content into object-based audio content and a method for processing audio content having a spatial position
CN202310838307.8A CN116709161A (zh) 2016-06-01 2017-05-29 将多声道音频内容转换成基于对象的音频内容的方法及用于处理具有空间位置的音频内容的方法
EP17726613.7A EP3465678B1 (en) 2016-06-01 2017-05-29 A method converting multichannel audio content into object-based audio content and a method for processing audio content having a spatial position

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
ESP201630716 2016-06-01
ES201630716 2016-06-01
EP16182117 2016-08-01
EP16182117.8 2016-08-01
US201662371016P 2016-08-04 2016-08-04
US62/371,016 2016-08-04

Publications (1)

Publication Number Publication Date
WO2017207465A1 (en) 2017-12-07

Family

ID=60479173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/062848 WO2017207465A1 (en) 2016-06-01 2017-05-29 A method converting multichannel audio content into object-based audio content and a method for processing audio content having a spatial position

Country Status (2)

Country Link
CN (1) CN109219847B (zh)
WO (1) WO2017207465A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11429340B2 (en) * 2019-07-03 2022-08-30 Qualcomm Incorporated Audio capture and rendering for extended reality experiences
US11937070B2 (en) * 2021-07-01 2024-03-19 Tencent America LLC Layered description of space of interest

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016014815A1 (en) * 2014-07-25 2016-01-28 Dolby Laboratories Licensing Corporation Audio object extraction with sub-band object probability estimation
US20160150343A1 (en) * 2013-06-18 2016-05-26 Dolby Laboratories Licensing Corporation Adaptive Audio Content Generation
WO2016106145A1 (en) * 2014-12-22 2016-06-30 Dolby Laboratories Licensing Corporation Projection-based audio object extraction from audio content

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010105695A1 (en) * 2009-03-20 2010-09-23 Nokia Corporation Multi channel audio coding
US9137611B2 (en) * 2011-08-24 2015-09-15 Texas Instruments Incorporation Method, system and computer program product for estimating a level of noise
JP6163545B2 (ja) * 2012-06-14 2017-07-12 ドルビー・インターナショナル・アーベー 可変数の受信チャネルに基づくマルチチャネル・オーディオ・レンダリングのためのなめらかな構成切り換え
US9460723B2 (en) * 2012-06-14 2016-10-04 Dolby International Ab Error concealment strategy in a decoding system
US9412385B2 (en) * 2013-05-28 2016-08-09 Qualcomm Incorporated Performing spatial masking with respect to spherical harmonic coefficients
TW202322101A (zh) * 2013-09-12 2023-06-01 瑞典商杜比國際公司 多聲道音訊系統中之解碼方法、解碼裝置、包含用於執行解碼方法的指令之非暫態電腦可讀取的媒體之電腦程式產品、包含解碼裝置的音訊系統

Also Published As

Publication number Publication date
CN109219847B (zh) 2023-07-25
CN109219847A (zh) 2019-01-15

Similar Documents

Publication Publication Date Title
US10863297B2 (en) Method converting multichannel audio content into object-based audio content and a method for processing audio content having a spatial position
US10595152B2 (en) Processing spatially diffuse or large audio objects
US10362426B2 (en) Upmixing of audio signals
US10638246B2 (en) Audio object extraction with sub-band object probability estimation
US9378747B2 (en) Method and apparatus for layout and format independent 3D audio reproduction
WO2016196226A1 (en) Processing object-based audio signals
US20180115850A1 (en) Processing audio data to compensate for partial hearing loss or an adverse hearing environment
JP2018038086A (ja) サウンドステージ拡張用の装置及び方法
US20200275233A1 (en) Improved Rendering of Immersive Audio Content
US20210195361A1 (en) Method and device for audio signal processing for binaural virtualization
JP2019505842A (ja) 適応的な量子化
WO2017207465A1 (en) A method converting multichannel audio content into object-based audio content and a method for processing audio content having a spatial position
JP7332781B2 (ja) オーディオコンテンツのプレゼンテーションに依存しないマスタリング
US9653065B2 (en) Audio processing device, method, and program
KR20150124176A (ko) 다채널 오디오 신호의 채널 이득 제어 장치 및 방법

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17726613; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2017726613; Country of ref document: EP; Effective date: 20190102)