EP4210352A1 - Audio apparatus and method of operation therefor

Info

Publication number
EP4210352A1
Authority
EP
European Patent Office
Prior art keywords
audio source
audio
point
positions
point audio
Prior art date
Legal status
Pending
Application number
EP22150861.7A
Other languages
German (de)
French (fr)
Inventor
Sam Martin JELFS
Current Assignee
Koninklijke Philips NV
Original Assignee
Koninklijke Philips NV
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips NV
Priority: EP22150861.7A
Publication: EP4210352A1
Legal status: Pending

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 - Control circuits for electronic adaptation of the sound field
    • H04S 7/302 - Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 2400/00 - Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/11 - Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S 2400/13 - Aspects of volume control, not necessarily automatic, in stereophonic sound systems

Definitions

  • VR Virtual Reality
  • AR Augmented Reality
  • MR Mixed Reality
  • XR eXtended Reality
  • a service that is increasingly popular is the provision of images and audio in such a way that a user is able to actively and dynamically interact with the system to change parameters of the rendering, such that the presentation adapts to movement and changes in the user's position and orientation.
  • a very appealing feature in many applications is the ability to change the effective viewing position and viewing direction of the viewer, such as for example allowing the viewer to move and "look around" in the scene being presented.
  • Such a feature can specifically allow a virtual reality experience to be provided to a user. This may allow the user to (relatively) freely move about in a virtual environment and dynamically change his position and where he is looking.
  • virtual reality applications are based on a three-dimensional model of the scene with the model being dynamically evaluated to provide the specific requested view. This approach is well known from e.g. game applications, such as in the category of first person shooters, for computers and consoles.
  • the image being presented is a three-dimensional image, typically presented using a stereoscopic display. Indeed, in order to optimize immersion of the viewer, it is typically preferred for the user to experience the presented scene as a three-dimensional scene. Indeed, a virtual reality experience should preferably allow a user to select his/her own position, viewpoint, and moment in time relative to a virtual world.
  • the audio preferably provides a spatial audio experience where audio sources are perceived to arrive from positions that correspond to the positions of the corresponding objects in the visual scene.
  • the audio and video scenes are preferably perceived to be consistent and with both providing a full spatial experience.
  • many immersive experiences are provided by a virtual audio scene being generated by headphone reproduction using binaural audio rendering technology.
  • headphone reproduction may be based on headtracking such that the rendering can be made responsive to the user's head movements, which highly increases the sense of immersion.
  • An important feature for many applications is that of how to generate and/or distribute audio that can provide a natural and realistic perception of the audio environment.
  • a particular challenge is to represent audio sources that are not limited to a single point source, i.e. which have a spatial acoustic extension/ dimension.
  • 6DoF 6 Degrees of Freedom
  • the user is able to move freely during playback of the content, or in gaming during runtime of the application, and as such the content creator and any encoding algorithms do not know what the listening position may be at any moment in time, nor where the listener is relative to the sound producing objects.
  • an audio object with a given location is typically rendered to the listener by first calculating the object's relative distance from the listener and the direction from the listener to the object (e.g. as azimuth and elevation).
  • the audio signal associated with the object is then convolved with the matching Head Related Impulse Response (HRIR) or Head Related Transfer Function (HRTF).
  • HRIR Head Related Impulse Response
  • HRTF Head Related Transfer Function
  • the resulting (stereo) signal is presented to the listener via headphones with the corresponding distance related time delay and level attenuation.
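  • As a purely illustrative sketch of this standard pipeline (not part of the disclosure), the fragment below renders a mono point source binaurally for one listener position: it derives azimuth, elevation and distance, convolves with a matching HRIR pair, and applies the distance related delay and attenuation. The helper hrir_lookup and the constants are assumptions standing in for an actual HRIR database.

```python
import numpy as np

FS = 48000              # sample rate (Hz), assumed
SPEED_OF_SOUND = 343.0  # m/s

def render_point_source(signal, src_pos, listener_pos, hrir_lookup):
    """Binaurally render a mono point source for one listener position.

    hrir_lookup(azimuth_deg, elevation_deg) -> (hrir_left, hrir_right)
    is a hypothetical accessor for whatever HRIR set the renderer uses.
    """
    offset = np.asarray(src_pos, float) - np.asarray(listener_pos, float)
    distance = max(float(np.linalg.norm(offset)), 1e-6)
    azimuth = np.degrees(np.arctan2(offset[1], offset[0]))
    elevation = np.degrees(np.arcsin(offset[2] / distance))

    hrir_l, hrir_r = hrir_lookup(azimuth, elevation)
    left = np.convolve(signal, hrir_l)
    right = np.convolve(signal, hrir_r)

    # Distance related level attenuation (1/r) and propagation delay.
    gain = 1.0 / distance
    delay = int(round(distance / SPEED_OF_SOUND * FS))
    pad = np.zeros(delay)
    return np.concatenate([pad, gain * left]), np.concatenate([pad, gain * right])
```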
  • An extent audio source has a spatial extension.
  • An extent audio source is a non-single-point audio source.
  • an object may be described by using a simple geometric object (Line, Plane, Box, Sphere, Cone, Cylinder etc) or often by definition of a mesh consisting of vertices and faces.
  • the spatial data is often represented by complex mesh structures which may comprise a large number of polygons and vertices, edges, and faces.
  • the relative size of the extent of the audio source with respect to the listener is constantly changing as the listener moves within the 6DoF environment, and as such the way in which it is rendered needs to constantly be adapted.
  • the perceived width of the source also needs to be adapted.
  • the existing rendering methods require the calculation of the perceived source width, adapting the correlation and potentially other parameters of the audio signals associated with the source and any metadata describing the source, and applying these new correlation levels and parameters to the audio signals.
  • This requires the audio rendering technology to have detailed information relating to the object representing the acoustic extent, and to calculate relative perceived widths in real time.
  • an improved approach for rendering audio would be advantageous.
  • an approach that allows improved operation, increased flexibility, reduced complexity, facilitated implementation, an improved audio experience, improved audio quality, reduced computational burden, improved suitability for varying positions, improved performance for virtual/mixed/ augmented/ extended reality applications, improved perceptual cues for spatial audio, increased and/or facilitated adaptability, increased processing flexibility, improved and/or facilitated rendering of the spatial extent of audio source and/or improved performance and/or operation would be advantageous.
  • the invention seeks to mitigate, alleviate or eliminate one or more of the above mentioned disadvantages, singly or in any combination.
  • an audio apparatus comprising: a receiver arranged to receive an audio signal and a geometric model for an audio source object; a position circuit arranged to determine a set of point audio source positions for the audio source object in response to the geometric model, the point audio source positions being spatially distributed within the audio source object; a point audio signal generator arranged to generate at least one point audio source signal for the point audio source positions from the audio signal; and a data generator arranged to generate a data representation of the audio source object comprising the set of point audio source positions and the at least one point audio source signal.
  • the invention may allow improved and/or facilitated rendering of audio for audio source object having a spatial extent.
  • the invention may in many embodiments and scenarios generate a more naturally perceived acoustic spatial extent of an audio source object.
  • the approach may in many scenarios reduce computational resource requirements and usage substantially. In many embodiments, the approach may obviate the need for evaluating a geometric model of the audio source object for changing listening positions.
  • the approach may typically provide improved and/or reduced complexity processing in applications where listening positions may change dynamically, such as e.g. in many XR applications.
  • the generated data representation may be independent of the listening position. Rendering of audio based on the data representation may typically not require any processing/ evaluation/ adaptation of a complex geometric model, such as e.g. a mesh model.
  • the rendering of audio output signals for the audio source object based on the generated data representation may often require reduced complexity and reduced computational resource requirements in comparison to a rendering of the same quality based directly on the audio signal and the geometric model.
  • the at least one point audio signal may for a given point audio source position represent audio of a point audio source located at the point audio source position.
  • the audio source object may be an audio source of a scene or environment.
  • the scene or environment may be a virtual or real scene.
  • the audio source object may represent audio for a virtual or real (scene) object.
  • the point audio source positions may be spread out in the audio source object, or in at least one region of the audio source object.
  • the position circuit 203 may generate spatially distributed point audio source positions by determining the point audio source positions to have a minimum distance to a closest neighbor point audio source position. The minimum distance may be predetermined or may be determined in response to a spatial property of the audio source object.
  • the position circuit is arranged to determine the point audio source positions such that a distance from any position of the set of point audio source positions to a surface of the audio source object does not exceed a distance threshold.
  • This may provide improved performance and/or reduced complexity/ resource demand. In many scenarios, it may allow fewer point audio source positions to be required for a given perceived spatial sound quality.
  • the surface may be a boundary or edge of the audio source object.
  • the distance threshold may be a predetermined distance, or may be dependent on a spatial property of the audio source object, such as a size or maximum dimension of the audio source object.
  • the position circuit may be arranged to determine the point audio source positions such that a distance from any position of the set of point audio source positions to a nearest point on a surface of the audio source object does not exceed a distance threshold.
  • the position circuit is arranged to determine a set of intersect positions of a grid and to select the set of point audio source positions from the intersect positions.
  • This may typically provide a low complexity yet high quality determination of point audio source positions. It may in many scenarios allow an improved spatial perception of the audio source object.
  • the position circuit is arranged to determine the set of point audio source positions to satisfy at least one requirement of:
    • a requirement that a maximum distance between each position of the set of point audio source positions and a nearest position of the set of point audio source positions is less than a first distance threshold;
    • a requirement that a minimum distance between each position of the set of point audio source positions and a nearest position of the set of point audio source positions is more than a second distance threshold;
    • a requirement that a number of points of the set of point audio source positions does not exceed a first number;
    • a requirement that a number of points of the set of point audio source positions is not below a second number; and
    • a requirement that a maximum distance from any point of a surface of the audio source object to a nearest point of the set of point audio source positions is less than a third distance threshold.
  • This may provide improved performance and/or reduced complexity/ resource demand. It may allow point audio source positions to be determined which may ensure that a sufficient perception of the spatial extent of the audio source object is provided.
  • the point audio signal generator is arranged to determine at least one audio level for the set of point audio source positions in response to an audio level for the audio source object and a number of positions in the set of point audio source positions.
  • the audio level for the audio source object may be a desired/ target audio level for the audio source object.
  • the point audio signal generator is arranged to determine the at least one audio level in response to positions in the set of point audio source positions.
  • the data generator is arranged to generate the data representation to include a relative render priority for the set of point audio source positions.
  • An improved and typically more flexible operation can be achieved. For example, it may allow improved adaptation to resource availability of different renderers. It may in many scenarios assist in reducing the perceived impact of limited rendering resource.
  • the relative render priority for one point audio source position may indicate a rendering priority relative to other point audio source positions.
  • the data generator is arranged to generate the data representation to include directional sound propagation data for at least one position of the set of point audio source positions.
  • the position circuit is arranged to determine the set of point audio source positions in response to a two dimensional extent of the audio source object when viewed from a given region relative to a position of the audio source object.
  • This may allow improved and/or facilitated operation in many embodiments.
  • the position circuit is arranged to determine a number of dimensions for which the audio source object has an extent exceeding a threshold, and to generate the set of point audio source positions as a structure having the number of dimensions.
  • This may allow improved and/or facilitated operation in many embodiments.
  • the data generator is arranged to generate the data representation to include an indication of a propagation time parameter for the set of point audio source positions.
  • the audio apparatus comprises a renderer arranged to render the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  • the approach may in many scenarios allow an improved and/or facilitated rendering of an audio source object having a spatial extent.
  • the audio apparatus further comprises an encoder for generating an encoded bitstream comprising the data representation of the audio source object.
  • the approach may in many scenarios allow an improved encoded bitstream representing an audio source object having a spatial extent to be generated.
  • an audio apparatus comprising: a receiver for receiving an audio bitstream comprising a data representation of an audio source object having a spatial extent, the data representation comprising: a set of point audio source positions being distributed within the audio source object, and at least one point audio source signal for the point audio source positions; and a renderer arranged to render the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  • a method of operation for an audio apparatus comprising: receiving an audio signal and a geometric model for an audio source object; determining a set of point audio source positions for the audio source object in response to the geometric model, the point audio source positions being spatially distributed within the audio source object; generating at least one point audio source signal for the point audio source positions from the audio signal; and generating a data representation of the audio source object comprising the set of point audio source positions and the at least one point audio source signal.
  • a method of operation for an audio apparatus comprising: receiving an audio bitstream comprising a data representation of an audio source object having a spatial extent, the data representation comprising: a set of point audio source positions being distributed within the audio source object, and at least one point audio source signal for the point audio source positions; and rendering the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  • an audio bitstream comprising a data representation of an audio source object having a spatial extent; the data representation comprising: a set of point audio source positions being distributed within the audio source object; and at least one point audio source signal for the point audio source positions from the audio signal.
  • the described approach will focus on such applications where audio rendering is adapted to reflect acoustic variations and changes in the audio perception as a (possibly virtual) user / listener position changes.
  • the described principles and concepts may be used in many other applications and embodiments.
  • the XR application may be provided locally to a viewer by e.g. a stand-alone device that does not use, or even have any access to, any remote XR data or processing.
  • a device such as a games console may comprise a store for storing the scene data, input for receiving/ generating the viewer pose, and a processor for generating the corresponding images from the scene data.
  • the XR application may be implemented and performed remote from the viewer.
  • a device local to the user may detect/ receive movement/ pose data which is transmitted to a remote device that processes the data to generate the viewer pose.
  • the remote device may then generate suitable view images and corresponding audio signals for the user pose based on scene data describing the scene.
  • the view images and corresponding audio signals are then transmitted to the device local to the viewer where they are presented.
  • the remote device may directly generate a video stream (typically a stereo/ 3D video stream) and corresponding audio stream which is directly presented by the local device.
  • the local device may not perform any XR processing except for transmitting movement data and presenting received video data.
  • the functionality may be distributed across a local device and remote device.
  • the local device may process received input and sensor data to generate user poses that are continuously transmitted to the remote XR device.
  • the remote XR device may then generate the corresponding view images and corresponding audio signals and transmit these to the local device for presentation.
  • the remote XR device may not directly generate the view images and corresponding audio signals but may select relevant scene data and transmit this to the local device, which may then generate the view images and corresponding audio signals that are presented.
  • the remote XR device may identify the closest capture point and extract the corresponding scene data (e.g. a set of object sources and their position metadata) and transmit this to the local device.
  • the local device may then process the received scene data to generate the images and audio signals for the specific, current user pose.
  • the user pose will typically correspond to the head pose, and references to the user pose may typically equivalently be considered to correspond to the references to the head pose.
  • a source may transmit or stream scene data in the form of an image (including video) and audio representation of the scene which is independent of the user pose. For example, signals and metadata corresponding to audio sources within the confines of a certain virtual room may be transmitted or streamed to a plurality of clients. The individual clients may then locally synthesize audio signals corresponding to the current user pose. Similarly, the source may transmit a general description of the audio environment including describing audio sources in the environment and acoustic characteristics of the environment. An audio representation may then be generated locally and presented to the user, for example using binaural rendering and processing.
  • FIG. 1 illustrates such an example of an XR system in which a remote XR client device 101 liaises with an XR server 103, e.g. via a network 105, such as the Internet.
  • the server 103 may be arranged to simultaneously support a potentially large number of client devices 101.
  • the XR server 103 may for example support a broadcast experience by transmitting an image signal comprising an image representation in the form of image data that can be used by the client devices to locally synthesize view images corresponding to the appropriate user poses (a pose refers to a position and/or orientation). Similarly, the XR server 103 may transmit an audio representation of the scene allowing the audio to be locally synthesized for the user poses. Specifically, as the user moves around in the virtual environment, the image and audio synthesized and presented to the user is updated to reflect the current (virtual) position and orientation of the user in the (virtual) environment.
  • the audio representation may include an audio representation for a plurality of different sound sources in the environment. This may include some sound sources corresponding to general diffuse and non-localized sound, such as ambient and background sounds for the environment as a whole. It will typically also comprise a number of audio representations for point-size audio sources to be rendered from specific positions in the scene. However, in addition, the sound representation may include a number of audio sources that have a spatial extent (which in many cases may equivalently be referred to as having spatial extension), and which should preferably be rendered such that a user is provided with a perception of the spatial extent and perceived dimension (often width) of the sound source.
  • extent audio sources are represented by a geometric model that describes the spatial properties of the audio source object.
  • audio data together with data describing a geometric model may provide a representation of an extent audio source object.
  • the geometric model may typically be a model that is also used to describe visual properties of the object corresponding to the audio source object; the same geometric model and data may then be used to represent spatial properties of the object for both audio and visual rendering.
  • a polygon mesh may be a collection of vertices, edges and faces that defines the shape of a polyhedral object.
  • For visual properties, texture data, color, brightness, and possibly other visual properties may be provided for the polygons.
  • the visual rendering of such an object typically involves processing the mesh and applying the visual properties for the given viewing pose as is well known in the art.
  • a renderer may process the geometric model to determine spatial characteristics that are then used to adapt properties of the rendered audio such as diffusion, correlation etc.
  • properties of the rendered audio such as diffusion, correlation etc.
  • a determination of spatial properties and the associated signal properties tend to be very complex and resource demanding.
  • evaluating complex mesh models to determine how the perceived spatial width of an object varies as the user moves may require millions of operations and calculations, which may require a high computational resource. Indeed, in many cases it may even require dedicated hardware in order to allow real time processing and adaptation.
  • an approach will be described which may generate and use a representation of audio source objects with spatial extent that in many situations, scenarios, and applications may provide improved and/or facilitated operation. It may for example, reduce complexity and may provide an audio representation that does not require the complex processing of polygon meshes or other geometric models.
  • FIG. 2 illustrates an audio apparatus that can generate a data representation of an audio source object which has a spatial extent.
  • the audio apparatus may for example be part of the server 103 or the client 101 of FIG. 1 .
  • the audio apparatus comprises a receiver 201 which is arranged to receive at least one audio signal for an audio source object.
  • the receiver receives a geometric model which provides a description of the spatial properties of the audio source object.
  • the geometric model may specifically be a polygon mesh description of the audio source object. Such models are frequently used in the field to represent spatial properties for objects in a real or virtual environment. It may typically allow accurate representations and is frequently used in e.g. computer vision. In other embodiments, other geometric models may be used.
  • the geometric model may be defined as a simple 3D object such as a cuboid, sphere, cylinder, etc.
  • the audio signal for the audio source object may be provided as an encoded audio data signal describing the audio to be produced by the audio source object.
  • only a single audio signal may be provided to represent the audio to be produced by the audio source object.
  • more than one audio signal may be provided for one audio source object.
  • an audio signal may be provided to represent audio originating from one part (e.g. the left part) of the audio source object and a second audio signal may be provided to represent audio originating from a different part (e.g. the right part) of the audio source object.
  • Such audio signals may possibly be closely correlated but differ in some aspects, such as e.g. by having different levels, different frequency spectra (filtering), or different signal components, etc.
  • In some cases, the audio signals may be provided as completely different audio signals; in other cases, they may be provided as a first signal together with data describing the difference of the second audio signal, such as a filter or level adjustment to be applied to the first signal.
  • the audio apparatus further comprises a position circuit 203 which is arranged to determine a set of point audio source positions for the audio source object.
  • the position circuit determines the point audio sources as spatially distributed positions within the audio source object.
  • the audio apparatus further comprises a point audio signal generator 205 which is arranged to generate at least one point audio source signal for the point audio source positions.
  • the point audio signal generator 205 may in some embodiments, simply generate the point audio source signal as the first audio signal, i.e. in some embodiments and scenarios, the point audio signal may directly be the same as the audio signal received and representing the audio of the audio source object.
  • the point audio signal generator 205 may be arranged to generate the point audio source signal by processing the received first audio signal, such as for example by applying a filtering or level adaptation.
  • Each point audio source position of the set of point audio source positions indicates a position for a single point audio source of a set of single point audio sources which together form the audio source corresponding to the audio source object.
  • the audio apparatus also comprises a data generator 207 which is coupled to the position circuit 203 and the point audio signal generator 205.
  • the data generator 207 is arranged to generate a data representation of the audio source object comprising the set of point audio source positions and the (at least one) point audio source signal.
  • the audio apparatus may generate a new representation of the audio source object with this being represented by a plurality of spatially distributed point audio source positions and associated point audio source signal(s).
  • the audio apparatus of FIG. 2 may receive a representation of a spatial extent audio source object given by an audio signal representing the audio of the audio source object and a geometric model for the audio source object spatial extent.
  • the audio apparatus may generate a new representation of the spatially extended audio source object where the spatial extent is represented by a plurality of point audio sources.
  • the approach may represent the audio source object as a plurality of point audio sources (each point audio source position corresponding to the position of one point audio source of this plurality).
  • a rendering of the audio source object may be based on this modified data representation and specifically the rendering may be performed by rendering a point source audio signal from each point audio source position based on the point audio source signal.
  • the rendering may for each point audio source position generate one rendered audio signal corresponding to a point audio source being positioned at the point audio source position.
  • the individual rendering for each audio source position may thus be performed without considering any spatial properties of the audio source object (except for the point audio source position).
  • the individually rendered audio signals for the different point audio source positions may then be combined such that the combined signal of the different positions represents the audio source object.
  • the approach may also in many cases provide a computationally efficient process of representing spatial extents of audio objects.
  • it may require only point source audio rendering and may obviate the necessity for evaluating a complex geometric model with respect to the current listening pose.
  • the geometric model may in many embodiments only be evaluated when generating the point audio source positions for the new representation of the audio source object.
  • Such an evaluation is typically performed only once as it is independent of the listening position whereas a traditional approach requires the model to be evaluated for each new listening position.
  • although some additional complexity may sometimes be required for rendering multiple point source audio signals rather than fewer audio signals representing distributed audio sources, the overall computational reduction is typically very significant. This is especially the case for complex geometric models and shapes, such as specifically mesh representations of complex shapes.
  • the generation of the modified data representation may be performed once for multiple rendering devices and operations, such as for example by a central server serving multiple clients.
  • the approach may provide improved audio quality. For example, a reduced complexity may allow more resource to be allocated to more accurate rendering (including potentially of other audio sources in the environment). Further, in itself, the perception provided by multiple point sources may often be considered more accurate than that which is achieved by traditional approaches.
  • an audio source object to be rendered as emanating from a geometric extent object may be rendered with the spatial extension of the audio source object being represented using a plurality of individual audio point sources.
  • the audio apparatus of FIG. 2 may for example be included in a rendering device such as for example the client 101 of FIG. 1 .
  • the server 103 may generate an XR audio visual data stream which includes a mesh model for various objects in the environment/ scene. At least one of these objects may correspond to an audio object with spatial extent for which the audio visual data stream further comprises audio data.
  • the XR audio visual data stream may be transmitted to the client 101 which in the example includes the audio apparatus 200 of FIG. 2 as illustrated in the example of FIG. 3 .
  • the audio apparatus 200 may perform the described operation to generate a modified representation of the spatial extent audio source object by a plurality of distributed point audio sources and associated point audio source data.
  • the client 101 may include a renderer 301 which is arranged to render audio for the scene.
  • the renderer 301 is arranged to include functionality for spatially rendering audio from point sources. It will be appreciated that many algorithms and operations for such rendering are known to the skilled person, including for example HRTF and BRIR based rendering for headphones, and that these for brevity will not be described further herein.
  • the renderer 301 may specifically receive a listener pose and for that pose render a point source audio signal from each point audio source generated for the audio source object.
  • rather than the audio for the spatial extent audio source object being rendered as a single diffuse signal, it is rendered as a plurality of point source audio signals from different positions within the audio source object.
  • the audio signal rendered for each point audio source signal may be rendered independently of any other audio signal rendering for any other point audio source position.
  • the rendering of one point source audio signal may thus be based only on the point source audio signal associated with the point audio source position and on the point audio source position.
  • Each point audio signal may be rendered independently of other point audio source positions and point audio signals.
  • the generated point audio signals may then be combined for rendering to the user e.g. via loudspeakers or headphones.
  • each rendering may be performed separately and taking into account only the point audio source position and the listening position/ pose. No consideration of the geometric extent of the audio source object is needed for the rendering.
  • each rendering may use well-known and efficient rendering algorithms and approaches. Further, no evaluation of a geometric model is required for each listening position.
  • the approach may represent an extent object as a number of discrete points, and then the standard object rendering method may be used to give the perception of source width without the need for decorrelation of the audio signals or for calculations of perceived width and decorrelation metrics, or for the storage of object surface descriptions.
  • the audio apparatus may be part of an audio visual bitstream generator, such as for example an encoder or server.
  • the audio apparatus 200 may in some embodiments be part of the server 103 of FIG. 1 .
  • the server 103 may receive or generate an audio visual data that includes a mesh model for various objects in an environment/ scene. At least one of these objects may correspond to a spatial extent audio object for which audio data is also provided/ generated.
  • Such data may be received by the audio apparatus 200 of the server 103 which may include elements as illustrated in FIG. 4 .
  • the audio apparatus of the server 103 may proceed to perform the described operations to generate a representation of the audio object using a plurality of point audio source positions and associated audio signal(s).
  • This representation may then be provided to an encoding unit 401 arranged to generate an encoded bitstream comprising the data representation of the audio source object.
  • the encoder may encode the point audio source positions and point audio signals in any suitable form, and it will be appreciated that many approaches for encoding audio visual signals are known and appropriate.
  • the audio signals may be encoded using a known audio encoding algorithm and format, and the point audio source positions may be encoded as metadata associated with the individual point audio source signal.
  • the encoder, and in the specific example the server 103, may generate an audio bitstream that comprises a data representation of an audio source object having a spatial extent, with the data representation comprising: a set of point audio source positions being distributed within the audio source object, and at least one point audio source signal for the point audio source positions generated from the first audio signal.
  • An advantage of the approach is in many embodiments that the point source based representation can be generated without any consideration of the specific listening position, i.e. the representation can be listening position independent.
  • the position circuit 203 may be arranged to determine a set of intersect positions of a grid and to select the set of point audio source positions from the intersect positions.
  • the grid may in many embodiments be a regular grid but may in some embodiments be an irregular or unstructured grid (such as e.g. a Ruppert's grid).
  • a regular grid may be a tessellation of n-dimensional space by parallelotopes and an irregular grid may be based on other shapes than parallelotopes.
  • the intersect points may be corners/ vertices of the parallelotopes/ shapes.
  • the intersect positions may be positions where lines or curves describing or defining the tessellation intersect.
  • the grid may specifically be a Cartesian grid. In many embodiments, the grid may be a one, two, or three dimensional grid of Euclidian space.
  • the point audio source positions may be determined as equidistant positions.
  • the distance from a point audio source position to a nearest neighbor point audio source position may in some embodiments be constant for different point audio source positions.
  • the position circuit 203 may align a predetermined regular grid with equidistant positions/ intersection points to the audio source object, and then evaluate the geometric model to identify the positions/ intersections that fall within the audio source object. These positions may then be selected as the point audio source positions for the audio source object.
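  • A minimal sketch of such a grid based selection is given below; the inside_fn point-in-object test and the unit sphere stand-in for the geometric model are illustrative assumptions (a mesh model would use a point-in-mesh query instead).

```python
import numpy as np

def grid_positions(inside_fn, bbox_min, bbox_max, spacing):
    """Return the regular-grid intersection points that fall inside the object."""
    axes = [np.arange(lo, hi + spacing, spacing)
            for lo, hi in zip(bbox_min, bbox_max)]
    candidates = np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, 3)
    return np.array([p for p in candidates if inside_fn(p)])

# Usage with a unit sphere at the origin as a stand-in geometric model.
inside_sphere = lambda p: np.linalg.norm(p) <= 1.0
points = grid_positions(inside_sphere, (-1, -1, -1), (1, 1, 1), spacing=0.5)
```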
  • the alignment between the audio source object and the grid may in some embodiments be in accordance with a suitable algorithm or criteria (e.g. a reference position of the grid is aligned with e.g. a lowest/ highest etc. point of the audio source object; a reference position of the grid is positioned within the object, such as e.g. at a center).
  • in other embodiments, an arbitrary alignment may be applied between the object and the grid.
  • An example of such an approach is illustrated in FIG. 5 .
  • the audio source object is an elongated object in which only one row/ column of positions falls within the object.
  • FIG. 6 illustrates an example for an object that has a significant extent in two directions but not in a third direction.
  • the position circuit 203 may be arranged to determine a number of dimensions for which the audio source object has a perceived extent exceeding a threshold. For example, for the audio source object of FIG. 5 , the position circuit 203 may determine that the audio source object has a significant extent in only one direction, and thus that it is essentially a one dimensional object. For the object of FIG. 6 , the position circuit 203 may determine that it is an object with significant spatial extent in two directions and thus essentially is a two dimensional object.
  • the position circuit 203 may then generate the set of point audio source positions as a structure having the same number of dimensions as determined by the position circuit 203. For example, a grid having the same number of dimensions as determined by the position circuit 203 may be used to determine the point audio source positions.
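  • One possible way of estimating the number of significant dimensions from a vertex based model is sketched below; the use of bounding box extents and the absolute threshold value are assumptions for the example.

```python
import numpy as np

def significant_dimensions(vertices, threshold):
    """Count the axes along which the bounding-box extent exceeds the threshold."""
    extents = vertices.max(axis=0) - vertices.min(axis=0)
    return int(np.sum(extents > threshold))

# A thin, elongated object: essentially one dimensional for this purpose.
verts = np.array([[0.0, 0.0, 0.0], [3.0, 0.05, 0.02], [1.5, 0.02, 0.01]])
ndims = significant_dimensions(verts, threshold=0.5)  # -> 1
```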
  • the audio apparatus may proceed through the following steps:
  • the position circuit 203 is arranged to determine the point audio source positions such that a distance from any of these positions to a boundary or surface of the audio source object is less than a distance threshold. Thus, it may be required that a position is only included if it is sufficiently close to the surface of the object.
  • the audio apparatus may check that the point audio source positions are located within a prescribed distance of the surface of the audio source object, and it may remove any positions that are in the center of the object.
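  • A sketch of such a surface proximity filter is shown below, assuming a unit sphere as the geometric model so that the distance from a point p to the surface is simply |1 - |p||; for a mesh, the distance function would instead be a nearest-point-on-mesh query.

```python
import numpy as np

def keep_near_surface(points, surface_distance_fn, max_distance):
    """Discard positions further than max_distance from the object surface."""
    return np.array([p for p in points
                     if surface_distance_fn(p) <= max_distance])

sphere_surface_dist = lambda p: abs(1.0 - np.linalg.norm(p))
candidates = np.array([[0.0, 0.0, 0.0], [0.9, 0.0, 0.0], [0.5, 0.5, 0.0]])
# Keeps the two near-surface points and removes the centre position.
shell = keep_near_surface(candidates, sphere_surface_dist, max_distance=0.3)
```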
  • the position circuit 203 may be arranged to determine the point audio source positions to satisfy one or more requirements.
  • point audio source positions may be determined as random positions, and these may then be evaluated to see if they meet a set of one or more requirements. For example, an iterative operation may be performed where a new point audio source position is randomly generated at each iteration. The random point audio source position is then evaluated in accordance with the set of requirements, and if all are met, the point audio source position is stored as part of the data representation for the audio source object, and otherwise it is discarded. The process may continue until e.g. a given number of point audio source positions have been determined (with the number potentially being determined based on a spatial property of the audio source object, such as based on a size or volume of the audio source object).
  • the position circuit 203 may be arranged to determine the set of point audio source positions to satisfy one, more or all of the following:
  • the placement of point audio source positions is not controlled using a regular pattern, but instead the point audio source positions may be randomly generated within a certain set of rules, for example but not limited to, a minimum distance between individual sources, a maximum number of sources, or a minimum distance to the surface of the extent.
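  • The fragment below illustrates such rule constrained random placement by rejection sampling; the particular rules (minimum spacing, maximum source count) and all parameter values are examples only.

```python
import numpy as np

def random_positions(inside_fn, bbox_min, bbox_max,
                     min_spacing, max_sources, max_tries=10000, seed=0):
    """Randomly place point source positions inside the object, enforcing a
    minimum spacing between sources and a maximum number of sources."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) >= max_sources:
            break
        p = rng.uniform(bbox_min, bbox_max)
        if not inside_fn(p):
            continue  # rejected: outside the geometric model
        if any(np.linalg.norm(p - q) < min_spacing for q in accepted):
            continue  # rejected: too close to an existing source
        accepted.append(p)
    return np.array(accepted)

inside_sphere = lambda p: np.linalg.norm(p) <= 1.0
positions = random_positions(inside_sphere, (-1, -1, -1), (1, 1, 1),
                             min_spacing=0.4, max_sources=12)
```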
  • a point audio source signal for the point audio source positions may then be generated from this received audio signal and indeed in some embodiments the received audio signal may be used directly as the point audio source signal.
  • some processing may be included such as for example a filtering or level adjustment.
  • a plurality of point audio source signals may be generated for the point audio source positions for the audio source object. For example, slightly different audio signals may be generated for different parts of the audio source object (e.g. depending on how close they are to the surface or on which surface is the closest).
  • the multiple point audio source signals may be generated from a single received audio signal or in some cases multiple point audio source signals may be generated from received multiple audio signals.
  • each point audio source position is typically assigned one of the generated point audio source signals.
  • the point source audio signals may specifically be a duplication of the input audio signal, possibly with a level adjustment. If there is only one input audio signal, then all point audio source positions will typically be assigned the same point audio source signal. If there are multiple input audio signals with associated metadata to indicate from which part of the audio source object they should be rendered, then the point audio source signal for the individual point audio source position may be determined based on the audio signal that most closely matches the point audio source position.
  • the level of the point audio source signals may typically be adapted. This adaptation may be included to compensate for the audio of the audio source object being represented by multiple audio sources.
  • the audio levels of the point audio source signals may thus specifically depend on the number of point audio source positions and on their position. For example, in order to render the audio source object with a given audio level using a plurality of point audio sources, the level for each point audio source may be adjusted such that the combined effect of all point audio sources combine to provide the desired audio level.
  • the point audio signal generator 205 may be arranged to determine at least one audio level for the set of point audio source positions in response to an audio level of the audio source object and a number of positions in the set of point audio source positions. The level may further be determined in response to positions in the set of point audio source positions.
  • the original sound source level for the audio source object may be defined by a reference distance, with this being the distance at which the level of the audio signal is known either as an absolute reproduction level or a gain level that should be applied at that distance from the object.
  • a number of measurement positions may be determined at the reference distance from the surface of the audio source object (see FIG. 7 ), and the audio level for the point audio source positions/ point audio source signals may be adjusted such that when all point audio source signals are rendered, the level received across all of the measurement positions is equal to that which would be received were a single object rendered at the nearest point on the audio source object surface to the measurement position.
  • Such a determination may be performed using an automated optimization function or other estimation paradigm.
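  • An illustrative sketch of such a calibration is given below. It assumes, as one possible modelling choice, incoherent power summation with 1/r distance attenuation and solves for a single shared gain; the source positions and measurement points are example values.

```python
import numpy as np

def calibrate_gain(point_positions, measurement_points, ref_distance):
    """Single gain for all point sources so that the summed power at each
    measurement point approximates one source heard at ref_distance."""
    target = (1.0 / ref_distance) ** 2
    ratios = []
    for m in measurement_points:
        dists = np.linalg.norm(point_positions - m, axis=1)
        combined = np.sum(1.0 / dists ** 2)  # summed power of unit-gain sources
        ratios.append(target / combined)
    return float(np.sqrt(np.mean(ratios)))

pts = np.array([[0.8, 0.0, 0.0], [-0.8, 0.0, 0.0], [0.0, 0.8, 0.0]])
mics = np.array([[1.3, 0.0, 0.0], [0.0, 1.3, 0.0]])  # 0.3 m outside a unit sphere
gain = calibrate_gain(pts, mics, ref_distance=0.3)
```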
  • the data generator 207 may be arranged to generate the data representation to include a relative render priority for the set of point audio source positions.
  • Metadata may be generated which indicates a relative priority of the point audio source position relative to other point audio source positions.
  • the priority could for example be provided as a ranking of all of the point audio source positions in order of importance for the rendering in accordance with any suitable criterion.
  • the point audio source positions may be allocated a rendering priority from a given set of possible rendering priorities. For example, each point audio source position may be indicated as being mandatory, preferred, or optional.
  • the renderer may in such embodiments select a set of point audio source positions from which to render audio signals in response to the relative render priority.
  • the rendering for the audio source object may then be by rendering of the selected set of point audio source positions.
  • the renderer may in some scenarios select a subset of point audio source positions to render based on the rendering priorities.
  • the renderer may, if more point audio source positions are received than can be rendered, proceed to select a subset based on the rendering priorities. In this way, the rendering can be controlled to typically provide an improved audio experience.
  • the rendering priority may be determined based on spatial relationships between the point audio source positions and possibly relative to spatial properties of the audio source object.
  • all point audio source positions that are closest to a part of the surface of the audio source object may be indicated to be mandatory or have the highest rendering priority.
  • a preferred rendering priority may be given to point audio source positions that have a given minimum distance to the point audio source positions that were given the highest rendering priority, and a given minimum distance to other point audio source positions that are given a preferred rendering priority.
  • all other point audio source positions may be assigned an optional rendering priority.
  • a constrained renderer may in this case select which point audio source positions to render by first selecting all point audio source positions that are indicated to be mandatory, then point audio source positions that are indicated to be preferred, and finally if sufficient resource is still available may select point audio source positions that are indicated to be optional. If only some point audio source positions of a given category can be selected (e.g. due to resource constraints), a suitable selection criterion may be used (including e.g. merely randomly selecting point audio source positions within a given category, or e.g. selecting point audio source positions to have the maximum distance between them).
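  • A sketch of such a constrained selection is shown below, assuming a simple rendering budget and the mandatory/ preferred/ optional categories described above, with random selection within a category as one of the possible criteria.

```python
import numpy as np

def select_for_rendering(tagged_points, budget, seed=1):
    """Pick at most `budget` points: all mandatory first, then preferred,
    then optional; a category that does not fit is randomly subsampled."""
    rng = np.random.default_rng(seed)
    selected = []
    for category in ("mandatory", "preferred", "optional"):
        group = [p for p, prio in tagged_points if prio == category]
        room = budget - len(selected)
        if room <= 0:
            break
        if len(group) > room:
            idx = rng.choice(len(group), size=room, replace=False)
            group = [group[i] for i in idx]
        selected.extend(group)
    return selected
```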
  • Such an approach may provide improved spatial audio perception for constrained rendering by e.g. ensuring that point audio source positions contributing most significantly to the perception of the spatial extent are rendered.
  • the priority indication may be dependent on a distance between the audio source object/ point audio source position and a listening position.
  • Such a priority may be used to reduce the number of point audio source positions that are rendered based on the user-source distance, i.e. the distance between the listening position and the audio source object/point audio source position. As the listening position moves further from the audio source object, the perceived relative width of the audio source object decreases and as such the number of points needed to give an adequate perceived width also decreases. Such an approach may allow a reduction in rendering complexity with increasing source distance.
  • the renderer may for example determine an audio source object or point audio source position to listening position distance and then select the point audio source positions for rendering in response to this distance and the relative distance dependent rendering priorities.
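  • One plausible heuristic for such a distance dependent reduction, sketched below, maps the perceived angular width 2*atan(w/(2d)) of the object to a point budget; the points-per-degree tuning constant is an assumption for illustration.

```python
import numpy as np

def point_budget(object_width, listener_distance,
                 points_per_degree=0.2, minimum=1):
    """Allocate rendering points in proportion to the perceived angular width."""
    angle_deg = np.degrees(2.0 * np.arctan2(object_width, 2.0 * listener_distance))
    return max(minimum, int(round(angle_deg * points_per_degree)))

point_budget(4.0, 2.0)   # close listener: 90 degrees wide -> 18 points
point_budget(4.0, 20.0)  # distant listener: ~11 degrees   -> 2 points
```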
  • the data generator 207 may be arranged to generate the data representation to include directional sound propagation data for at least one position of the set of point audio source positions. The rendering may then render the audio signal from that point audio source position in response to the directional sound propagation data for the position.
  • directional response data may be added to the metadata for each of the additional point audio source positions such that e.g. the reproduction when the listener is external to the audio source object may be controlled separately to the internal reproduction.
  • a directional response may be described as a polar pattern, or some other representation known to the rendering technology.
  • a gain as a function of direction may be provided for a point audio source position and when rendering a point audio signal from that point audio source position the gain of the rendered signal may be determined in response to the gain provided for the direction from the point audio source position to the listening position.
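  • A sketch of such a directional gain is given below, using a first order polar pattern g(theta) = alpha + (1 - alpha)*cos(theta) as one common parametrisation; alpha and the source axis are illustrative assumptions (alpha = 0.5 gives a cardioid).

```python
import numpy as np

def directional_gain(point_pos, listener_pos, axis, alpha=0.5):
    """Gain towards the listener for a first-order polar pattern."""
    direction = np.asarray(listener_pos, float) - np.asarray(point_pos, float)
    direction /= np.linalg.norm(direction)
    axis = np.asarray(axis, float) / np.linalg.norm(axis)
    cos_theta = float(np.dot(axis, direction))
    return alpha + (1.0 - alpha) * cos_theta

directional_gain([0, 0, 0], [2, 0, 0], axis=[1, 0, 0])   # on axis -> 1.0
directional_gain([0, 0, 0], [-2, 0, 0], axis=[1, 0, 0])  # behind  -> 0.0
```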
  • the audio apparatus may be arranged to generate the point audio source positions based on a spatial relationship between the audio source object and a reference viewing/ listening region.
  • the reference viewing/ listening region may specifically be a region to which the listener is constrained, or e.g. within which it is likely that the listener is positioned.
  • the position circuit 203 may be arranged to determine the set of point audio source positions in response to a two dimensional extent of the audio source object when viewed from a given region relative to a position of the audio source object. In some embodiments, the position circuit 203 may be arranged to determine the set of point audio source positions in response to a representation of the audio source object when viewed from a given listening region relative to a position of the audio source object.
  • the listener may be limited to a region of a virtual environment, and the audio apparatus may be aware of this region.
  • the position circuit 203 may determine the maximum visible angle of the audio source object with respect to one or more positions within the listener's region and it may only consider the dimensions where the viewable angle exceeds a given threshold. This viewable angle can also inform the audio apparatus of the number of point audio source positions that should be used to provide the desired perceived source width.
  • the data generator 207 may further be arranged to include an indication of a propagation time parameter for the set of point audio source positions.
  • the renderer may be arranged to render audio signals from the point audio source positions in response to the propagation time parameter. Specifically, in many embodiments, the renderer may be arranged to render the audio signals from all point audio source positions with the same propagation time, and with this propagation time being determined from the propagation time parameter.
  • Thus, the audio sources are generated to have the same propagation time.
  • the audio signals may be generated with other spatial cues reflecting the different positions of the point audio source positions. For example, differential delays between a right ear signal and a left ear signal (for HRTF processing) may reflect the actual position within the audio source object etc. Indeed, such an approach has been found to provide improved spatial perception in many scenarios and to in particular provide a perception of a cohesive audio source object with spatial extent.
  • the distance dependent delay that is used by the renderer to simulate the time taken for the audio to propagate from the audio source object to the listener may be controlled by additional metadata linked to point audio source positions.
  • the metadata may indicate that the time of flight for all sources should be considered to be the same as e.g. that of the nearest source to the listener or some other metric.
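  • A minimal sketch of applying such a common propagation time is shown below, here taking the time of flight from the nearest point source to the listener as the shared delay; the sample rate and speed of sound are assumed constants.

```python
import numpy as np

FS = 48000
SPEED_OF_SOUND = 343.0

def common_delay_samples(point_positions, listener_pos):
    """One shared propagation delay (in samples) for all points of the object,
    taken from the nearest point, as suggested by the metadata option above."""
    dists = np.linalg.norm(np.asarray(point_positions, float)
                           - np.asarray(listener_pos, float), axis=1)
    return int(round(float(dists.min()) / SPEED_OF_SOUND * FS))
```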
  • An extent of an audio source may be a spatial extension of an audio source.
  • An extent audio source may be an audio source having a spatial extent.
  • An extent audio source may be a spatially extended audio source.
  • An audio source having an extent may be referred to as an audio source having a spatial extension.
  • the invention can be implemented in any suitable form including hardware, software, firmware or any combination of these.
  • the invention may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors.
  • the elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.
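  • To make the preceding directional-gain and shared time-of-flight bullets concrete, the following Python sketch shows one way a renderer might evaluate such metadata. The cardioid polar pattern, the helper names, and the nearest-source time-of-flight rule are illustrative assumptions, not a format prescribed by any rendering standard.

```python
import math

def cardioid_gain(source_pos, source_axis, listener_pos):
    """Gain of a cardioid-like polar pattern: 1.0 on-axis, 0.0 directly behind.

    source_axis is the unit vector of the source's main radiation direction;
    the pattern itself is an illustrative choice of directional response.
    """
    diff = [l - s for l, s in zip(listener_pos, source_pos)]
    norm = math.sqrt(sum(d * d for d in diff)) or 1.0
    direction = [d / norm for d in diff]
    cos_theta = sum(a * b for a, b in zip(source_axis, direction))
    return 0.5 * (1.0 + cos_theta)

def propagation_delays(point_positions, listener_pos, speed_of_sound=343.0,
                       shared_time_of_flight=True):
    """Per-point propagation delays in seconds.

    With shared_time_of_flight set (as the metadata may indicate), every
    point is given the delay of the point nearest to the listener, so all
    point audio signals arrive coherently in time.
    """
    def dist(p):
        return math.dist(p, listener_pos)

    if shared_time_of_flight:
        nearest = min(point_positions, key=dist)
        return [dist(nearest) / speed_of_sound] * len(point_positions)
    return [dist(p) / speed_of_sound for p in point_positions]
```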

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

An audio apparatus comprises a receiver (201) receiving an audio signal and a geometric model for an audio source object. The geometric model may be a polygon mesh. A position circuit (203) determines a set of point audio source positions for the audio source object based on the geometric model such that the point audio source positions are spatially distributed within the audio source object. A point audio signal generator (205) generates a point audio source signal for the point audio source positions from the audio signal. In some cases, the point audio signal may be a copy of the received audio signal. A data generator (207) generates a data representation of the audio source object comprising the set of point audio source positions and the at least one point audio source signal. A renderer (301) may render the audio source object by rendering point audio source signals from the point audio source positions.

Description

    FIELD OF THE INVENTION
  • The invention relates to an apparatus and method of operation therefor, and in particular, but not exclusively, to an approach for generating a data representation of an audio source object in e.g. a Virtual Reality experience application.
  • BACKGROUND OF THE INVENTION
  • The variety and range of experiences based on audiovisual content have increased substantially in recent years with new services and ways of utilizing and consuming such content continuously being developed and introduced. In particular, many spatial and interactive services, applications and experiences are being developed to give users a more involved and immersive experience.
  • Examples of such applications are Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR) applications (commonly often referred to as eXtended Reality (XR)), which are rapidly becoming mainstream, with a number of solutions being aimed at the consumer market. A number of standards are also under development by a number of standardization bodies. Such standardization activities are actively developing standards for the various aspects of VR/AR/MR systems including e.g. streaming, broadcasting, rendering, etc.
  • VR applications tend to provide user experiences corresponding to the user being in a different world/ environment/ scene whereas AR (including Mixed Reality (MR)) applications tend to provide user experiences corresponding to the user being in the current environment but with additional information or virtual objects being added. Thus, VR applications tend to provide a fully immersive synthetically generated world/ scene whereas AR applications tend to provide a partially synthetic world/ scene which is overlaid on the real scene in which the user is physically present. However, the terms are often used interchangeably and have a high degree of overlap. In the following, the term eXtended Reality/ XR will be used to denote both Virtual Reality and Augmented/ Mixed Reality.
  • As an example, a service that is increasingly popular is the provision of images and audio in such a way that a user is able to actively and dynamically interact with the system to change parameters of the rendering such that these adapt to movement and changes in the user's position and orientation. A very appealing feature in many applications is the ability to change the effective viewing position and viewing direction of the viewer, such as for example allowing the viewer to move and "look around" in the scene being presented.
  • Such a feature can specifically allow a virtual reality experience to be provided to a user. This may allow the user to (relatively) freely move about in a virtual environment and dynamically change his position and where he is looking. Typically, such virtual reality applications are based on a three-dimensional model of the scene with the model being dynamically evaluated to provide the specific requested view. This approach is well known from e.g. game applications, such as in the category of first person shooters, for computers and consoles.
  • It is also desirable, in particular for virtual reality applications, that the image being presented is a three-dimensional image, typically presented using a stereoscopic display. Indeed, in order to optimize immersion of the viewer, it is typically preferred for the user to experience the presented scene as a three-dimensional scene. Indeed, a virtual reality experience should preferably allow a user to select his/her own position, viewpoint, and moment in time relative to a virtual world.
  • In addition to the visual rendering, most XR applications further provide a corresponding audio experience. In many applications, the audio preferably provides a spatial audio experience where audio sources are perceived to arrive from positions that correspond to the positions of the corresponding objects in the visual scene. Thus, the audio and video scenes are preferably perceived to be consistent and with both providing a full spatial experience.
  • For example, many immersive experiences are provided by a virtual audio scene being generated by headphone reproduction using binaural audio rendering technology. In many scenarios, such headphone reproduction may be based on headtracking such that the rendering can be made responsive to the user's head movements, which highly increases the sense of immersion.
  • An important feature for many applications is how to generate and/or distribute audio that can provide a natural and realistic perception of the audio environment.
  • A particular challenge is to represent audio sources that are not limited to a single point source, i.e. which have a spatial acoustic extension/ dimension.
  • In audio rendering the situation often occurs where one or more audio signals are meant to represent an object with large physical dimensions or a diffuse sound source. In traditional listening environments such as cinemas or home theatres, this is typically achieved by first converting the input signal into multiple output signals with varying levels of decorrelation, and then feeding those signals to individual loudspeakers or headphone channels so that they produce the perception of acoustic width, as e.g. described in US9654895B2. Varying the level of correlation between the differing signals can affect the size of the object as perceived by the listener. This method works well in controlled listening environments where the listener position, the sound transducers, and the simulated object position are all known and controlled for.
  • However, in XR applications, users may have free movement in all dimensions, typically referred to as 6 Degrees of Freedom (6DoF). With 6DoF the user is able to move freely during playback of the content, or in gaming during runtime of the application, and as such the content creator and any encoding algorithms do not know what the listening position may be at any moment in time, nor where it is relative to the sound producing objects.
  • In a typical 6DoF environment, an audio object with a given location is typically rendered to the listener by first calculating the object's relative distance from the user and the direction from the listener to the object (e.g. as azimuth and elevation). For headphone applications, the audio signal associated with the object is then convolved with the matching Head Related Impulse Response (HRIR) or Head Related Transfer Function (HRTF). The resulting (stereo) signal is presented to the listener via headphones with the corresponding distance related time delay and level attenuation. Using this method will accurately represent an object as a point source, with all sound emanating from a single point in space.
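  • As a rough illustration of this standard point-source rendering chain, the following Python sketch convolves a mono signal with an HRIR pair and applies the distance-related delay and level attenuation. Here hrir_lookup is a hypothetical interface to an HRIR database (a real database, e.g. a SOFA file, would be queried through its own API), and the 1/r attenuation law and sample-rounded delay are simplifying assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_point_source(signal, azimuth, elevation, distance, fs,
                        hrir_lookup, ref_distance=1.0, c=343.0):
    """Binaurally render a mono signal as a point source.

    hrir_lookup(azimuth, elevation) -> (left_hrir, right_hrir) is assumed
    to return the impulse responses matching the source direction.
    """
    hrir_l, hrir_r = hrir_lookup(azimuth, elevation)
    # Distance-related level attenuation: simple 1/r law relative to a
    # reference distance (an illustrative choice).
    gain = ref_distance / max(distance, 1e-3)
    # Distance-related time delay, rounded to whole samples for simplicity.
    delay = int(round(fs * distance / c))
    out_l = fftconvolve(signal * gain, hrir_l)
    out_r = fftconvolve(signal * gain, hrir_r)
    pad = np.zeros(delay)
    return np.concatenate([pad, out_l]), np.concatenate([pad, out_r])
```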
  • Often sound should not emanate from a zero-dimensional point in space; rather it may emanate from a 1-dimensional (line), 2-dimensional (plane), or 3-dimensional (solid) audio object that should radiate energy uniformly from all points. When an object is used in this way, it is referred to in the field as the extent of the audio source. An extent audio source has a spatial extension. An extent audio source is a non-single-point audio source. When describing a virtual environment, an object may be described by using a simple geometric object (line, plane, box, sphere, cone, cylinder, etc.) or often by definition of a mesh consisting of vertices and faces. In particular, for more complex objects and environments, the spatial data is often represented by complex mesh structures which may comprise a large number of polygons, vertices, edges, and faces.
  • The relative size of the extent of the audio source with respect to the listener is constantly changing as the listener moves within the 6DoF environment, and as such the way in which it is rendered needs to constantly be adapted. As the sound source changes with respect to the user position, either through user motion or animation of the sound source, the perceived width of the source also needs to be adapted.
  • In order to represent such effects, the existing rendering methods require the calculation of the perceived source width, adapting the correlation and potentially other parameters of the audio signals associated with the source and any metadata describing the source, and applying these new correlation levels and parameters to the audio signals. This requires the audio rendering technology to have detailed information relating to the object representing the acoustic extent, and to calculate relative perceived widths in real time.
  • However, such processing and calculations tend to be very complex and resource demanding, and the rendering approaches will typically involve a compromise between complexity, resource demands, and the resulting audio quality (and specifically the spatial perception). For example, for a complex object described by a mesh representation, this may involve many thousands of vertices and faces in the description of the object, which results in very resource demanding calculations being required to adapt the rendering to provide an acceptable audio perception.
  • Hence, an improved approach for rendering audio would be advantageous. In particular, an approach that allows improved operation, increased flexibility, reduced complexity, facilitated implementation, an improved audio experience, improved audio quality, reduced computational burden, improved suitability for varying positions, improved performance for virtual/ mixed/ augmented/ extended reality applications, improved perceptual cues for spatial audio, increased and/or facilitated adaptability, increased processing flexibility, improved and/or facilitated rendering of the spatial extent of an audio source and/or improved performance and/or operation would be advantageous.
  • SUMMARY OF THE INVENTION
  • Accordingly, the Invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.
  • According to an aspect of the invention there is provided an audio apparatus comprising: a receiver arranged to receive an audio signal and a geometric model for an audio source object; a position circuit arranged to determine a set of point audio source positions for the audio source object in response to the geometric model, the point audio source positions being spatially distributed within the audio source object; a point audio signal generator arranged to generate at least one point audio source signal for the point audio source positions from the audio signal; and a data generator arranged to generate a data representation of the audio source object comprising the set of point audio source positions and the at least one point audio source signal.
  • The invention may allow improved and/or facilitated rendering of audio for an audio source object having a spatial extent. The invention may in many embodiments and scenarios generate a more naturally perceived acoustic spatial extent of an audio source object. The approach may in many scenarios reduce computational resource requirements and usage substantially. In many embodiments, the approach may obviate the need for evaluating a geometric model of the audio source object for changing listening positions.
  • The approach may typically provide improved and/or reduced complexity processing in applications where listening positions may change dynamically, such as e.g. in many XR applications.
  • The generated data representation may be independent of the listening position. Rendering of audio based on the data representation may typically not require any processing/ evaluation/ adaptation of a complex geometric model, such as e.g. a mesh model.
  • The rendering of audio output signals for the audio source object based on the generated data representation may often require reduced complexity and reduced computational resource requirements in comparison to a rendering of the same quality based directly on the audio signal and the geometric model.
  • The at least one point audio signal may for a given point audio source position represent audio of a point audio source located at the point audio source position.
  • The audio source object may be an audio source of a scene or environment. The scene or environment may be a virtual or real scene. The audio source object may represent audio for a virtual or real (scene) object.
  • The point audio source positions may be spread out in the audio source object, or in at least one region of the audio source object. The position circuit 203 may generate spatially distributed point audio source positions by determining the point audio source positions to have a minimum distance to a closest neighbor point audio source position. The minimum distance may be predetermined or may be determined in response to a spatial property of the audio source object.
  • In accordance with an optional feature of the invention, the position circuit is arranged to determine the point audio source positions such that a distance from any position of the set of point audio source positions to a surface of the audio source object does not exceed a distance threshold.
  • This may provide improved performance and/or reduced complexity/ resource demand. In many scenarios, it may allow fewer point audio source positions to be required for a given perceived spatial sound quality.
  • The surface may be a boundary or edge of the audio source object. The distance threshold may be a predetermined distance, or may be dependent on a spatial property of the audio source object, such as a size or maximum dimension of the audio source object.
  • In some embodiments, the position circuit may be arranged to determine the point audio source positions such that a distance from any position of the set of point audio source positions to a nearest point on a surface of the audio source object does not exceed a distance threshold.
  • In accordance with an optional feature of the invention, the position circuit is arranged to determine a set of intersect positions of a grid and to select the set of point audio source positions from the intersect positions.
  • This may typically provide a low complexity yet high quality determination of point audio source positions. It may in many scenarios allow an improved spatial perception of the audio source object.
  • In accordance with an optional feature of the invention, the position circuit is arranged to determine the set of point audio source positions to satisfy at least one requirement of: a requirement that a maximum distance between each position of the set of point audio source positions and a nearest position of the set of point audio source positions is less than a first distance threshold; a requirement that a minimum distance between each position of the set of point audio source positions and a nearest position of the set of point audio source positions is more than a second distance threshold; a requirement that a number of points of the set of point audio source positions does not exceed a first number; a requirement that a number of points of the set of point audio source positions is not below a second number; and a requirement that a maximum distance from any point of a surface of the audio source object to a nearest point of the set of point audio source positions is less than a third distance threshold.
  • This may provide improved performance and/or reduced complexity/ resource demand. It may allow point audio source positions to be determined which may ensure that a sufficient perception of the spatial extent of the audio source object is provided.
  • In accordance with an optional feature of the invention, the point audio signal generator is arranged to determine at least one audio level for the set of point audio source positions in response to an audio level for the audio source object and a number of positions in the set of point audio source positions.
  • This may allow an improved and typically more realistic spatial perception of the audio source object. It may reduce perceived distortion to the audio source object. The audio level for the audio source object may be a desired/ target audio level for the audio source object.
  • In accordance with an optional feature of the invention, the point audio signal generator is arranged to determine the at least one audio level in response to positions in the set of point audio source positions.
  • This may allow an improved and typically more realistic spatial perception of the audio source object. It may reduce perceived distortion to the audio source object. The audio level for the audio source object may be a desired/ target audio level for the audio source object.
  • In accordance with an optional feature of the invention, the data generator is arranged to generate the data representation to include a relative render priority for the set of point audio source positions.
  • An improved and typically more flexible operation can be achieved. For example, it may allow improved adaptation to resource availability of different renderers. It may in many scenarios assist in reducing the perceived impact of limited rendering resource. The relative render priority for one point audio source position may indicate a rendering priority relative to other point audio source positions.
  • In accordance with an optional feature of the invention, the data generator is arranged to generate the data representation to include directional sound propagation data for at least one position of the set of point audio source positions.
  • This may allow improved and/or more flexible rendering of the audio source object.
  • In accordance with an optional feature of the invention, the position circuit is arranged to determine the set of point audio source positions in response to a two dimensional extent of the audio source object when viewed from a given region relative to a position of the audio source object.
  • This may allow improved and/or facilitated operation in many embodiments.
  • In accordance with an optional feature of the invention, the position circuit is arranged to determine a number of dimensions for which the audio source object has an extent exceeding a threshold, and to generate the set of point audio source positions as a structure having the number of dimensions.
  • This may allow improved and/or facilitated operation in many embodiments.
  • In accordance with an optional feature of the invention, the data generator is arranged to generate the data representation to include an indication of a propagation time parameter for the set of point audio source positions.
  • This may allow increased perceived audio quality in many scenarios. It may in particular in many scenarios allow a more consistent perception of the audio source object.
  • In accordance with an optional feature of the invention, the audio apparatus comprises a renderer arranged to render the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  • The approach may in many scenarios allow an improved and/or facilitated rendering of an audio source object having a spatial extent.
  • In accordance with an optional feature of the invention, the audio apparatus further comprises an encoder for generating an encoded bitstream comprising the data representation of the audio source object.
  • The approach may in many scenarios allow an improved encoded bitstream representing an audio source object having a spatial extent to be generated.
  • According to an aspect of the invention there is provided an audio apparatus comprising: a receiver for receiving an audio bitstream comprising a data representation of an audio source object having a spatial extent, the data representation comprising: a set of point audio source positions being distributed within the audio source object, and at least one point audio source signal for the point audio source positions; and a renderer arranged to render the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  • According to an aspect of the invention there is provided a method of operation for an audio apparatus, the method comprising: receiving an audio signal and a geometric model for an audio source object; determining a set of point audio source positions for the audio source object in response to the geometric model, the point audio source positions being spatially distributed within the audio source object; generating at least one point audio source signal for the point audio source positions from the audio signal; and generating a data representation of the audio source object comprising the set of point audio source positions and the at least one point audio source signal.
  • According to an aspect of the invention there is provided a method of operation for an audio apparatus, the method comprising: receiving an audio bitstream comprising a data representation of an audio source object having a spatial extent, the data representation comprising: a set of point audio source positions being distributed within the audio source object, and at least one point audio source signal for the point audio source positions; and rendering the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  • According to an aspect of the invention there is provided an audio bitstream comprising a data representation of an audio source object having a spatial extent; the data representation comprising: a set of point audio source positions being distributed within the audio source object; and at least one point audio source signal for the point audio source positions.
  • These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which
    • FIG. 1 illustrates an example of elements of an eXtended Reality system;
    • FIG. 2 illustrates an example of an audio apparatus in accordance with some embodiments of the invention;
    • FIG. 3 illustrates an example of an encoder audio apparatus in accordance with some embodiments of the invention;
    • FIG. 4 illustrates an example of a renderer audio apparatus in accordance with some embodiments of the invention;
    • FIG. 5 illustrates an example of an audio object and point audio source signals;
    • FIG. 6 illustrates an example of an audio object and point audio source signals; and
    • FIG. 7 illustrates an example of an audio object and point audio source signals.
    DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION
  • The following description will focus on audio processing and rendering for an eXtended Reality (XR) application, such as for a Virtual Reality (VR), Augmented Reality (AR), or Mixed Reality (MR) application. The described approach will focus on such applications where audio rendering is adapted to reflect acoustic variations and changes in the audio perception as a (possibly virtual) user / listener position changes. However, it will be appreciated that the described principles and concepts may be used in many other applications and embodiments.
  • Semi or fully virtual experiences allowing a user to move around in a (possibly partially) virtual world are becoming increasingly popular and services are being developed to satisfy such a demand.
  • In some systems, the XR application may be provided locally to a viewer by e.g. a stand-alone device that does not use, or even have any access to, any remote XR data or processing. For example, a device such as a games console may comprise a store for storing the scene data, input for receiving/ generating the viewer pose, and a processor for generating the corresponding images from the scene data.
  • In other systems, the XR application may be implemented and performed remote from the viewer. For example, a device local to the user may detect/ receive movement/ pose data which is transmitted to a remote device that processes the data to generate the viewer pose. The remote device may then generate suitable view images and corresponding audio signals for the user pose based on scene data describing the scene. The view images and corresponding audio signals are then transmitted to the device local to the viewer where they are presented. For example, the remote device may directly generate a video stream (typically a stereo/ 3D video stream) and corresponding audio stream which is directly presented by the local device. Thus, in such an example, the local device may not perform any XR processing except for transmitting movement data and presenting received video data.
  • In many systems, the functionality may be distributed across a local device and remote device. For example, the local device may process received input and sensor data to generate user poses that are continuously transmitted to the remote XR device. The remote XR device may then generate the corresponding view images and corresponding audio signals and transmit these to the local device for presentation. In other systems, the remote XR device may not directly generate the view images and corresponding audio signals but may select relevant scene data and transmit this to the local device, which may then generate the view images and corresponding audio signals that are presented. For example, the remote XR device may identify the closest capture point and extract the corresponding scene data (e.g. a set of object sources and their position metadata) and transmit this to the local device. The local device may then process the received scene data to generate the images and audio signals for the specific, current user pose. The user pose will typically correspond to the head pose, and references to the user pose may typically equivalently be considered to correspond to the references to the head pose.
  • In many applications, especially for broadcast services, a source may transmit or stream scene data in the form of an image (including video) and audio representation of the scene which is independent of the user pose. For example, signals and metadata corresponding to audio sources within the confines of a certain virtual room may be transmitted or streamed to a plurality of clients. The individual clients may then locally synthesize audio signals corresponding to the current user pose. Similarly, the source may transmit a general description of the audio environment including describing audio sources in the environment and acoustic characteristics of the environment. An audio representation may then be generated locally and presented to the user, for example using binaural rendering and processing.
  • FIG. 1 illustrates such an example of an XR system in which a remote XR client device 101 liaises with an XR server 103 e.g. via a network 105, such as the Internet. The server 103 may be arranged to simultaneously support a potentially large number of client devices 101.
  • The XR server 103 may for example support a broadcast experience by transmitting an image signal comprising an image representation in the form of image data that can be used by the client devices to locally synthesize view images corresponding to the appropriate user poses (a pose refers to a position and/or orientation). Similarly, the XR server 103 may transmit an audio representation of the scene allowing the audio to be locally synthesized for the user poses. Specifically, as the user moves around in the virtual environment, the image and audio synthesized and presented to the user is updated to reflect the current (virtual) position and orientation of the user in the (virtual) environment.
  • In many applications, such as that of FIG. 1, it may thus be desirable to model a scene and generate an efficient image and audio representation that can be efficiently included in a data signal that can then be transmitted or streamed to various devices which can locally synthesize views and audio for different poses than the capture poses.
  • The audio representation may include an audio representation for a plurality of different sound sources in the environment. This may include some sound sources corresponding to general diffuse and non-localized sound, such as ambient and background sounds for the environment as a whole. It will typically also comprise a number of audio representations for point-size audio sources to be rendered from specific positions in the scene. However, in addition, the sound representation may include a number of audio sources that have a spatial extent (which in many cases may equivalently be referred to as having spatial extension), and which should preferably be rendered such that a user is provided with a perception of the spatial extent and perceived dimension (often width) of the sound source.
  • Typically, such extent audio sources are represented by a geometric model that describes the spatial properties of the audio source object. Thus, audio data together with data describing a geometric model may provide a representation of an extent audio source object.
  • The geometric model may typically be a model that may also be used to describe visual properties of the object corresponding to the audio source object and the same geometric model and data may be used to represent spatial properties of the object for both audio and visual properties and rendering.
  • An often used approach to represent objects, and the scene in general, is a mesh. Specifically, a polygon mesh may be a collection of vertices, edges and faces that defines the shape of a polyhedral object. For visual properties, texture data, color, brightness, and possibly other visual properties may be provided for the polygons. The visual rendering of such an object typically involves processing the mesh and applying the visual properties for the given viewing pose, as is well known in the art.
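  • For illustration, a polygon mesh in this sense can be as simple as a list of vertex coordinates together with faces given as indices into that list; the unit cube below is a toy example (visual attributes such as texture and color are omitted), not a representation mandated by any particular format.

```python
# An axis-aligned unit cube as 8 vertices and 12 triangular faces.
# Each face is a triple of indices into the vertex list.
cube_vertices = [
    (0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
    (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1),
]
cube_faces = [
    (0, 1, 2), (0, 2, 3),  # bottom
    (4, 6, 5), (4, 7, 6),  # top
    (0, 5, 1), (0, 4, 5),  # front
    (1, 6, 2), (1, 5, 6),  # right
    (2, 7, 3), (2, 6, 7),  # back
    (3, 4, 0), (3, 7, 4),  # left
]
```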
  • When rendering audio for an audio source object having a spatial extent, a renderer may process the geometric model to determine spatial characteristics that are then used to adapt properties of the rendered audio such as diffusion, correlation etc. However, such a determination of spatial properties and the associated signal properties tends to be very complex and resource demanding. For example, evaluating complex mesh models to determine how the perceived spatial width of an object varies as the user moves may require millions of operations and calculations, which may require substantial computational resources. Indeed, in many cases it may even require dedicated hardware in order to allow real time processing and adaptation.
  • In the following, an approach will be described which may generate and use a representation of audio source objects with spatial extent that in many situations, scenarios, and applications may provide improved and/or facilitated operation. It may for example, reduce complexity and may provide an audio representation that does not require the complex processing of polygon meshes or other geometric models.
  • FIG. 2 illustrates an audio apparatus that can generate a data representation of an audio source object which has a spatial extent. The audio apparatus may for example be part of the server 103 or the client 101 of FIG. 1.
  • The audio apparatus comprises a receiver 201 which is arranged to receive at least one audio signal for an audio source object. In addition, the receiver receives a geometric model which provides a description of the spatial properties of the audio source object.
  • The geometric model may specifically be a polygon mesh description of the audio source object. Such models are frequently used in the field to represent spatial properties for objects in a real or virtual environment. It may typically allow accurate representations and is frequently used in e.g. computer vision. In other embodiments, other geometric models may be used. For example, the geometric model may be defined as a simple 3D object such as a cuboid, sphere, cylinder, etc.
  • The audio signal for the audio source object may be provided as an encoded audio data signal describing the audio to be produced by the audio source object. In some cases, only a single audio signal may be provided to represent the audio to be produced by the audio source object. However, in some scenarios, more than one audio signal may be provided for one audio source object. For example, an audio signal may be provided to represent audio originating from one part (e.g. the left part) of the audio source object and a second audio signal may be provided to represent audio originating from a different part (e.g. the right part) of the audio source object. Such audio signals may possibly be closely correlated but differ in some aspects, such as e.g. by having different levels or frequency spectra (filtering), or by including different signal components etc. In some cases, the audio signals may be provided as completely different audio signals and in other cases they may be provided as a first signal and data describing the difference of the second audio signal, such as a filter or level adjustment to be applied to the first signal.
  • The audio apparatus further comprises a position circuit 203 which is arranged to determine a set of point audio source positions for the audio source object. The position circuit determines the point audio sources as spatially distributed positions within the audio source object.
  • The audio apparatus further comprises a point audio signal generator 205 which is arranged to generate at least one point audio source signal for the point audio source positions. The point audio signal generator 205 may in some embodiments simply generate the point audio source signal as the first audio signal, i.e. in some embodiments and scenarios, the point audio signal may directly be the same as the audio signal received and representing the audio of the audio source object. In other embodiments, the point audio signal generator 205 may be arranged to generate the point audio source signal by processing the received first audio signal, such as for example by applying a filtering or level adaptation. Each point audio source position of the set of point audio source positions indicates a position for a single point audio source of a set of single point audio sources which together produce the audio of the audio source object.
  • The audio apparatus also comprises a data generator 207 which is coupled to the position circuit 203 and the point audio signal generator 205. The data generator 207 is arranged to generate a data representation of the audio source object comprising the set of point audio source positions and the (at least one) point audio source signal. Thus, the audio apparatus of FIG. 2 may receive a representation of a spatial extent audio source object given by an audio signal representing the audio of the audio source object and a geometric model for the spatial extent of the audio source object, and from this generate a different representation of the spatially extended audio source object as a plurality of spatially distributed point audio source positions and associated point audio source signal(s). Rather than representing the audio source object as a single distributed audio source, the approach may represent it as a plurality of point audio sources, with each point audio source position corresponding to the position of one point audio source of this plurality.
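  • One possible shape for such a data representation is sketched below; the field names and types are illustrative assumptions rather than any defined syntax, and the optional fields anticipate the metadata extensions (per-point levels, render priorities, propagation time handling) discussed elsewhere in this description.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

@dataclass
class PointSourceRepresentation:
    """Sketch of the data representation produced by the data generator."""
    positions: List[Tuple[float, float, float]]  # point audio source positions
    signals: List[Sequence[float]]               # at least one point audio source signal
    signal_index: List[int]                      # which signal each position uses
    gains: Optional[List[float]] = None          # optional per-point level data
    render_priority: Optional[List[int]] = None  # optional relative render priority
    shared_time_of_flight: bool = False          # optional propagation time indication
```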
  • A rendering of the audio source object may be based on this modified data representation, and specifically the rendering may be performed by rendering a point source audio signal from each point audio source position based on the point audio source signal. Thus, the rendering may for each point audio source position generate one rendered audio signal corresponding to a point audio source being positioned at the point audio source position. The individual rendering for each audio source position may thus be performed without considering any spatial properties of the audio source object (except for the point audio source position). The individually rendered audio signals for the different point audio source positions may then be combined such that the combined signal of the different positions represents the audio source object.
  • Although such an approach does not generate individual signals that reflect the spatial extent of the audio source object, instead generating multiple audio signals that each represent a point audio source, the Inventor has realized that this in practice tends to provide spatial cues that give a perception of the spatial extent of the audio source object. Indeed, it has been found that the perceived spatial properties may often very closely reflect those of the audio source object.
  • The approach may also in many cases provide a computationally efficient process for representing the spatial extent of audio objects. In particular, it may require only point source audio rendering and may obviate the need for evaluating a complex geometric model with respect to the current listening pose. Rather, the geometric model may in many embodiments only be evaluated when generating the point audio source positions for the new representation of the audio source object. Such an evaluation is typically performed only once, as it is independent of the listening position, whereas a traditional approach requires the model to be evaluated for each new listening position. Thus, whereas rendering multiple point source audio signals may sometimes require more complexity than rendering fewer audio signals representing distributed audio sources, the overall computational reduction is typically very significant. This is especially the case for complex geometric models and shapes, such as specifically mesh representations of complex shapes.
  • Further, in many embodiments, the generation of the modified data representation may be performed once for multiple rendering devices and operations, such as for example by a central server serving multiple clients.
  • In many embodiments, the approach may provide improved audio quality. For example, a reduced complexity may allow more resource to be allocated to more accurate rendering (including potentially of other audio sources in the environment). Further, in itself, the perception provided by multiple point sources may often be considered more accurate than that which is achieved by traditional approaches.
  • In the approach an audio source object to be rendered as emanating from a geometric extent object may be rendered with the spatial extension of the audio source object being represented using a plurality of individual audio point sources.
  • In some embodiments, the audio apparatus of FIG. 2 may for example be included in a rendering device such as for example the client 101 of FIG. 1. For example, the server 103 may generate an XR audio visual data stream which includes a mesh model for various objects in the environment/ scene. At least one of these objects may correspond to an audio object with spatial extent for which the audio visual data stream further comprises audio data.
  • The XR audio visual data stream may be transmitted to the client 101 which in the example includes the audio apparatus 200 of FIG. 2 as illustrated in the example of FIG. 3.
  • The audio apparatus 200 may perform the described operation to generate a modified representation of the spatial extent audio source object by a plurality of distributed point audio sources and associated point audio source data.
  • The client 101 may include a renderer 301 which is arranged to render audio for the scene. Specifically, the renderer 301 is arranged to include functionality for spatially rendering audio from point sources. It will be appreciated that many algorithms and operations for such rendering are known to the skilled person, including for example HRTF and BRIR based rendering for headphones, and that these for brevity will not be described further herein.
  • The renderer 301 may specifically receive a listener pose and for that pose render a point source audio signal from each point audio source generated for the audio source object. Thus, rather than the audio for the spatial extent audio source object being rendered as a single diffuse signal, it is rendered as a plurality of point source audio signals from different positions within the audio source object. The audio signal rendered for each point audio source position may be rendered independently of any audio signal rendering for any other point audio source position. The rendering of one point source audio signal may thus be based only on the point source audio signal associated with the point audio source position and on the point audio source position itself. Each point audio signal may be rendered independently of other point audio source positions and point audio signals. The generated point audio signals may then be combined for rendering to the user, e.g. via loudspeakers or headphones.
  • Thus, each rendering may be performed separately, taking into account only the point audio source position and the listening position/ pose. No consideration of the geometric extent of the audio source object is needed for the rendering. Thus, although the approach requires multiple audio signals to be rendered from different positions for each audio source object, each rendering may use well-known and efficient rendering algorithms and approaches. Further, no evaluation of a geometric model is required for each listening position.
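  • A minimal sketch of this per-point rendering loop is given below; render_point stands for any standard point-source renderer (such as the HRTF sketch earlier), and the pad-and-sum combination of the binaural outputs is one straightforward, non-authoritative choice.

```python
import numpy as np

def render_extent_object(point_positions, point_signals, listener_pose,
                         render_point, fs):
    """Render an extent audio source object from its point-source
    representation: each point is rendered independently and the binaural
    outputs are summed (padded to a common length)."""
    left = right = None
    for pos, sig in zip(point_positions, point_signals):
        l, r = render_point(sig, pos, listener_pose, fs)
        if left is None:
            left, right = l.copy(), r.copy()
        else:
            n = max(len(left), len(l))
            left = np.pad(left, (0, n - len(left))) + np.pad(l, (0, n - len(l)))
            right = np.pad(right, (0, n - len(right))) + np.pad(r, (0, n - len(r)))
    return left, right
```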
  • Not only may such an approach provide a very efficient audio processing and rendering but it has also been found that it provides highly advantageous and realistic spatial perception of the audio in many scenarios and embodiments. It may typically provide very accurate perception of the spatial extent of an audio source as perceived from the listening position.
  • The approach may represent an extent object as a number of discrete points, and then the standard object rendering method may be used to give the perception of source width without the need for decorrelation of the audio signals or for calculations of perceived width and decorrelation metrics, or for the storage of object surface descriptions.
  • In some embodiments, the audio apparatus may be part of an audio visual bitstream generator, such as for example an encoder or server. For example, the audio apparatus 200 may in some embodiments be part of the server 103 of FIG. 1.
  • In such a case the server 103 may receive or generate an audio visual data that includes a mesh model for various objects in an environment/ scene. At least one of these objects may correspond to a spatial extent audio object for which audio data is also provided/ generated.
  • Such data may be received by the audio apparatus 200 of the server 103 which may include elements as illustrated in FIG. 4. The audio apparatus of the server 103 may proceed to perform the described operations to generate a representation of the audio object using a plurality of point audio source positions and associated audio signal(s). This representation may then be provided to an encoding unit 401 arranged to generate an encoded bitstream comprising the data representation of the audio source object.
  • The encoder may encode the point audio source positions and point audio signals in any suitable form, and it will be appreciated that many approaches for encoding audio visual signals are known and appropriate. For example, the audio signals may be encoded using a known audio encoding algorithm and format, and the point audio source positions may be encoded as metadata associated with the individual point audio source signal.
  • Thus, the encoder, and in the specific example the server 103, may generate an audio bitstream that comprises a data representation of an audio source object having a spatial extent, with the data representation comprising: a set of point audio source positions being distributed within the audio source object, and at least one point audio source signal for the point audio source positions, generated from the first audio signal.
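  • Purely as an illustration of how such positions and parameters might travel as metadata alongside conventionally coded audio, the sketch below serializes the PointSourceRepresentation sketched earlier to a JSON blob; an actual encoder or standard would define its own bitstream syntax, and the audio signals themselves would be carried by a conventional audio codec.

```python
import json

def encode_point_set_metadata(rep):
    """Serialize the point-source metadata of a PointSourceRepresentation
    (see the earlier dataclass sketch) to a byte blob."""
    return json.dumps({
        "object_type": "extent_point_set",
        "positions": [list(p) for p in rep.positions],
        "signal_index": list(rep.signal_index),
        "gains": rep.gains,
        "render_priority": rep.render_priority,
        "shared_time_of_flight": rep.shared_time_of_flight,
    }).encode("utf-8")
```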
  • An advantage of the approach is in many embodiments that the point source based representation can be generated without any consideration of the specific listening position, i.e. the representation can be listening position independent.
  • Different approaches, principles, rules, and requirements may be used to generate the spatially distributed point audio source positions in different embodiments and scenarios.
  • In many embodiments, the position circuit 203 may be arranged to determine a set of intersect positions of a grid and to select the set of point audio source positions from the intersect positions. The grid may in many embodiments be a regular grid but may in some embodiments be an irregular or unstructured grid (such as e.g. a Ruppert's grid).
  • A regular grid may be a tessellation of n-dimensional space by parallelotopes and an irregular grid may be based on other shapes than parallelotopes. The intersect points may be corners/ vertices of the parallelotopes/ shapes. The intersect positions may be positions where lines or curves describing or defining the tessellation intersect. The grid may specifically be a Cartesian grid. In many embodiments, the grid may be a one, two, or three dimensional grid of Euclidean space.
  • In many embodiments, the point audio source positions may be determined as equidistant positions. The distance from a point audio source position to a nearest neighbor point audio source position may in some embodiments be constant for different point audio source positions.
  • As an example, in some embodiments, the position circuit 203 may align a predetermined regular grid with equidistant positions/ intersection points to the audio source object, and then evaluate the geometric model to identify the positions/ intersections that fall within the audio source object. These positions may then be selected as the point audio source positions for the audio source object. The alignment between the audio source object and the grid may in some embodiments be in accordance with a suitable algorithm or criterion (e.g. a reference position of the grid may be aligned with the lowest or highest point of the audio source object, or a reference position of the grid may be positioned within the object, such as at its center). In most embodiments, an arbitrary alignment may be applied between the object and the grid (this may in particular be useful for embodiments where the distance between grid positions is much smaller than the spatial extent of the audio source object).
  • An example of such an approach is illustrated in FIG. 5. In the example, the audio source object is an elongated object in which only one row/ column of positions falls within the object. FIG. 6 illustrates an example for an object that has a significant extent in two directions but not in a third direction.
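  • As a non-authoritative illustration of this grid-based selection, the following Python sketch lays a regular Cartesian grid over an object's bounding box and keeps the intersection points that fall inside the object. Here contains() is a hypothetical point-in-object query against the geometric model (e.g. a point-in-mesh test), and bounds is the axis-aligned bounding box.

```python
import itertools

def grid_point_positions(contains, bounds, spacing):
    """Select point audio source positions as the intersections of a
    regular Cartesian grid that lie inside the object.

    contains(p) -> bool tests a candidate position against the geometric
    model; bounds = ((xmin, xmax), (ymin, ymax), (zmin, zmax)).
    """
    axes = []
    for lo, hi in bounds:
        n = int((hi - lo) / spacing) + 1
        axes.append([lo + i * spacing for i in range(n)])
    return [p for p in itertools.product(*axes) if contains(p)]
```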
  • In some embodiments, the position circuit 203 may be arranged to determine a number of dimensions for which the audio source object has a perceived extent exceeding a threshold. For example, for the audio source object of FIG. 5, the position circuit 203 may determine that the audio source object has a significant extent in only one direction, and thus that it is essentially a one dimensional object. For the object of FIG. 6, the position circuit 203 may determine that it is an object with significant spatial extent in two directions and thus essentially is a two dimensional object.
  • The position circuit 203 may then generate the set of point audio source positions as a structure having the same number of dimensions as determined by the position circuit 203. For example, a grid having the same number of dimensions as determined by the position circuit 203 may be used to determine the point audio source positions.
  • As a specific example, the audio apparatus may proceed through the following steps (a sketch in code of these steps is given after the list):
    1. Given any n-dimensional geometric object, the audio apparatus may first check whether the dimensions are over a given threshold for rendering, reducing the object to the minimum number of dimensions required for rendering.
    2. If the resulting object is 0-dimensional (i.e. a point source) then no further processing is required.
    3. For a 1-dimensional (line) object, the line is subdivided into a number of discrete points, and each point is stored as a set of coordinates (e.g. corresponding to the example of FIG. 5). For 2 or 3 dimensional objects each dimension is subdivided into a number of discrete points and a grid of points is created to cover the object uniformly (e.g. corresponding to the example of FIG. 6).
    4. All points are checked to ensure that they exist within or on the object surface and any points outside of the object are discarded.
    5. The points are stored as a set of metadata defining the position and the corresponding audio signal to be associated with them, as well as possibly an audio presentation level reduction based on the number of additional sources that have been added.
    6. The metadata is presented to the audio rendering technology, which renders each object as a discrete point source, providing a perception of acoustic width.
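  • The following Python sketch condenses steps 1 to 5 under simplifying assumptions: the object is treated through its axis-aligned bounding box with per-dimension extents already known, contains() and grid_point_positions() from the earlier sketch are reused, and the level reduction convention (20·log10 N dB for N coherent signal copies) is one possible choice rather than a prescribed one.

```python
import math

def point_set_for_object(extents, contains, bounds, spacing, min_extent):
    """Steps 1-5 in miniature: drop dimensions whose extent is below the
    rendering threshold, grid the remaining dimensions, discard points
    outside the object, and attach a level reduction for the added sources."""
    active = [d for d, e in enumerate(extents) if e >= min_extent]
    if not active:  # step 2: effectively a point source
        centre = tuple((lo + hi) / 2 for lo, hi in bounds)
        return [centre], 0.0
    # Collapse insignificant dimensions to the object centre before gridding.
    grid_bounds = [bounds[d] if d in active else
                   ((bounds[d][0] + bounds[d][1]) / 2,) * 2
                   for d in range(len(bounds))]
    points = grid_point_positions(contains, grid_bounds, spacing)
    # Step 5: one convention for the presentation level reduction, assuming
    # N coherent copies of the same signal sum by amplitude.
    level_reduction_db = 20 * math.log10(len(points)) if points else 0.0
    return points, level_reduction_db
```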
  • In some embodiments, the position circuit 203 is arranged to determine the point audio source positions such that a distance from any of these positions to a boundary or surface of the audio source object is less than a given distance threshold. Thus, it may be required that a position is only included if it is sufficiently close to the surface of the object.
  • In such an example, the audio apparatus may check that the point audio source positions are located within a prescribed distance of the surface of the audio source object, and it may remove any positions that are in the center of the object.
  • This may for some objects substantially reduce the required processing for the rendering as the number of point audio sources that are rendered is reduced. However, in many scenarios, the spatial perception, and in particular the perception of the extent of the object, will not be substantially affected by the internal audio sources being excluded. Thus, an improved trade-off between complexity/ resource demand and perceived audio quality can be achieved.
  • In many embodiments, the position circuit 203 may be arranged to determine the point audio source positions to satisfy one or more requirements.
  • For example, in some embodiments, point audio source positions may be determined as random positions, and these may then be evaluated to see if they meet a set of one or more requirements. For example, an iterative operation may be performed where a new point audio source position is randomly generated at each iteration. The random point audio source position is then evaluated in accordance with the set of requirements, and if all are met, the point audio source position is stored as part of the data representation for the audio source object, and otherwise it is discarded. The process may continue until e.g. a given number of point audio source positions have been determined (with the number potentially being determined based on a spatial property of the audio source object, such as based on a size or volume of the audio source object).
  • The position circuit 203 may be arranged to determine the set of point audio source positions to satisfy one, more or all of the following:
    • A requirement that a maximum distance between each position of the set of point audio source positions and a nearest position of the set of point audio source positions is less than a first distance. The position circuit 203 may determine the audio source positions such that the distance from each point to its nearest neighboring point (possibly considering only points that meet the other requirements) is less than a given threshold. Thus, a given density of points may be ensured and a more homogenous perception of the entire audio source object can often be achieved.
    • A requirement that a minimum distance between each position of the set of point audio source positions and a nearest position of the set of point audio source positions is more than a second distance. The position circuit 203 may determine the audio source positions such that the minimum distance between two neighboring points (and possibly of two neighboring points that meet the other requirements) is more than a given threshold. Thus, the position circuit 203 may determine the point audio source positions such that the points are not too close together, and thus such that they may represent a larger region with fewer point audio source positions.
    • A requirement that a number of points of the set of point audio source positions does not exceed a given number. This may limit the total number of point audio source positions and thus may ensure that the resource requirement for the rendering is maintained sufficiently low (with the number potentially being determined based on a spatial property of the audio source object, such as based on a size or volume of the audio source object).
    • A requirement that a number of points of the set of point audio source positions does not fall below a given number (with the number potentially being determined based on a spatial property of the audio source object, such as based on a size or volume of the audio source object).
    • A requirement that a maximum distance from any point of a surface of the audio source object to a nearest point of the set of point audio source positions is less than a distance. As previously described, the point audio source positions may be determined to be close to the surface of the audio source object.
  • In some embodiments, the placement of point audio source positions is not controlled using a regular pattern, but instead the point audio source positions may be randomly generated within a certain set of rules, for example but not limited to, a minimum distance between individual sources, a maximum number of sources, or a minimum distance to the surface of the extent.
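  • A minimal sketch of such rule-based random placement is given below, assuming hypothetical contains() and distance_to_surface() queries against the geometric model; rejection sampling is one simple way to enforce the rules, not the only one.

```python
import math
import random

def random_point_positions(contains, distance_to_surface, bounds,
                           min_spacing, max_sources, max_surface_dist,
                           max_tries=10000):
    """Draw candidate positions uniformly in the bounding box and keep
    those inside the object, close enough to its surface, and not too
    close to an already accepted point."""
    accepted = []
    for _ in range(max_tries):
        if len(accepted) >= max_sources:
            break
        p = tuple(random.uniform(lo, hi) for lo, hi in bounds)
        if not contains(p):
            continue
        if distance_to_surface(p) > max_surface_dist:
            continue
        if any(math.dist(p, q) < min_spacing for q in accepted):
            continue
        accepted.append(p)
    return accepted
```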
  • In some embodiments only a single audio signal is received for the audio source object. A point audio source signal for the point audio source positions may then be generated from this received audio signal and indeed in some embodiments the received audio signal may be used directly as the point audio source signal.
  • In other embodiments, some processing may be included such as for example a filtering or level adjustment.
  • In some embodiments, a plurality of point audio source signals may be generated for the point audio source positions for the audio source object. For example, slightly different audio signals may be generated for different parts of the audio source object (e.g. depending on how close they are to the surface or on which surface is the closest). The multiple point audio source signals may be generated from a single received audio signal, or in some cases from multiple received audio signals. When multiple point audio source signals are generated for an audio source object, each point audio source position is typically assigned one of the generated point audio source signals.
  • The point audio source signals may specifically be a duplication of the input audio signal, possibly with a level adjustment. If there is only one input audio signal, then all point audio source positions will typically be assigned the same point audio source signal. If there are multiple input audio signals with associated metadata indicating from which part of the audio source object each should be rendered, then the point audio source signal for an individual point audio source position may be determined based on the audio signal whose indicated part most closely matches the point audio source position.
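  • A minimal sketch of such nearest-match assignment is given below; the 'position' and 'id' metadata fields are hypothetical and merely stand in for whatever signal metadata the bitstream actually carries.

```python
def assign_point_signals(point_positions, input_signals):
    """Assign each point audio source position the input signal whose
    metadata position lies nearest to it. The 'position' and 'id' keys
    are a hypothetical metadata layout, used only for illustration."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # For every point position, pick the input signal with the closest
    # indicated rendering position.
    return {pos: min(input_signals,
                     key=lambda s: sq_dist(s["position"], pos))["id"]
            for pos in point_positions}
```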
  • In many embodiments, the level of the point audio source signals may be adapted. This adaptation may be included to compensate for the audio of the audio source object being represented by multiple audio sources. The audio levels of the point audio source signals may thus specifically depend on the number of point audio source positions and on their positions. For example, in order to render the audio source object with a given audio level using a plurality of point audio sources, the level for each point audio source may be adjusted such that the contributions of all point audio sources combine to provide the desired audio level.
  • In some embodiments, the point audio signal generator 205 may be arranged to determine at least one audio level for the set of point audio source positions in response to an audio level of the audio source object and a number of positions in the set of point audio source positions. The level may further be determined in response to the positions in the set of point audio source positions.
  • As a specific example, the original sound source level for the audio source object may be defined by a reference distance, this being the distance at which the level of the audio signal is known, either as an absolute reproduction level or as a gain that should be applied at that distance from the object.
  • In such an example, a number of measurement positions may be determined at the reference distance from the surface of the audio source object (see FIG. 7), and the audio level for the point audio source positions/ point audio source signals may be adjusted such that when all point audio source signals are rendered, the level received across all of the measurement positions is equal to that which would be received were a single object rendered at the nearest point on the audio source object surface to the measurement position. Such a determination may be performed using an automated optimization function or other estimation paradigm.
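  • The following sketch illustrates one simple stand-in for such a level calibration, assuming incoherent power summation, a 1/r amplitude law, and a single common gain for all point sources; the actual optimization may of course be considerably more elaborate.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def common_point_source_gain(point_positions, measurement_positions,
                             nearest_surface_dists):
    """Estimate a single amplitude gain for all point source signals so
    that the power-summed level at each measurement position roughly
    matches that of a single source placed at the nearest surface point.
    Assumes incoherent summation and a 1/r amplitude law; a crude
    stand-in for the optimization mentioned above."""
    power_ratios = []
    for m, d_ref in zip(measurement_positions, nearest_surface_dists):
        target = (1.0 / d_ref) ** 2                      # single-source power
        combined = sum((1.0 / max(euclid(m, p), 1e-6)) ** 2
                       for p in point_positions)         # unit-gain point sources
        power_ratios.append(target / combined)
    # Average power ratio across measurement positions -> amplitude gain.
    return math.sqrt(sum(power_ratios) / len(power_ratios))
```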
  • In some embodiments, the data generator 207 may be arranged to generate the data representation to include a relative render priority for the set of point audio source positions.
  • For example, for each point audio source position, metadata may be generated which indicates a relative priority of the point audio source position relative to other point audio source positions. The priority could for example be provided as a ranking of all of the point audio source positions in order of importance for the rendering in accordance with any suitable criterion. As another example, the point audio source positions may be allocated a rendering priority from a given set of possible rendering priorities. For example, each point audio source position may be indicated as being mandatory, preferred, or optional.
  • The renderer may in such embodiments select a set of point audio source positions from which to render audio signals in response to the relative render priority. The audio source object may then be rendered by rendering the selected set of point audio source positions. The renderer may in some scenarios select a subset of point audio source positions to render based on the rendering priorities.
  • For example, if the renderer has limited resources and is only able to render a given number of audio signals, it may, if more point audio source positions are received than can be rendered, proceed to select a subset based on the rendering priorities. In this way, the rendering can be controlled to typically provide an improved audio experience.
  • For example, the rendering priority may be determined based on spatial relationships between the point audio source positions and possibly relative to spatial properties of the audio source object.
  • For example, all point audio source positions that are closest to a part of the surface of the audio source object may be indicated to be mandatory or have the highest rendering priority. Subsequently, a preferred rendering priority may be given to point audio source positions that have a given minimum distance to the point audio source positions that were given the highest rendering priority, and a given minimum distance to other point audio source positions that are given a preferred rendering priority. Finally, all other point audio source positions may be assigned an optional rendering priority.
  • A constrained renderer may in this case select which point audio source positions to render by first selecting all point audio source positions that are indicated to be mandatory, then point audio source positions that are indicated to be preferred, and finally, if sufficient resources are still available, point audio source positions that are indicated to be optional. If only some point audio source positions of a given category can be selected (e.g. due to resource constraints), a suitable selection criterion may be used (including e.g. merely randomly selecting point audio source positions within a given category, or e.g. selecting point audio source positions to have the maximum distance between them).
  • Such an approach may provide improved spatial audio perception for constrained rendering by e.g. ensuring that point audio source positions contributing most significantly to the perception of the spatial extent are rendered.
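  • A minimal sketch of the category-based selection just described, with random selection within a category when the rendering budget cannot cover it in full:

```python
import random

def select_for_budget(prioritized_points, budget):
    """Constrained-renderer selection: take all 'mandatory' positions,
    then 'preferred', then 'optional', choosing randomly within a
    category when the budget cannot cover it in full. Input is a list
    of (position, priority) pairs; a sketch only."""
    selected = []
    for category in ("mandatory", "preferred", "optional"):
        candidates = [pos for pos, prio in prioritized_points
                      if prio == category]
        remaining = budget - len(selected)
        if remaining <= 0:
            break
        if len(candidates) <= remaining:
            selected.extend(candidates)          # whole category fits
        else:
            selected.extend(random.sample(candidates, remaining))
    return selected
```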
  • In some embodiments the priority indication may be dependent on a distance between the audio source object/ point audio source position and a listening position.
  • Such a priority may be used to reduce the number of point audio source positions that are rendered based on the user-source distance, i.e. the distance between the listening position and the audio source object/point audio source position. As the listening position moves further from the audio source object, the perceived relative width of the audio source object decreases and as such the number of points needed to give an adequate perceived width also decreases. Such an approach may allow a reduction in rendering complexity with increasing source distance. The renderer may for example determine an audio source object or point audio source position to listening position distance and then select the point audio source positions for rendering in response to this distance and the relative distance dependent rendering priorities.
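  • One plausible, purely illustrative mapping from listener distance to a rendered point count assumes that the subtended angle of a fixed-width source falls roughly as the inverse of the distance; both the 1/distance law and the reference distance below are assumptions for the sketch.

```python
def count_for_distance(num_points, listener_dist, ref_dist):
    """Scale the number of rendered point sources with listener distance:
    at or inside ref_dist all points are used; beyond it the count falls
    roughly as 1/distance. The scaling law is an illustrative assumption."""
    fraction = min(1.0, ref_dist / max(listener_dist, 1e-6))
    return max(1, round(num_points * fraction))
```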
  • In some embodiments, the data generator 207 may be arranged to generate the data representation to include directional sound propagation data for at least one position of the set of point audio source positions. The rendering may then render the audio signal from that point audio source position in response to the directional sound propagation data for the position.
  • In some embodiments, directional response data may be added to the metadata for each of the additional point audio source positions such that e.g. the reproduction when the listener is external to the audio source object may be controlled separately to the internal reproduction. Such a directional response may be described as a polar pattern, or some other representation known to the rendering technology.
  • For example, a gain as a function of direction may be provided for a point audio source position and when rendering a point audio signal from that point audio source position the gain of the rendered signal may be determined in response to the gain provided for the direction from the point audio source position to the listening position.
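  • As a simple illustration, the directional gain for a first-order cardioid pattern (one common polar pattern; the metadata may of course describe others) could be evaluated as follows.

```python
import math

def cardioid_gain(to_listener, source_axis):
    """Directional gain of a first-order cardioid, 0.5 * (1 + cos(theta)),
    where theta is the angle between the source's forward axis and the
    direction from the source to the listener. A sketch of one possible
    polar pattern only."""
    dot = sum(a * b for a, b in zip(to_listener, source_axis))
    norms = (math.sqrt(sum(a * a for a in to_listener)) *
             math.sqrt(sum(a * a for a in source_axis)))
    cos_theta = dot / max(norms, 1e-9)   # guard against zero-length vectors
    return 0.5 * (1.0 + cos_theta)
```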
  • In some embodiments, the audio apparatus may be arranged to generate the point audio source positions based on a spatial relationship between the audio source object and a reference viewing/ listening region. The reference viewing/ listening region may specifically be a region to which the listener is constrained, or e.g. within which it is likely that the listener is positioned.
  • In some embodiments, the position circuit 203 may be arranged to determine the set of point audio source positions in response to a two dimensional extent of the audio source object when viewed from a given region relative to a position of the audio source object. In some embodiments, the position circuit 203 may be arranged to determine the set of point audio source positions in response to a representation of the audio source object when viewed from a given listening region relative to a position of the audio source object.
  • As an example, in an application the listener may be limited to a region of a virtual environment, and the audio apparatus may be aware of this region. The position circuit 203 may determine the maximum visible angle of the audio source object with respect to one or more positions within the listener's region, and it may consider only the dimensions for which the viewable angle exceeds a given threshold. This viewable angle can also inform the audio apparatus of the number of point audio source positions that should be used to provide the desired perceived source width.
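  • The viewable angle test may be sketched as below for the idealized case of a flat extent viewed face-on; the threshold value in the usage comment is purely illustrative.

```python
import math

def subtended_angle(extent_width, distance):
    """Visible angle (radians) of an extent of the given width seen
    face-on from the given distance; idealized geometry used only to
    illustrate the threshold test described above."""
    return 2.0 * math.atan2(extent_width / 2.0, distance)

# A dimension might be treated as perceptually extended only if, from the
# nearest point of the listening region, it subtends more than e.g. 10
# degrees (an illustrative threshold):
# is_extended = subtended_angle(width, distance) > math.radians(10.0)
```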
  • In some embodiments, the data generator 207 may further be arranged to include an indication of a propagation time parameter for the set of point audio source positions.
  • The renderer may be arranged to render audio signals from the point audio source positions in response to the propagation time parameter. Specifically, in many embodiments, the renderer may be arranged to render the audio signals from all point audio source positions with the same propagation time, and with this propagation time being determined from the propagation time parameter.
  • Thus, although different point audio source positions will have different path lengths to the listening position, the audio sources are generated to have the same propagation time. The audio signals may be generated with other spatial cues reflecting the different positions of the point audio source positions. For example, differential delays between a right ear signal and a left ear signal (for HRTF processing) may reflect the actual position within the audio source object, etc. Indeed, such an approach has been found to provide improved spatial perception in many scenarios and, in particular, to provide the perception of a cohesive audio source object with spatial extent.
  • In some embodiments, the distance dependent delay that is used by the renderer to simulate the time taken for the audio to propagate from the audio source object to the listener may be controlled by additional metadata linked to point audio source positions. The metadata may indicate that the time of flight for all sources should be considered to be the same as e.g. that of the nearest source to the listener or some other metric.
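  • A minimal sketch of such an equalized time-of-flight computation, using the nearest point source to the listener as the common reference (one option permitted, but not mandated, by the metadata described above):

```python
import math

def shared_propagation_delay(point_positions, listener_pos,
                             speed_of_sound=343.0):
    """Assign every point source the same propagation delay, taken from
    the nearest point source to the listener. Distances in metres,
    delay in seconds; a sketch only."""
    def euclid(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(euclid(listener_pos, p) for p in point_positions)
    delay = nearest / speed_of_sound
    # Every point source gets the common delay; other spatial cues
    # (e.g. interaural differences) may still reflect actual positions.
    return {tuple(p): delay for p in point_positions}
```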
  • An extent of an audio source may be a spatial extension of an audio source. An extent audio source may be an audio source having a spatial extent. An extent audio source may be a spatially extended audio source. An audio source having an extent may be referred to as an audio source having a spatial extension.
  • It will be appreciated that the above description for clarity has described embodiments of the invention with reference to different functional circuits, units and processors. However, it will be apparent that any suitable distribution of functionality between different functional circuits, units or processors may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controllers. Hence, references to specific functional units or circuits are only to be seen as references to suitable means for providing the described functionality rather than indicative of a strict logical or physical structure or organization.
  • The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.
  • Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term comprising does not exclude the presence of other elements or steps.
  • Furthermore, although individually listed, a plurality of means, elements, circuits or method steps may be implemented by e.g. a single circuit, unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked, and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (18)

  1. An audio apparatus comprising:
    a receiver (201) arranged to receive an audio signal and a geometric model for an audio source object;
    a position circuit (203) arranged to determine a set of point audio source positions for the audio source object in response to the geometric model, the point audio source positions being spatially distributed within the audio source object;
    a point audio signal generator (205) arranged to generate at least one point audio source signal for the point audio source positions from the audio signal; and
    a data generator (207) arranged to generate a data representation of the audio source object comprising the set of point audio source positions and the at least one point audio source signal.
  2. The audio apparatus of claim 1 wherein the position circuit (203) is arranged to determine the point audio source positions such that a distance from any position of the set of point audio source positions to a surface of the audio source object does not exceed a distance threshold.
  3. The audio apparatus of any previous claim wherein the position circuit (203) is arranged to determine a set of intersect positions of a grid and to select the set of point audio source positions from the intersect positions.
  4. The audio apparatus of any previous claim wherein the position circuit (203) is arranged to determine the set of point audio source positions to satisfy at least one requirement of:
    a requirement that a maximum distance between each position of the set of point audio source positions and a nearest position of the set of point audio source positions is less than a first distance threshold;
    a requirement that a minimum distance between each position of the set of point audio source positions and a nearest position of the set of point audio source positions is more than a second distance threshold;
    a requirement that a number of points of the set of point audio source positions does not exceed a first number;
    a requirement that a number of points of the set of point audio source positions is not below a second number; and
    a requirement that a maximum distance from any point of a surface of the audio source object to a nearest point of the set of point audio source positions is less than a third distance threshold.
  5. The audio apparatus of any previous claim wherein the point audio signal generator (205) is arranged to determine at least one audio level for the set of point audio source positions in response to an audio level for the audio source object and a number of positions in the set of point audio source positions.
  6. The audio apparatus of claim 5 wherein the point audio signal generator (205) is arranged to determine the at least one audio level in response to positions in the set of point audio source positions.
  7. The audio apparatus of any previous claim wherein the data generator (207) is arranged to generate the data representation to include a relative render priority for the set of point audio source positions.
  8. The audio apparatus of any previous claim wherein the data generator (207) is arranged to generate the data representation to include directional sound propagation data for at least one position of the set of point audio source positions.
  9. The audio apparatus of any previous claim wherein the position circuit (203) is arranged to determine the set of point audio source positions in response to a two dimensional extent of the audio source object when viewed from a given region relative to a position of the audio source object.
  10. The audio apparatus of any previous claim wherein the position circuit (203) is arranged to determine a number of dimensions for which the audio source object has an extent exceeding a threshold, and to generate the set of point audio source positions as a structure having the number of dimensions.
  11. The audio apparatus of any previous claim wherein the data generator (207) is arranged to generate the data representation to include an indication of a propagation time parameter for the set of point audio source positions.
  12. The audio apparatus of any previous claim further comprising a renderer (301) arranged to render the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  13. The audio apparatus of any previous claim further comprising an encoder (401) for generating an encoded bitstream comprising the data representation of the audio source object.
  14. An audio apparatus comprising:
    a receiver for receiving an audio bitstream comprising a data representation of an audio source object having a spatial extent, the data representation comprising:
    a set of point audio source positions being distributed within the audio source object, and
    at least one point audio source signal for the point audio source positions; and
    a renderer arranged to render the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  15. A method of operation for an audio apparatus, the method comprising
    receiving an audio signal and a geometric model for an audio source object;
    determining a set of point audio source positions for the audio source object in response to the geometric model, the point audio source positions being spatially distributed within the audio source object;
    generating at least one point audio source signal for the point audio source positions from the audio signal; and
    generating a data representation of the audio source object comprising the set of point audio source positions and the at least one point audio source signal.
  16. A method of operation for an audio apparatus, the method comprising:
    receiving an audio bitstream comprising a data representation of an audio source object having a spatial extent, the data representation comprising:
    a set of point audio source positions being distributed within the audio source object, and
    at least one point audio source signal for the point audio source positions; and
    rendering the audio source object by rendering the at least one point audio source signal from the set of point audio source positions.
  17. A computer program product comprising computer program code means adapted to perform all the steps of claim 15 or 16 when said program is run on a computer.
  18. An audio bitstream comprising a data representation of an audio source object having a spatial extent; the data representation comprising
    a set of point audio source positions being distributed within the audio source object; and
    at least one point audio source signal for the point audio source positions.