EP4207816A1 - Audio processing - Google Patents

Audio processing

Info

Publication number
EP4207816A1
Authority
EP
European Patent Office
Prior art keywords
user
audio
movement
subset
spatial
Prior art date
Legal status
Pending
Application number
EP21218297.6A
Other languages
German (de)
French (fr)
Inventor
Lasse Juhani Laaksonen
Arto Juhani Lehtiniemi
Antti Johannes Eronen
Current Assignee
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Priority to EP21218297.6A
Publication of EP4207816A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H04S 7/304 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field

Definitions

  • Example embodiments relate to audio processing, for example audio processing which modifies respective perceived spatial positions of audio sources in a spatial audio scene to counter movement of a user position.
  • Spatial audio refers to audio which, when output to a user device such as a pair of earphones, enables a user to perceive one or more audio sources as coming from respective directions with respect to the user's position. For example, one audio source may be perceived as coming from a position in front of the user whereas other audio sources may be perceived as coming from positions to the left and right-hand sides of the user. In spatial audio, the user may perceive such audio sources as coming from positions external to the user's position, in contrast to, for example, stereoscopic audio in which audio is effectively perceived within the user's head and where an audio source may be panned between the ears or, in some cases, played back to one ear only. Spatial audio may therefore provide a more life-like and immersive user experience.
  • an apparatus comprising means for: tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space; identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • a subset may comprise one audio source.
  • the first orientation may correspond to the orientation of the user's head and the first region may correspond to a region to the front of the user's head.
  • the first region may comprise a sector to the front of the user's head having a central angle of less than 180 degrees.
  • the central angle may be between 30 and 60 degrees.
  • the first position may correspond to the first orientation and the second position may correspond to a second orientation.
  • the tracking means may be configured to track angular movement between the first and second orientations.
  • the modifying means may be configured to modify the respective perceived spatial positions of the first subset of the audio sources if the tracked angular movement is within a predefined angular range of movement.
  • the predefined angular range may correspond with the central angle of the first region sector.
  • the first and second positions may correspond to respective first and second spatial positions of the real-world space, wherein the tracking means may be configured to track translational movement between the first and second spatial positions and the modifying means may be configured to modify the respective perceived spatial positions of the first subset of the audio sources if the tracked translational movement is within a predefined translational range of movement.
  • the modifying means may be further configured, responsive to the tracked movement going beyond the predefined range of movement, to disable further modification of the respective perceived spatial positions of the first subset of the audio sources.
  • the apparatus may further comprise means for: determining that, subsequent to movement of the user position from the first position to the second position within the real-world space, the tracked user movement over a predetermined time period is below a predetermined movement amount; and updating, in response to said determination, the first region of the real-world space such that it is defined with respect to the orientation of the user at the second position.
  • the apparatus may further comprise means for updating the respective perceived spatial positions of the first subset of the audio sources such that they are returned to their previous respective spatial positions in the spatial audio scene with respect to the second subset of the audio sources.
  • the apparatus may further comprise means for: identifying that the first subset of audio sources, comprising a plurality of audio sources, are of interest to the user based on one or more tracked movement characteristics; and responsive to said identification, modifying the respective perceived spatial positions of the first subset of audio sources such that they are perceived as more spatially spread apart, at least temporarily.
  • the identifying means may be configured to identify the one or more tracked movement characteristics as movements that cycle between limits of the predefined range of movements.
  • the amount of modification to spatially spread apart the respective perceived spatial positions may be based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • Limits of the first region and/or the permissible range of movement may be dynamically changeable based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • the data representing the spatial audio scene may be representative of a musical performance.
  • the audio output device may comprise a set of earphones.
  • the modifying means may be configured to perform said modification in response to detecting that the data representing the spatial audio scene is representative of a particular type of spatial audio scene and/or has associated metadata indicative that said modification is to be performed for the spatial audio scene.
  • a method comprising: tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space; identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • the first orientation may correspond to the orientation of the user's head and the first region may correspond to a region to the front of the user's head.
  • the first region may comprise a sector to the front of the user's head having a central angle of less than 180 degrees.
  • the central angle may be between 30 and 60 degrees.
  • the first position may correspond to the first orientation and the second position may correspond to a second orientation.
  • the tracking may comprise tracking angular movement between the first and second orientations.
  • the modifying may comprise modifying the respective perceived spatial positions of the first subset of the audio sources if the tracked angular movement is within a predefined angular range of movement.
  • the predefined angular range may correspond with the central angle of the first region sector.
  • the first and second positions may correspond to respective first and second spatial positions of the real-world space.
  • the tracking may comprise tracking translational movement between the first and second spatial positions.
  • the modifying may comprise modifying the respective perceived spatial positions of the first subset of the audio sources if the tracked translational movement is within a predefined translational range of movement.
  • the modifying may further comprise, responsive to the tracked movement going beyond the predefined range of movement, disabling further modification of the respective perceived spatial positions of the first subset of the audio sources.
  • the method may further comprise: determining that, subsequent to movement of the user position from the first position to the second position within the real-world space, the tracked user movement over a predetermined time period is below a predetermined movement amount; and updating, in response to said determination, the first region of the real-world space such that it is defined with respect to the orientation of the user at the second position.
  • the method may further comprise updating the respective perceived spatial positions of the first subset of the audio sources such that they are returned to their previous respective spatial positions in the spatial audio scene with respect to the second subset of the audio sources.
  • the method may further comprise: identifying that the first subset of audio sources, comprising a plurality of audio sources, are of interest to the user based on one or more tracked movement characteristics; and responsive to said identification, modifying the respective perceived spatial positions of the first subset of audio sources such that they are perceived as more spatially spread apart, at least temporarily.
  • the identified one or more tracked movement characteristics may be movements that cycle between limits of the predefined range of movements.
  • the amount of modification to spatially spread apart the respective perceived spatial positions may be based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • Limits of the first region and/or the permissible range of movement may be dynamically changeable based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • the data representing the spatial audio scene may be representative of a musical performance.
  • the audio output device may comprise a set of earphones.
  • Modifying may comprise performing said modification in response to detecting that the data representing the spatial audio scene is representative of a particular type of spatial audio scene and/or has associated metadata indicative that said modification is to be performed for the spatial audio scene.
  • a computer program product comprising a set of instructions which, when executed on an apparatus, is configured to cause the apparatus to carry out the method of any preceding method definition.
  • a non-transitory computer readable medium comprising program instructions stored thereon for performing a method, comprising: tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space; identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • the program instructions of the fourth aspect may also perform operations according to any preceding method definition of the second aspect.
  • an apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to: track a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space; identify a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modify the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • the computer program code of the fifth aspect may also perform operations according to any preceding method definition of the second aspect.
  • Example embodiments relate to an apparatus, method and computer program for audio processing, for example audio processing which modifies respective perceived spatial positions of one or more audio sources in a spatial audio scene to counter movement of a user position. More generally, example embodiments relate to audio processing in the field of spatial audio in which data, which may be referred to as spatial audio data, encodes a so-called spatial audio scene.
  • When the spatial audio data is rendered and output as audio to a user device such as a pair of earphones or similar, it enables a user to perceive one or more audio sources as coming from respective positions, e.g. directions, with respect to the user's position.
  • the user may perceive such audio sources as coming from positions external to the user's position, in contrast to stereoscopic audio in which audio is effectively perceived within the user's head. Spatial audio may therefore provide a more life-like and immersive user experience.
  • Example formats for spatial audio data may include, but are not limited to, multi-channel mixes such as 5.1 or 7.1+4, Ambisonics, parametric spatial audio (e.g., metadata-assisted spatial audio (MASA)), object-based audio, or any combination thereof.
  • Spatial audio rendering may also be characterized in terms of so-called degrees-of-freedom (DoF). For example, if a user's head rotation affects rendering and therefore output of the spatial audio, this may be referred to as 3DoF audio rendering. If a change in the user's spatial position also affects rendering, this may be referred to as 6DoF audio rendering.
  • 3DoF+ rendering may be used to indicate a limited effect of a user's change in spatial position, for example to account for a limited amount of translational movement of the user's head when the user is otherwise stationary. Taking into account this type of movement is known to improve, e.g., externalization of spatial audio sources.
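  • As a non-limiting illustration (not part of the patent disclosure), the sketch below separates the pose components consumed by 3DoF rendering from those additionally consumed by 6DoF or limited 3DoF+ rendering; all names are hypothetical.

```python
# A minimal sketch of the degrees-of-freedom distinction described above:
# 3DoF rendering reacts only to head rotation, while 6DoF (or limited
# 3DoF+) rendering also consumes the tracked head position.
from dataclasses import dataclass

@dataclass
class HeadPose:
    yaw_deg: float                        # rotation, used by 3DoF rendering
    pitch_deg: float
    roll_deg: float
    position_m: tuple = (0.0, 0.0, 0.0)   # translation, used by 6DoF / 3DoF+

def rendering_inputs(pose: HeadPose, dof: str):
    """Return the pose components a renderer of the given DoF class uses."""
    rotation = (pose.yaw_deg, pose.pitch_deg, pose.roll_deg)
    return rotation if dof == "3dof" else rotation + (pose.position_m,)
```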
  • a user's position may be determined in real-time or near real-time, i.e. tracked, using one or more known methods.
  • a user's position may correspond to the position of the user's head.
  • the term "position" may refer to orientation, i.e. a first orientation of the user's head is a first position and a second, different orientation of the user's head is a second position.
  • the term may also refer to spatial position within a real-world space to account for translational movements of the user's head which may or may not accompany translational movements of the user's body.
  • the position of the user's head may be determined using one or more known head-tracking methods, such as by use of one or more cameras which identify facial features in real-time, by use of inertial sensors (gyroscopes/accelerometers) within a head-worn device, such as a set of earphones used to output the spatial audio data, or by use of satellite positioning systems such as the Global Navigation Satellite System (GNSS) or other forms of position determination means, to give but some examples.
  • the spatial audio data may be output to a head-worn device such as a pair of earphones, which term is intended to cover devices such as a pair of earbuds, on-ear or over-ear headphones, and also speakers within a worn headset such as an extended reality (XR) headset.
  • an XR headset may incorporate one or more display screens for presenting video data which may represent part of a virtual video scene.
  • the video data when rendered may present one or more visual objects corresponding to one or more audio sources in the audio scene.
  • In use, a user will be located in a real-world space which may be an indoor space, for example a room or hall, or possibly an outdoor space.
  • the real-world space is to be distinguished over a virtual space that is output to the user device (such as the above-mentioned head-worn device) and therefore perceived by the user.
  • a spatial audio scene is a form of virtual space comprising one or more audio sources coming from respective positions with respect to a user's current position in the real-world space.
  • the spatial audio scene may, for example, represent a musical performance and the one or more audio sources may represent different performers and/or instruments that can be perceived from respective spatial positions with respect to the user's current spatial position and orientation in the real-world space.
  • the spatial audio scene is not, however, limited to musical performances.
  • Head tracking may involve processing and rendering spatial audio data such that respective perceived spatial positions of audio sources are modified to counter user movements.
  • the positions are modified in a counter, or opposite, manner; this gives the perception that the audio sources remain in their original position even though the user has changed position, whether via rotation or translation.
  • responsive to a tracked clockwise rotation of the user's head, the spatial audio data may be modified so that respective perceived spatial positions of audio sources are moved counter-clockwise, for example by the same number of degrees.
  • responsive to a tracked translational movement of the user's head, for example five centimetres to the left, the spatial audio data may be modified so that the respective perceived spatial positions of audio sources are moved to the right, for example also by five centimetres.
  • the amount of countering movement is not necessarily the same as the amount of tracked user movement, as sketched below.
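  • To make the countering behaviour concrete, the following is a minimal sketch (illustrative only, with hypothetical names, not the patented implementation) of a full head-tracking effect in the horizontal plane: every source's rendered azimuth is rotated opposite to the tracked head yaw so that sources appear fixed in the real-world space.

```python
def render_azimuths(world_azimuths_deg, head_yaw_deg):
    """Azimuth of each source relative to the listener's head, in degrees.

    A clockwise head turn of head_yaw_deg rotates every rendered source
    counter-clockwise by the same amount, so the sources seem to stay put
    in the real-world space.
    """
    return [((az - head_yaw_deg + 180.0) % 360.0) - 180.0   # wrap to [-180, 180)
            for az in world_azimuths_deg]

# A source dead ahead (0 degrees) is rendered 30 degrees to the listener's
# left after a 30-degree clockwise head turn.
print(render_azimuths([0.0, 90.0], head_yaw_deg=30.0))  # [-30.0, 60.0]
```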
  • the user may be able to determine more precisely the direction of a given audio source in the spatial audio scene.
  • the user may better experience externalization and immersion.
  • head tracking effects may confuse the experience.
  • Example embodiments are directed at mitigating or avoiding such issues to provide an improved user experience when consuming spatial audio.
  • FIG. 1 is a block diagram of a system 100 which may be useful for understanding example embodiments.
  • the system 100 may comprise a server 110, a media player 120, a network 130 and a set of earphones 140.
  • the server 110 may be connected to the media player 120 by means of the network 130 for sending data, e.g., spatial audio data, to the media player 120.
  • the server 110 may for example send the data to the media player 120 responsive to one or more data requests sent by the media player 120.
  • the media player 120 may transmit to the server 110 an indication of a position associated with a user of the media player 120, and the server may process and transmit back to the media player 120 spatial audio data responsive to the received position, which may be in real-time or near real-time. This may be by means of any suitable streaming data protocol.
  • the server 110 may provide one or more files representing spatial audio data to the media player 120 for storage and processing thereat.
  • the spatial audio data may be processed, rendered and output to the set of earphones 140.
  • the set of earphones 140 may comprise head tracking sensors for indicating to the media player 120, using any suitable method, a current position of the user, e.g., one or both of the orientation and spatial position of the user's head, in order to determine how the spatial audio data is to be rendered and output.
  • the media player 120 may comprise part of the set of earphones 140.
  • the network may be any suitable data communications network including, for example, one or more of a radio access network (RAN) whereby communication is via one or more base stations, a WiFi network whereby communication is via one or more access points, or a short-range network such as one using the Bluetooth or Zigbee protocol.
  • FIGs. 2A - 2C are representational drawings of a user 210 wearing the set of earphones 140 which may also be useful for understanding example embodiments.
  • the user 210 is shown listening to a rendered audio field comprised of first to fourth audio sources (collectively indicated by reference numeral 220), e.g., corresponding to distinct respective sounds labelled "1", "2", "3" and "4".
  • the respective perceived spatial positions of the first to fourth audio sources 220 are indicated with respect to the user's head for the case that the audio data is spatial audio data but with no head tracking effect implemented. It will be seen, for example, that counter-clockwise rotation of the user's head results in no modification of the spatial audio data because the respective perceived spatial positions of the first to fourth audio sources 220 follow the user's movement.
  • the respective perceived spatial positions of the first to fourth audio sources 220 are indicated with respect to the user's head for the case that the audio data is spatial audio data and the above-mentioned head tracking effect is implemented. It will be seen, in this case, that counter-clockwise rotation of the user's head results in modification of the spatial audio data because the respective perceived spatial positions of the first to fourth audio sources 220 are relatively static in the audio field even though the user is moving.
  • FIG. 3 is a flow diagram showing processing operations, indicated generally by reference numeral 300, according to example embodiments.
  • the processing operations 300 may be performed in hardware, software, firmware, or a combination thereof.
  • the processing operations may be performed by the media player 120 shown in FIG. 1 .
  • a first operation 302 may comprise tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space.
  • a second operation 304 may comprise identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region.
  • a subset may comprise one audio source.
  • a third operation 306 may comprise, responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • Example embodiments may therefore provide a form of partial head tracking in the sense that, when a user position moves, e.g. from the first position to the second position within the real-world space, the head tracking effect is applied to the first subset of audio sources and not the second subset of audio sources, as will be explained below with the help of visual examples.
  • spatial audio data representing a musical performance with clear left-right balance and separation, and with important content in the centre or front, may have this partial head tracking applied to audio sources to the front of the user to allow a degree of immersion and localisation whilst maintaining artistic intent and avoiding confusing effects.
  • the first orientation may correspond to the orientation of the user's head and the first region may correspond to a region to the front of the user's head.
  • the first region may comprise a sector to the front of the user's head having a central angle of less than 180 degrees.
  • the central angle may be between 30 and 60 degrees but may vary depending on application.
  • the movement of the user position from the first position to the second position may be an angular movement.
  • the first position may correspond to the first orientation and the second position may correspond to a second orientation.
  • Angular movement may be tracked between the first and second orientations and the respective perceived spatial positions of the first subset of the audio sources may be modified if the tracked angular movement is within a predefined angular range of movement.
  • the predefined angular range may correspond with the central angle of the first region sector, e.g. both may be 30 degrees, but they need not be the same.
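  • By way of a hedged example (illustrative code, not the claimed implementation), the subset identification of the second operation 304 might look as follows, assuming each source is described by an azimuth in degrees relative to the user's first orientation and the first region is a front sector of central angle sector_deg.

```python
def split_subsets(source_azimuths_deg, sector_deg=30.0):
    """Split source indices into (first_subset, second_subset).

    A source belongs to the first subset if its perceived direction lies
    within +/- sector_deg / 2 of the user's current orientation (azimuth 0).
    """
    half = sector_deg / 2.0
    first, second = [], []
    for i, az in enumerate(source_azimuths_deg):
        (first if abs(az) <= half else second).append(i)
    return first, second

# Four sources as in FIG. 4A: one ahead, the others to the sides and rear.
print(split_subsets([0.0, 90.0, 180.0, -90.0]))  # ([0], [1, 2, 3])
```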
  • FIGs. 4A- 4C are plan views of a user 402 in real-world space together with an indication of a spatial audio scene that the user perceives through a pair of earphones 404.
  • the spatial audio scene may comprise first to fourth audio sources 410, 412, 414, 416 perceived from respective spatial positions with respect to the user's shown orientation 406.
  • a first region is defined in the form of a sector 408 having in this example a central angle of 30 degrees and this angle may also correspond with a predefined angular range of movement through which partial head tracking may be performed. This equates to 15 degrees angular movement either side of the user's current orientation 406.
  • the first audio source 410 corresponds to the sector 408 and hence head tracking may be performed in respect of this audio source and not performed in respect of the second to fourth audio sources 412, 414, 416.
  • the perceived spatial position of the first audio source 410 is modified so that it is static in the spatial audio scene whereas the respective perceived spatial positions of the second to fourth audio sources 412, 414, 416, outside of the sector 408, are unmodified (in this sense) and move with the user's change in orientation. The user is thus able to better localise the position of the first audio source 410.
  • in the modifying operation, responsive to the tracked movement going beyond the predefined range of movement, which in this example is 15 degrees to one side of the FIG. 4A orientation 406, further modification of the respective perceived spatial positions of the first subset of the audio sources may be disabled.
  • if the tracking operation tracks a change in the user's orientation by a further 15 degrees, the perceived spatial position of the first audio source 410 moves 15 degrees with the user's change in orientation, as do those of the second to fourth audio sources 412, 414, 416, and hence the spatial relationship between the first to fourth audio sources remains fixed.
  • the permitted 15-degree range of movement is indicated by region 420.
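  • The range-limited behaviour of FIGs. 4B and 4C might be sketched as below (an assumption-laden illustration rather than the patented method as such): only the portion of the tracked yaw inside the predefined range is countered for the first subset, so beyond the limit the whole scene again moves with the head and the inter-source layout stays fixed.

```python
def partial_head_tracking(scene_azimuths_deg, first_subset, head_yaw_deg,
                          range_deg=30.0):
    """Rendered azimuths for all sources, given the tracked head yaw.

    Scene azimuths are given relative to the user's first orientation.
    Sources in first_subset are countered (kept world-static) only while
    the yaw stays within +/- range_deg / 2; other sources follow the head.
    """
    half = range_deg / 2.0
    countered = max(-half, min(half, head_yaw_deg))  # clamp beyond the limit
    return [az - countered if i in first_subset else az
            for i, az in enumerate(scene_azimuths_deg)]

scene = [0.0, 90.0, 180.0, -90.0]
print(partial_head_tracking(scene, {0}, head_yaw_deg=10.0))  # [-10.0, 90.0, 180.0, -90.0]
print(partial_head_tracking(scene, {0}, head_yaw_deg=25.0))  # [-15.0, 90.0, 180.0, -90.0]
```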
  • the tracked user movement may be below a predetermined movement amount for a predetermined time period.
  • the user may remain relatively still in the second position, or move by an amount within a predetermined movement threshold, and this may be the case for greater than, say, 10 seconds.
  • the first region of the real-world space may be updated such that it is defined with respect to the orientation of the user at the second position. This may be termed a first region reset.
  • FIG. 5A corresponds with FIG. 4C described above.
  • a first region reset may be initiated. Referring to FIG. 5B , this may involve moving the perceived spatial position of the first audio source 410 based on the second position, in this case to become aligned with the shown orientation 502.
  • the first region is then updated such that, in this case, it becomes the new front sector 508 having a central angle of 30 degrees.
  • the respective perceived spatial positions of the first subset of the audio sources 410 may be returned to their previous respective spatial positions in the spatial audio scene with respect to the second subset of the audio sources, comprising the second to fourth audio sources 412, 414, 416 such that the scene orientation corresponds to that shown in FIG. 4A , assuming the audio source orientations themselves have not dynamically changed.
  • the new front sector 508 then defines the first subset of the audio sources and the second subset of the audio sources for future partial head tracking operations.
  • the dashed circle 510 indicates the previous perceived spatial position of the first audio source 410.
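  • One possible, purely illustrative, way to detect the stillness condition that triggers a first region reset is sketched below; the threshold names and values are assumptions, e.g. a 10-second hold as in the example above.

```python
import time

class RegionResetDetector:
    """Signals a first region reset after a period of relative stillness."""

    def __init__(self, still_deg=2.0, hold_s=10.0):
        self.still_deg = still_deg    # per-update movement considered "still"
        self.hold_s = hold_s          # how long stillness must persist
        self._still_since = None

    def update(self, movement_deg, now=None):
        """Feed the latest tracked movement; return True when a reset is due."""
        now = time.monotonic() if now is None else now
        if abs(movement_deg) < self.still_deg:
            if self._still_since is None:
                self._still_since = now
            if now - self._still_since >= self.hold_s:
                self._still_since = None   # re-arm after signalling a reset
                return True
        else:
            self._still_since = None       # movement too large: restart timer
        return False
```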
  • further operations may comprise identifying that the first subset of audio sources, when comprising a plurality of audio sources, are of interest to the user based on one or more tracked movement characteristics, and responsive to said identification, modifying the respective perceived spatial positions of the first subset of audio sources such that they are perceived as more spatially spread apart. In this way, the user may perceive the audio sources better because they are spatially spread apart.
  • identifying the first subset of the audio sources as being of interest to the user may comprise identifying one or more tracked movement characteristics as movements that cycle between limits of the predefined range of movement.
  • FIGs. 6A- 6C are plan views of a user 602 in a real-world space together with an indication of a spatial audio scene that the user perceives through a pair of earphones 604.
  • the spatial audio scene may comprise first to sixth audio sources 610, 612, 614, 616, 618, 620 perceived from respective spatial positions with respect to the user's shown orientation 606.
  • a first region is defined in the form of a sector 608 having a central angle of 30 degrees, as before, and this may also correspond with a predefined angular range of movement through which partial head tracking may be performed. This may equate to 15 degrees angular movement either side of the user's current orientation.
  • first audio source 610 and the second audio source 612 correspond to the sector 608 and hence head tracking may be performed in respect of these audio sources and not performed in respect of the third to sixth audio sources 614, 616, 618, 620.
  • the user 602 may change orientation in a cyclical manner, i.e. first counter-clockwise and then clockwise, within, and possibly slightly beyond, the predefined range of movements.
  • this may trigger the first audio source 610 and the second audio source 612 to be spatially separated so that they may be perceived better by the user 602.
  • This separation may occur temporarily, e.g. for a predetermined time period.
  • This separation may also allow the predetermined range of movements over which head tracking may be performed to be relaxed, at least temporarily, and in respect of at least one of the first audio source 610 and the second audio source 612.
  • the amount of modification applied in order to spatially spread apart the respective perceived spatial positions of the first sector audio sources may be based on the distance of at least one of the audio sources 610, 612 from the position of the user 602.
  • FIG. 7A shows a partial plan view of a user 702 in real-world space together with an indication of part of a spatial audio scene, perceived through a pair of earphones 704.
  • the spatial audio scene may include first, second and third audio sources 710, 712, 714 within the first sector 708.
  • Referring to FIG. 7B, responsive to identifying that the first to third audio sources 710, 712, 714 are of interest, it will be seen that the first audio source 710, which is the closest to the user 702, remains relatively static, whereas the second and third audio sources 712, 714 are spread apart from the first audio source by different amounts based on their respective distances from the user 702.
  • identification that the first subset of audio sources are of interest to the user may be an alternative or additional trigger to perform a first region reset as described previously with reference to FIGs. 5A - 5C.
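  • A hedged sketch of the distance-scaled spreading of FIGs. 7A and 7B follows; the gain parameter and the sign convention are illustrative assumptions only.

```python
def spread_apart(azimuths_deg, distances_m, gain_deg_per_m=5.0):
    """Spread in-sector sources of interest, scaling by source distance.

    The closest source stays (almost) in place; farther sources are pushed
    away from the sector centre by an amount growing with their distance.
    """
    ref = min(distances_m)
    return [az + (1.0 if az >= 0 else -1.0) * gain_deg_per_m * (d - ref)
            for az, d in zip(azimuths_deg, distances_m)]

# Three in-sector sources at 1 m, 2 m and 3 m (cf. FIGs. 7A/7B): the closest
# remains static while the others are spread by increasing amounts.
print(spread_apart([0.0, 8.0, -8.0], [1.0, 2.0, 3.0]))  # [0.0, 13.0, -18.0]
```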
  • the limits of the first region and the predefined range of movement may be fixed. In some embodiments, the limits of one or both may change, by user input and/or adaptively.
  • a device implementing example embodiments may analyse a complexity and/or spatial closeness of the first subset of audio sources and the second subset of audio sources and then update the limits (and therefore the size) of the first region and/or the predefined range of movements based on such analysis.
  • such limits may be based on values that are part of a description associated with the spatial audio data, e.g., in metadata. This may be based, in the case of a musical performance, on an artist recommendation or service preference settings.
  • the metadata can be time-varying metadata, wherein the limits may vary during the course of the music performance's runtime.
  • the metadata may comprise context-dependent modifications. For example, a detected user environment or activity can be taken into account when determining the limits to be used. For example, if a user is considered to be focusing on certain audio sources, more freedom for head-tracking may be allowed by widening the first region and/or the predefined range of movements.
  • User preferences may also be taken into account. For example, a user may select whether to use, at a given time, no head tracking, full head tracking, or partial head tracking as described herein.
  • Example embodiments have so far focussed on 3DoF head tracking.
  • Example embodiments may be extended to account for 3DoF+ in which some amount of translational movement may be additionally or alternatively tracked.
  • first and second positions referred to with regard to the operations in FIG. 3 may correspond to respective first and second spatial positions of the real-world space.
  • Translational movement may be tracked between the first and second spatial positions and the respective perceived spatial positions of the first subset of the audio sources may be modified if the tracked translational movement is within a predefined translational range of movement.
  • the predefined translational range of movement may be up to 10 centimetres but it may be greater.
  • adaptation of the limits of the first region and/or the predefined range of movement may be based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • Distance information may form part of the metadata within spatial audio data, i.e. indicating respective distances or depths associated with audio sources in the audio scene.
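  • The translational (3DoF+) case might, purely as an illustration under the stated 10-centimetre example, be sketched as follows; positions are 2D (x, y) metres in the real-world space and all names are hypothetical.

```python
def counter_translation(scene_positions_m, first_subset, head_offset_m,
                        range_m=0.10):
    """Shift first-subset sources opposite to the head offset, up to range_m.

    Only the portion of the head translation inside the predefined range is
    countered, mirroring the angular clamping described earlier.
    """
    ox, oy = head_offset_m
    norm = (ox * ox + oy * oy) ** 0.5
    if norm > range_m:                      # beyond the range: stop countering
        ox, oy = ox * range_m / norm, oy * range_m / norm
    return [(x - ox, y - oy) if i in first_subset else (x, y)
            for i, (x, y) in enumerate(scene_positions_m)]

# A 5 cm rightward head move leaves the tracked source world-static (it is
# rendered 5 cm to the left of the head) while the untracked source follows.
print(counter_translation([(0.0, 1.0), (1.0, 0.0)], {0}, (0.05, 0.0)))
```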
  • FIGs. 8A - 8C show front views of a user 802 when consuming spatial audio using a pair of earphones 804, in which the represented audio scene comprises first to fourth audio sources 810, 812, 814, 816.
  • FIG. 8A shows the user 802 in a first orientation and it may be assumed that, in accordance with example embodiments, the first and fourth audio sources 810, 816 correspond to a first sector region (within the meaning of the FIG. 3 operations) and therefore comprise the first subset of the audio sources.
  • the second and third audio sources 812, 814 comprise the second subset of the audio sources.
  • Referring to FIG. 8B, with clockwise movement of the user 802 within a predetermined range of movement, e.g. up to 15 degrees to one side, the first and fourth audio sources 810, 816 are perceived as static, because head tracking is applied, whereas the second and third audio sources 812, 814 move with the user.
  • Referring to FIG. 8C, the same happens for counter-clockwise movement of the user 802.
  • the user 802 is better able to localise the first and fourth audio sources 810, 816.
  • FIG. 9A is the same as FIG. 8A and it may be assumed that, in accordance with example embodiments, the first and fourth audio sources 810, 816 correspond to a first sector region (within the meaning of the FIG. 3 operations) and therefore comprise the first subset of the audio sources.
  • the second and third audio sources 812, 814 comprise the second subset of the audio sources.
  • Referring to FIG. 9B, with rightwards translational movement of the user 802 within a predetermined range of movement, the first and fourth audio sources 810, 816 are perceived as static, whereas the second and third audio sources 812, 814 move rightwards with the user.
  • Referring to FIG. 9C, the same happens for leftwards translational movement of the user 802.
  • the user 802 is again better able to localise the first and fourth audio sources 810, 816.
  • the dashed lines indicate the respective spatial positions of audio sources that would result without head tracking.
  • such partial head tracking may be performed in response to detecting one or more predetermined conditions associated with spatial audio data.
  • spatial audio data may be processed and rendered by default with no head tracking effect or a full head tracking effect unless the one or more predetermined conditions are met, in which case a partial head tracking effect is performed.
  • the one or more predetermined conditions include, but are not limited to, detecting that the spatial audio scene (a) is representative of a particular type of spatial audio scene or content, e.g. a musical performance, (b) is received from a particular source of data, e.g. a music streaming service, and/or (c) has associated metadata indicative that partial head tracking is to be performed.
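  • A minimal sketch of such a condition check is given below; the field names, service identifier and metadata key are assumptions for illustration, not defined by the patent.

```python
def select_tracking_mode(scene_type, source_service, metadata,
                         default_mode="full"):
    """Pick 'none', 'partial' or 'full' head tracking for a spatial scene.

    Partial head tracking is enabled when the content type, its origin, or
    explicit metadata indicates it; otherwise the default mode applies.
    """
    if (scene_type == "musical_performance"
            or source_service == "music_streaming"
            or metadata.get("partial_head_tracking", False)):
        return "partial"
    return default_mode

print(select_tracking_mode("musical_performance", None, {}))           # partial
print(select_tracking_mode("podcast", None, {}, default_mode="none"))  # none
```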
  • FIG. 10 shows an apparatus according to some example embodiments.
  • the apparatus may be configured to perform the operations described herein, for example operations described with reference to any disclosed process.
  • the apparatus comprises at least one processor 1000 and at least one memory 1001 directly or closely connected to the processor.
  • the memory 1001 includes at least one random access memory (RAM) 1001a and at least one read-only memory (ROM) 1001b.
  • Computer program code (software) 1005 is stored in the ROM 1001b.
  • the apparatus may be connected to a transmitter (TX) and a receiver (RX).
  • the apparatus may, optionally, be connected with a user interface (UI) for instructing the apparatus and/or for outputting data.
  • the at least one processor 1000, with the at least one memory 1001 and the computer program code 1005, are arranged to cause the apparatus at least to perform the method according to any preceding process, for example as disclosed in relation to the flow diagrams herein.
  • FIG. 11 shows a non-transitory media 1100 according to some embodiments.
  • the non-transitory media 1100 is a computer readable storage medium. It may be, e.g., a CD, a DVD, a USB stick, a Blu-ray disk, etc.
  • the non-transitory media 1100 stores computer program code, causing an apparatus to perform the method of any preceding process, for example as disclosed in relation to the flow diagram of FIG. 3 and related features thereof.
  • Names of network elements, protocols, and methods are based on current standards. In other versions or other technologies, the names of these network elements and/or protocols and/or methods may be different, as long as they provide a corresponding functionality. For example, embodiments may be deployed in 2G/3G/4G/5G networks and further generations of 3GPP but also in non-3GPP radio networks such as WiFi.
  • a memory may be volatile or non-volatile. It may be, e.g., a RAM, an SRAM, a flash memory, an FPGA block RAM, a DVD, a CD, a USB stick, or a Blu-ray disk.
  • each of the entities described in the present description may be based on different hardware, or some or all of the entities may be based on the same hardware. This does not necessarily mean that they are based on different software. That is, each of the entities described in the present description may be based on different software, or some or all of the entities may be based on the same software.
  • Each of the entities described in the present description may be embodied in the cloud.
  • Implementations of any of the above described blocks, apparatuses, systems, techniques or methods include, as non-limiting examples, implementations as hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Some embodiments may be implemented in the cloud.

Abstract

An apparatus, method and computer program are disclosed relating to audio processing. In an example embodiment, the method may comprise tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space. The method may also comprise identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region. Responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, the method may also comprise modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.

Description

    Technical Field
  • Example embodiments relate to audio processing, for example audio processing which modifies respective perceived spatial positions of audio sources in a spatial audio scene to counter movement of a user position.
  • Background
  • Spatial audio refers to audio which, when output to a user device such as a pair of earphones, enables a user to perceive one or more audio sources as coming from respective directions with respect to the user's position. For example, one audio source may be perceived as coming from a position in front of the user whereas other audio sources may be perceived as coming from positions to the left and right-hand sides of the user. In spatial audio, the user may perceive such audio sources as coming from positions external to the user's position, in contrast to, for example, stereoscopic audio in which audio is effectively perceived within the user's head and where an audio source may be panned between the ears or, in some cases, played back to one ear only. Spatial audio may therefore provide a more life-like and immersive user experience.
  • Summary
  • The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
  • According to a first aspect, there is described an apparatus comprising means for: tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space; identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • It is to be noted that, in the context of the first subset and the second subset of the audio sources, a subset may comprise one audio source.
  • The first orientation may correspond to the orientation of the user's head and the first region may correspond to a region to the front of the user's head.
  • The first region may comprise a sector to the front of the user's head having a central angle of less than 180 degrees. The central angle may be between 30 and 60 degrees.
  • The first position may correspond to the first orientation and the second position may correspond to a second orientation, wherein the tracking means may be configured to track angular movement between the first and second orientations and the modifying means may be configured to modify the respective perceived spatial positions of the first subset of the audio sources if the tracked angular movement is within a predefined angular range of movement.
  • The predefined angular range may correspond with the central angle of the first region sector.
  • The first and second positions may correspond to respective first and second spatial positions of the real-world space, wherein the tracking means may be configured to track translational movement between the first and second spatial positions and the modifying means may be configured to modify the respective perceived spatial positions of the first subset of the audio sources if the tracked translational movement is within a predefined translational range of movement.
  • The modifying means may be further configured, responsive to the tracked movement going beyond the predefined range of movement, to disable further modification of the respective perceived spatial positions of the first subset of the audio sources.
  • The apparatus may further comprise means for: determining that, subsequent to movement of the user position from the first position to the second position within the real-world space, the tracked user movement over a predetermined time period is below a predetermined movement amount; and updating, in response to said determination, the first region of the real-world space such that it is defined with respect to the orientation of the user at the second position.
  • The apparatus may further comprise means for updating the respective perceived spatial positions of the first subset of the audio sources such that they are returned to their previous respective spatial positions in the spatial audio scene with respect to the second subset of the audio sources.
  • The apparatus may further comprise means for: identifying that the first subset of audio sources, comprising a plurality of audio sources, are of interest to the user based on one or more tracked movement characteristics; and responsive to said identification, modifying the respective perceived spatial positions of the first subset of audio sources such that they are perceived as more spatially spread apart, at least temporarily.
  • The identifying means may be configured to identify the one or more tracked movement characteristics as movements that cycle between limits of the predefined range of movements.
  • The amount of modification to spatially spread apart the respective perceived spatial positions may be based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • Limits of the first region and/or the permissible range of movement may be dynamically changeable based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • The data representing the spatial audio scene may be representative of a musical performance.
  • The audio output device may comprise a set of earphones.
  • The modifying means may be configured to perform said modification in response to detecting that the data representing the spatial audio scene is representative of a particular type of spatial audio scene and/or has associated metadata indicative that said modification is to be performed for the spatial audio scene.
  • According to a second aspect, there is described a method comprising: tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space; identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • The first orientation may correspond to the orientation of the user's head and the first region may correspond to a region to the front of the user's head.
  • The first region may comprise a sector to the front of the user's head having a central angle of less than 180 degrees. The central angle may be between 30 and 60 degrees.
  • The first position may correspond to the first orientation and the second position may correspond to a second orientation, wherein the tracking may comprise tracking angular movement between the first and second orientations and the modifying may comprise modifying the respective perceived spatial positions of the first subset of the audio sources if the tracked angular movement is within a predefined angular range of movement.
  • The predefined angular range may correspond with the central angle of the first region sector.
  • The first and second positions may correspond to respective first and second spatial positions of the real-world space, wherein the tracking may comprise tracking translational movement between the first and second spatial positions and the modifying may comprise modifying the respective perceived spatial positions of the first subset of the audio sources if the tracked translational movement is within a predefined translational range of movement. The modifying may further comprise, responsive to the tracked movement going beyond the predefined range of movement, disabling further modification of the respective perceived spatial positions of the first subset of the audio sources.
  • The method may further comprise: determining that, subsequent to movement of the user position from the first position to the second position within the real-world space, the tracked user movement over a predetermined time period is below a predetermined movement amount; and updating, in response to said determination, the first region of the real-world space such that it is defined with respect to the orientation of the user at the second position.
  • The method may further comprise updating the respective perceived spatial positions of the first subset of the audio sources such that they are returned to their previous respective spatial positions in the spatial audio scene with respect to the second subset of the audio sources.
  • The method may further comprise: identifying that the first subset of audio sources, comprising a plurality of audio sources, are of interest to the user based on one or more tracked movement characteristics; and responsive to said identification, modifying the respective perceived spatial positions of the first subset of audio sources such that they are perceived as more spatially spread apart, at least temporarily.
  • The identified one or more tracked movement characteristics may be movements that cycle between limits of the predefined range of movement.
  • The amount of modification to spatially spread apart the respective perceived spatial positions may be based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • Limits of the first region and/or the permissible range of movement may be dynamically changeable based on the distance of at least one of the first subset of the audio sources from the position of the user.
  • The data representing the spatial audio scene may be representative of a musical performance.
  • The audio output device may comprise a set of earphones.
  • Modifying may comprise performing said modification in response to detecting that the data representing the spatial audio scene is representative of a particular type of spatial audio scene and/or has associated metadata indicative that said modification is to be performed for the spatial audio scene.
  • According to a third aspect, there is provided a computer program product comprising a set of instructions which, when executed on an apparatus, is configured to cause the apparatus to carry out the method of any preceding method definition.
  • According to a fourth aspect, there is provided a non-transitory computer readable medium comprising program instructions stored thereon for performing a method, comprising: tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space; identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • The program instructions of the fourth aspect may also perform operations according to any preceding method definition of the second aspect.
  • According to a fifth aspect, there is provided an apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to: track a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space; identify a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modify the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  • The computer program code of the fifth aspect may also perform operations according to any preceding method definition of the second aspect.
  • Brief Description of the Drawings
  • Example embodiments will now be described with reference to the accompanying drawings, in which:
    • FIG. 1 is a block diagram of a system 100 which may be useful for understanding example embodiments;
    • FIGs. 2A - 2C are front views of a user wearing the set of earphones which may be useful for understanding example embodiments;
    • FIG. 3 is a flow diagram showing processing operations according to example embodiments;
    • FIGs. 4A - 4C are top-plan views of a user including an indication of a spatial audio scene comprising a plurality of audio sources and how respective spatial positions of the audio sources may be modified according to some example embodiments;
    • FIGs. 5A - 5B are top-plan views of the user in the FIGs. 4A - 4C spatial audio scene for indicating a reset operation according to some example embodiments;
    • FIGs. 6A - 6C are top-plan views of the user in the FIGs. 4A - 4C spatial audio scene for indicating an audio source spreading effect according to some example embodiments;
    • FIGs. 7A and 7B are top-plan views of the user in the FIGs. 4A - 4C spatial audio scene for indicating another type of audio source spreading effect according to some example embodiments;
    • FIGs. 8A - 8C are front views of a user for indicating how respective spatial positions of the audio sources may be modified due to rotational movement according to some example embodiments;
    • FIGs. 9A - 9C are front views of a user for indicating how respective spatial positions of the audio sources may be modified due to translational movement according to some example embodiments;
    • FIG. 10 is a schematic view of an apparatus in which example embodiments may be embodied; and
    • FIG. 11 is a plan view of a non-transitory medium which may store computer-readable code for causing an apparatus, such as the FIG. 10 apparatus, to perform operations according to example embodiments.
    Detailed Description
  • In the description and drawings, like reference numerals refer to like elements throughout.
  • Example embodiments relate to an apparatus, method and computer program for audio processing, for example audio processing which modifies respective perceived spatial positions of one or more audio sources in a spatial audio scene to counter movement of a user position. More generally, example embodiments relate to audio processing in the field of spatial audio in which data, which may be referred to as spatial audio data, encodes a so-called spatial audio scene. When the spatial audio data is rendered and output as audio to a user device such as a pair of earphones or similar, it enables a user to perceive one or more audio sources as coming from respective positions, e.g. directions, with respect to the user's position. In spatial audio, the user may perceive such audio sources as coming from positions external to the user's position, in contrast to stereoscopic audio in which audio is effectively perceived within the user's head. Spatial audio may therefore provide a more life-like and immersive user experience.
  • Example formats for spatial audio data may include, but are not limited to, multi-channel mixes such as 5.1 or 7.1+4, Ambisonics, parametric spatial audio (e.g., metadata-assisted spatial audio (MASA)), object-based audio, or any combination thereof.
  • Spatial audio rendering may also be characterized in terms of so-called degrees-of-freedom (DoF). For example, if a user's head rotation affects rendering, and therefore output, of the spatial audio, this may be referred to as 3DoF audio rendering. If a change in the user's spatial position also affects rendering, this may be referred to as 6DoF audio rendering. Sometimes, the term 3DoF+ rendering may be used to indicate a limited effect of a user's change in spatial position, for example to account for a limited amount of translational movement of the user's head when the user is otherwise stationary. Taking this type of movement into account is known to improve, for example, the externalization of spatial audio sources.
  • A user's position may be determined in real-time or near real-time, i.e. tracked, using one or more known methods. For example, a user's position may correspond to the position of the user's head. In this sense, the term "position" may refer to orientation, i.e. a first orientation of the user's head is a first position and a second, different orientation of the user's head is a second position. The term may also refer to spatial position within a real-world space, to account for translational movements of the user's head which may or may not accompany translational movements of the user's body.
  • The position of the user's head may be determined using one or more known head-tracking methods, such as by use of one or more cameras which identify facial features in real-time, by use of inertial sensors (gyroscopes/accelerometers) within a head-worn device such as a set of earphones used to output the spatial audio data, or by use of satellite positioning systems such as the Global Navigation Satellite System (GNSS), among other forms of position determination means, to give but some examples.
  • The spatial audio data may be output to a head-worn device such as a pair of earphones, which term is intended to cover devices such as a pair of earbuds, on-ear or over-ear headphones, and also speakers within a worn headset such as an extended reality (XR) headset. In this context, an XR headset may incorporate one or more display screens for presenting video data which may represent part of a virtual video scene. For example, the video data when rendered may present one or more visual objects corresponding to one or more audio sources in the audio scene.
  • In use, a user will be located in a real-world space which may be an indoor space, for example a room or hall, or possibly an outdoor space. The real-world space is to be distinguished over a virtual space that is output to the user device (such as the above-mentioned head-worn device) and therefore perceived by the user.
  • In some example embodiments, a spatial audio scene is a form of virtual space comprising one or more audio sources coming from respective positions with respect to a user's current position in the real-world space. The spatial audio scene may, for example, represent a musical performance and the one or more audio sources may represent different performers and/or instruments that can be perceived from respective spatial positions with respect to the user's current spatial position and orientation in the real-world space. The spatial audio scene is not, however, limited to musical performances.
  • As a user moves within the real-world space, a so-called head-tracking effect may be applied by processing of spatial audio data that represents the spatial audio scene. Head tracking may involve processing and rendering spatial audio data such that respective perceived spatial positions of audio sources are modified to counter user movements. In other words, rather than the respective perceived positions of audio sources remaining fixed in relation to the position of the user's head as it moves (as in the case of stereoscopic audio), the positions are modified in a counter, or opposite, manner; this gives the perception that the audio sources remain in their original positions even though the user has changed position, whether via rotation or translation.
  • For example, if the user rotates their head clockwise by a degrees, the spatial audio data may be modified so that respective perceived spatial positions of audio sources are moved counter-clockwise, for example also by a degrees. Alternatively, or additionally, if the user moves their head in translation, e.g. five centimetres to the left, the spatial audio data may be modified so that the respective perceived spatial positions of audio sources are moved to the right, for example also by five centimetres. In either case, the amount of counter-movement is not necessarily the same as the amount of the tracked user movement.
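  • By way of illustration only, this counter-rotation for the rotational case may be sketched as follows in Python; the function name and the degree convention are assumptions made for the example rather than a definitive implementation of any embodiment.

```python
def counter_rotate(source_azimuths_deg, head_yaw_deg):
    """Full head-tracking effect: shift every perceived source azimuth
    opposite to the tracked head yaw so the sources appear to stay fixed
    in the real-world space. Azimuths are head-relative, in degrees,
    measured clockwise from the user's initial front direction."""
    return [(az - head_yaw_deg) % 360.0 for az in source_azimuths_deg]

# A 30-degree clockwise head rotation moves all sources 30 degrees
# counter-clockwise relative to the head, keeping them static in the room.
print(counter_rotate([0.0, 90.0, 180.0, 270.0], 30.0))
# [330.0, 60.0, 150.0, 240.0]
```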
  • This may provide several benefits for the user experience. For example, the user may be able to determine more precisely the direction of a given audio source in the spatial audio scene. The user may better experience externalization and immersion.
  • However, in certain scenarios such as where the artistic or creative intent for the spatial audio scene is for significant audio sources to be heard from a particular position with respect to the user, e.g. a music performance where significant vocals and/or instruments are to be heard generally to the front of the user, head tracking effects may confuse the experience.
  • Example embodiments are directed at mitigating or avoiding such issues to provide a more optimal user experience when consuming spatial audio.
  • FIG. 1 is a block diagram of a system 100 which may be useful for understanding example embodiments.
  • The system 100 may comprise a server 110, a media player 120, a network 130 and a set of earphones 140.
  • The server 110 may be connected to the media player 120 by means of the network 130 for sending data, e.g., spatial audio data, to the media player 120. The server 110 may, for example, send the data to the media player 120 responsive to one or more data requests sent by the media player 120. For example, the media player 120 may transmit to the server 110 an indication of a position associated with a user of the media player 120, and the server 110 may process and transmit back to the media player 120 spatial audio data responsive to the received position, which may be in real-time or near real-time. This may be by means of any suitable streaming data protocol. Alternatively, or additionally, the server 110 may provide one or more files representing spatial audio data to the media player 120 for storage and processing thereat. At the media player 120, the spatial audio data may be processed, rendered and output to the set of earphones 140. In example embodiments, the set of earphones 140 may comprise head-tracking sensors for indicating to the media player 120, using any suitable method, a current position of the user, e.g., one or both of the orientation and spatial position of the user's head, in order to determine how the spatial audio data is to be rendered and output.
  • In some embodiments, the media player 120 may comprise part of the set of earphones 140.
  • The network 130 may be any suitable data communications network including, for example, one or more of a radio access network (RAN) whereby communication is via one or more base stations, a WiFi network whereby communication is via one or more access points, or a short-range network such as one using the Bluetooth or Zigbee protocol.
  • FIGs. 2A - 2C are representational drawings of a user 210 wearing the set of earphones 140 which may also be useful for understanding example embodiments. Referring to FIG. 2A, the user 210 is shown listening to a rendered audio field comprised of first to fourth audio sources (collectively indicated by reference numeral 220), e.g., corresponding to distinct respective sounds labelled "1", "2", "3" and "4."
  • Referring to FIG. 2B, the respective perceived spatial positions of the first to fourth audio sources 220 are indicated with respect to the user's head for the case that the audio data is spatial audio data but with no head tracking effect implemented. It will be seen, for example, that counter-clockwise rotation of the user's head results in no modification of the spatial audio data because the respective perceived spatial positions of the first to fourth audio sources 220 follow the user's movement.
  • Referring to FIG. 2C, the respective perceived spatial positions of the first to fourth audio sources 220 are indicated with respect to the user's head for the case that the audio data is spatial audio data and the above-mentioned head tracking effect is implemented. It will be seen, in this case, that counter-clockwise rotation of the user's head results in modification of the spatial audio data because the respective perceived spatial positions of the first to fourth audio sources 220 remain relatively static in the audio field even though the user is moving.
  • FIG. 3 is a flow diagram showing processing operations, indicated generally by reference numeral 300, according to example embodiments. The processing operations 300 may be performed in hardware, software, firmware, or a combination thereof. For example, the processing operations may be performed by the media player 120 shown in FIG. 1.
  • A first operation 302 may comprise tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space.
  • A second operation 304 may comprise identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region.
  • It is to be noted that, in the context of the first subset and the second subset of the audio sources, a subset may comprise one audio source.
  • A third operation 306 may comprise, responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
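  • A minimal sketch of the second and third operations is given below, assuming head-relative source azimuths, a 30-degree front sector and a matching 15-degree range of movement; all names and values are illustrative only, not a definitive implementation of the embodiments.

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    azimuth_deg: float  # perceived direction, head-relative, 0 = straight ahead

def wrap_deg(angle):
    """Map an angle into a symmetric range around zero for sector tests."""
    return (angle + 180.0) % 360.0 - 180.0

def identify_subsets(sources, half_angle_deg=15.0):
    """Operation 304: the first subset lies within the front sector
    (here 15 degrees either side of the first orientation), the second outside."""
    first = [s for s in sources if abs(wrap_deg(s.azimuth_deg)) <= half_angle_deg]
    second = [s for s in sources if abs(wrap_deg(s.azimuth_deg)) > half_angle_deg]
    return first, second

def partial_head_tracking(sources, head_yaw_deg, max_yaw_deg=15.0):
    """Operation 306: within the predefined range, counter the head yaw
    for the first subset only; the second subset keeps its head-relative
    positions and therefore moves with the head."""
    first, second = identify_subsets(sources)
    if abs(head_yaw_deg) <= max_yaw_deg:
        for s in first:
            s.azimuth_deg = wrap_deg(s.azimuth_deg - head_yaw_deg)
    return first, second

# Example: a source dead ahead is countered; one at 90 degrees is not.
scene = [Source("vocal", 0.0), Source("guitar", 90.0)]
first, second = partial_head_tracking(scene, head_yaw_deg=10.0)
print([s.azimuth_deg for s in first], [s.azimuth_deg for s in second])
# [-10.0] [90.0]
```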
  • Example embodiments may therefore provide a form of partial head tracking in the sense that, when a user position moves, e.g. from the first position to the second position within the real-world space, the head tracking effect is applied to the first subset of audio sources and not the second subset of audio sources, as will be explained below with the help of visual examples.
  • For example, spatial audio data representing a musical performance with clear left-right balance and separation, with important content in the centre or front, may have this partial head tracking applied to audio sources to the front of the user to allow a degree of immersion and localisation whilst maintaining artistic intent and avoiding confusing effects.
  • In some example embodiments, the first orientation may correspond to the orientation of the user's head and the first region may correspond to a region to the front of the user's head.
  • For example, the first region may comprise a sector to the front of the user's head having a central angle of less than 180 degrees. For example, the central angle may be between 30 and 60 degrees but may vary depending on application.
  • In some example embodiments, the movement of the user position from the first position to the second position may be an angular movement.
  • For example, the first position may correspond to the first orientation and the second position may correspond to a second orientation. Angular movement may be tracked between the first and second orientations and the respective perceived spatial positions of the first subset of the audio sources may be modified if the tracked angular movement is within a predefined angular range of movement.
  • The predefined angular range may correspond with the central angle of the first region sector, e.g. both may be 30 degrees, but they need not be the same.
  • To illustrate, FIGs. 4A - 4C are plan views of a user 402 in real-world space together with an indication of a spatial audio scene that the user perceives through a pair of earphones 404. Referring to FIG. 4A, the spatial audio scene may comprise first to fourth audio sources 410, 412, 414, 416 perceived from respective spatial positions with respect to the user's shown orientation 406.
  • A first region is defined in the form of a sector 408 having, in this example, a central angle of 30 degrees; this angle may also correspond with a predefined angular range of movement through which partial head tracking may be performed. This equates to 15 degrees of angular movement either side of the user's current orientation 406.
  • It will be seen that the first audio source 410 corresponds to the sector 408 and hence head tracking may be performed in respect of this audio source and not performed in respect of the second to fourth audio sources 412, 414, 416.
  • Referring to FIG. 4B, as the tracking operation tracks the user's orientation as their head moves clockwise by 15 degrees, the perceived spatial position of the first audio source 410 is modified so that it is static in the spatial audio scene whereas the respective perceived spatial positions of the second to fourth audio sources 412, 414, 416, outside of the sector 408, are unmodified (in this sense) and move with the user's change in orientation. The user is thus able to better localise the position of the first audio source 410.
  • In the modifying operation, responsive to the tracked movement going beyond the predefined range of movement, which in this example is 15 degrees to one side of the FIG. 4A orientation 406, further modification of the first subset of the audio sources may be disabled. To illustrate, and with reference to FIG. 4C, as the tracking operation tracks a change in the user's orientation by a further 15 degrees, the perceived spatial position of the first audio source 410 moves 15 degrees with the user's change in orientation, as do those of the second to fourth audio sources 412, 414, 416 and hence the spatial relationship between the first to fourth audio sources remains fixed. The permitted 15-degree range of movement is indicated by region 420.
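  • The behaviour of FIGs. 4B and 4C may be sketched as a simple clamp, again under the 15-degree assumption used in the figures: only the portion of the yaw inside the predefined range is countered, so beyond the limit the first subset rotates with the head. Function names are illustrative.

```python
def clamp(value, low, high):
    return max(low, min(high, value))

def rendered_azimuth(source_azimuth_deg, head_yaw_deg, max_yaw_deg=15.0):
    """Counter only the portion of the yaw inside the predefined range;
    beyond it, no further modification is applied to the first subset."""
    return source_azimuth_deg - clamp(head_yaw_deg, -max_yaw_deg, max_yaw_deg)

print(rendered_azimuth(0.0, 10.0))  # -10.0: source held static in the room
print(rendered_azimuth(0.0, 30.0))  # -15.0: compensation frozen at the limit
```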
  • In some example embodiments, after tracked movement of the user position from the first position to the second position, the tracked user movement may be below a predetermined movement amount for a predetermined time period. For example, the user may remain relatively still in the second position, or move by an amount within a predetermined movement threshold, and this may be the case for greater than, say, 10 seconds.
  • In response to identifying such relative stability, the first region of the real-world space may be updated such that it is defined with respect to the orientation of the user at the second position. This may be termed a first region reset.
  • To illustrate, FIG. 5A corresponds with FIG. 4C described above. If the user 402 remains at the shown orientation 502 for a predetermined time period, or moves very little from the shown orientation such that movement is within a predetermined movement amount (e.g. less than 2 degrees) for said predetermined time period (e.g. 10 seconds), a first region reset may be initiated. Referring to FIG. 5B, this may involve moving the perceived spatial position of the first audio source 410 based on the second position, in this case to become aligned with the shown orientation 502. The first region is then updated such that, in this case, it becomes the new front sector 508 having a central angle of 30 degrees. In general, the respective perceived spatial positions of the first subset of the audio sources 410 may be returned to their previous respective spatial positions in the spatial audio scene with respect to the second subset of the audio sources, comprising the second to fourth audio sources 412, 414, 416 such that the scene orientation corresponds to that shown in FIG. 4A, assuming the audio source orientations themselves have not dynamically changed. The new front sector 508 then defines the first subset of the audio sources and the second subset of the audio sources for future partial head tracking operations. The dashed circle 510 indicates the previous perceived spatial position of the first audio source 410.
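  • One possible realisation of the first region reset is a small stability monitor, sketched below with the 2-degree and 10-second example thresholds mentioned above; the class and method names are illustrative assumptions.

```python
import time

class RegionResetMonitor:
    """Tracks whether the user has stayed still long enough to re-centre
    the first region on their current orientation (a 'first region reset')."""

    def __init__(self, movement_threshold_deg=2.0, hold_seconds=10.0):
        self.movement_threshold_deg = movement_threshold_deg
        self.hold_seconds = hold_seconds
        self._anchor_yaw = None
        self._since = None

    def update(self, yaw_deg, now=None):
        """Feed tracked yaw samples; returns the new region centre when a
        reset should fire, otherwise None."""
        now = time.monotonic() if now is None else now
        moved = (self._anchor_yaw is None
                 or abs(yaw_deg - self._anchor_yaw) > self.movement_threshold_deg)
        if moved:
            self._anchor_yaw = yaw_deg  # movement: restart the stability timer
            self._since = now
            return None
        if now - self._since >= self.hold_seconds:
            self._since = now           # fire once, then re-arm
            return yaw_deg              # re-centre the first region here
        return None

monitor = RegionResetMonitor()
print(monitor.update(0.0, now=0.0))    # None: timer armed
print(monitor.update(0.5, now=11.0))   # 0.5: user stayed still, reset fires
```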
  • In some example embodiments, further operations may comprise identifying that the first subset of audio sources, when comprising a plurality of audio sources, are of interest to the user based on one or more tracked movement characteristics, and responsive to said identification, modifying the respective perceived spatial positions of the first subset of audio sources such that they are perceived as more spatially spread apart. In this way, the user may perceive the audio sources better because they are spatially spread apart.
  • In some example embodiments, identifying the first subset of the audio sources as being of interest to the user may comprise identifying one or more tracked movement characteristics as movements that cycle between limits of the predefined range of movement.
  • To illustrate, FIGs. 6A - 6C are plan views of a user 602 in a real-world space together with an indication of a spatial audio scene that the user perceives through a pair of earphones 604. Referring to FIG. 6A, the spatial audio scene may comprise first to sixth audio sources 610, 612, 614, 616, 618, 620 perceived from respective spatial positions with respect to the user's shown orientation 606.
  • A first region is defined in the form of a sector 608 having a central angle of 30 degrees, as before, and this may also correspond with a predefined angular range of movement through which partial head tracking may be performed. This may equate to 15 degrees angular movement either side of the user's current orientation.
  • It will be seen that the first audio source 610 and the second audio source 612 correspond to the sector 608 and hence head tracking may be performed in respect of these audio sources and not performed in respect of the third to sixth audio sources 614, 616, 618, 620.
  • Referring to FIG. 6B, it will be seen that the user 602 may change orientation in a cyclical manner, i.e. first counter-clockwise and then clockwise, within, and possibly slightly beyond, the predefined range of movement.
  • Referring then to FIG. 6C, this may trigger the first audio source 610 and the second audio source 612 to be spatially separated so that they may be perceived better by the user 602. This separation may occur temporarily, e.g. for a predetermined time period. This separation may also allow the predetermined range of movements over which head tracking may be performed to be relaxed, at least temporarily, and in respect of at least one of the first audio source 610 and the second audio source 612.
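  • A simple heuristic for this cyclical-movement trigger might count sweeps of the tracked yaw between the two limits of the predefined range of movement; the margin and sweep-count thresholds in the sketch below are assumptions, not values from the embodiments.

```python
def looks_like_interest_cycling(yaw_samples_deg, limit_deg=15.0,
                                margin_deg=3.0, min_sweeps=2):
    """Return True if the head yaw repeatedly sweeps from near one limit
    of the predefined range to near the other, suggesting the user is
    trying to attend to the sources in the first region."""
    sweeps = 0
    last_side = 0
    for yaw in yaw_samples_deg:
        # +1 near the clockwise limit, -1 near the counter-clockwise limit
        side = 1 if yaw >= limit_deg - margin_deg else (
            -1 if yaw <= -limit_deg + margin_deg else 0)
        if side != 0 and side != last_side:
            if last_side != 0:
                sweeps += 1  # crossed from one limit zone to the other
            last_side = side
    return sweeps >= min_sweeps

# e.g. a right-left-right sweep between the 15-degree limits:
print(looks_like_interest_cycling([0, 14, 7, -13, -15, 2, 14]))  # True
```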
  • In some embodiments, the amount of modification applied in order to spatially spread apart the respective perceived spatial positions of the audio sources in the first sector may be based on the distance of at least one of the audio sources 610, 612 from the position of the user 602.
  • For example, FIG. 7A shows a partial plan view of a user 702 in real-world space together with an indication of part of a spatial audio scene, perceived through a pair of earphones 704. The spatial audio scene may include first, second and third audio sources 710, 712, 714 within the first sector 708. Referring to FIG. 7B, responsive to identifying that the first to third audio sources 710, 712, 714 are of interest, it will be seen that the first audio source 710, which is the closest to the user 702, remains relatively static, whereas the second and third audio sources 712, 714 are spread apart from the first audio source by different amounts based on their respective distance from the user 702.
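  • The distance-dependent spreading of FIGs. 7A and 7B may be sketched by pushing each first-subset audio source away from the subset's mean azimuth in proportion to its distance from the user; the per-metre scale factor below is an illustrative assumption.

```python
def spread_azimuths(sources, spread_deg_per_metre=4.0):
    """Spread sources apart around their mean azimuth, moving distant
    sources more than near ones. 'sources' is a list of
    (azimuth_deg, distance_m) pairs."""
    mean_az = sum(az for az, _ in sources) / len(sources)
    spread = []
    for az, dist in sources:
        direction = 1.0 if az >= mean_az else -1.0
        spread.append((az + direction * spread_deg_per_metre * dist, dist))
    return spread

# The nearest source moves least; farther sources are pushed further apart.
print(spread_azimuths([(0.0, 1.0), (5.0, 2.0), (-4.0, 3.0)]))
# [(-4.0, 1.0), (13.0, 2.0), (-16.0, 3.0)]
```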
  • In some embodiments, identification that the first subset of audio sources are of interest to the user may be an alternative or additional trigger to perform a first region reset as described previously with reference to FIGs. 5A - 5B.
  • In some embodiments, the limits of the first region and the predefined range of movement may be fixed. In some embodiments, the limits of one or both may change, by user input and/or adaptively.
  • For example, a device implementing example embodiments may analyse a complexity and/or spatial closeness of the first subset of audio sources and the second subset of audio sources and then update the limits (and therefore the size) of the first region and/or the predefined range of movement based on such analysis.
  • Additionally, or alternatively, such limits may be based on values that are part of a description associated with the spatial audio data, e.g., in metadata. This may be based, in the case of a musical performance, on an artist recommendation or service preference settings. The metadata can be time-varying metadata, wherein the limits may vary during the course of the musical performance's runtime. The metadata may comprise context-dependent modifications. For example, a detected user environment or activity can be taken into account when determining the limits to be used. For example, if a user is considered to be focusing on certain audio sources, more freedom for head tracking may be allowed by widening the first region and/or the predefined range of movement.
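  • Such time-varying, metadata-driven limits might be conveyed in a payload of the following shape; every field name here is invented for illustration and does not correspond to any standardised metadata format.

```python
# A hypothetical metadata payload accompanying the spatial audio data.
scene_metadata = {
    "partial_head_tracking": True,
    "first_region_central_angle_deg": 30,
    "predefined_yaw_range_deg": 15,
    "limit_schedule": [                         # time-varying limits
        {"from_s": 0.0, "yaw_range_deg": 15},
        {"from_s": 120.0, "yaw_range_deg": 30},  # e.g. a solo section
    ],
}

def yaw_range_at(metadata, t_seconds):
    """Pick the yaw range in force at playback time t, falling back to
    the scene-wide default if no schedule entry applies."""
    current = metadata.get("predefined_yaw_range_deg", 15)
    for entry in metadata.get("limit_schedule", []):
        if t_seconds >= entry["from_s"]:
            current = entry["yaw_range_deg"]
    return current

print(yaw_range_at(scene_metadata, 60.0))   # 15
print(yaw_range_at(scene_metadata, 180.0))  # 30
```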
  • User preferences may also be taken into account. For example, a user may select whether to use, at a given time, no head tracking, full head tracking, or partial head tracking as described herein.
  • Example embodiments have so far focussed on 3DoF head tracking. Example embodiments may be extended to account for 3DoF+ in which some amount of translational movement may be additionally or alternatively tracked.
  • To generalise, the first and second positions referred to with regard to the operations in FIG. 3 may correspond to respective first and second spatial positions of the real-world space. Translational movement may be tracked between the first and second spatial positions and the respective perceived spatial positions of the first subset of the audio sources may be modified if the tracked translational movement is within a predefined translational range of movement. For example, the predefined translational range of movement may be up to 10 centimetres but it may be greater.
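  • A one-dimensional sketch of this translational (3DoF+) case, using the 10-centimetre example range mentioned above, is given below; the coordinate convention and function name are assumptions made for the example.

```python
def counter_translate(source_offset_m, head_offset_m, max_offset_m=0.10):
    """Lateral head movement up to the predefined translational range is
    countered for the first subset, so those sources stay put in the room;
    beyond the range the compensation stops increasing."""
    compensated = max(-max_offset_m, min(max_offset_m, head_offset_m))
    return source_offset_m - compensated

print(counter_translate(0.0, 0.05))  # -0.05: source held static
print(counter_translate(0.0, 0.20))  # -0.10: compensation stops at 10 cm
```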
  • In such a case, it will also be appreciated that adaptation of the limits of the first region and/or the predefined range of movement may be based on the distance of at least one of the first subset of the audio sources from the position of the user. Distance information may form part of the metadata within spatial audio data, i.e. indicating respective distances or depths associated with audio sources in the audio scene.
  • FIGs. 8A - 8C show front views of a user 802 when consuming spatial audio using a pair of earphones 804, in which the represented audio scene comprises first to fourth audio sources 810, 812, 814, 816.
  • More specifically, FIG. 8A shows the user 802 in a first orientation and it may be assumed that, in accordance with example embodiments, the first and fourth audio sources 810, 816 correspond to a first sector region (within the meaning of the FIG. 3 operations) and therefore comprise the first subset of the audio sources. The second and third audio sources 812, 814 comprise the second subset of the audio sources. As shown in FIG. 8B, with clockwise movement of the user 802 within a predetermined range of movement, e.g. up to 15 degrees to one side, the first and fourth audio sources 810, 816 are perceived as static, because head tracking is applied, whereas the second and third audio sources 812, 814 move with the user. Referring to FIG. 8C, the same happens for counter-clockwise movement of the user 802. The user 802 is better able to localise the first and fourth audio sources 810, 816.
  • For completeness, a translational movement case will now be shown with reference to FIGs. 9A - 9C. FIG. 9A is the same as FIG. 8A and it may be assumed that, in accordance with example embodiments, the first and fourth audio sources 810, 816 correspond to a first sector region (within the meaning of the FIG. 3 operations) and therefore comprise the first subset of the audio sources. The second and third audio sources 812, 814 comprise the second subset of the audio sources. As shown in FIG. 9B, with rightwards translational movement of the user 802 within a predetermined range of movement, e.g. up to 10 centimetres to one side, the first and fourth audio sources 810, 816 are perceived as static, whereas the second and third audio sources 812, 814 move rightwards with the user. Referring to FIG. 9C, the same happens for leftwards translational movement of the user 802. The user 802 is again better able to localise the first and fourth audio sources 810, 816. In FIGs. 8A - 8C and 9A - 9C, the dashed lines indicate the respective spatial positions of audio sources that would result without head tracking.
  • In some example embodiments, such partial head tracking may be performed in response to detecting one or more predetermined conditions associated with spatial audio data. For example, spatial audio data may be processed and rendered by default with no head tracking effect or a full head tracking effect unless the one or more predetermined conditions are met, in which case a partial head tracking effect is performed. Examples of the one or more predetermined conditions include, but are not limited to, detecting that the spatial audio scene (a) is representative of a particular type of spatial audio scene or content, e.g. a musical performance, (b) is received from a particular source of data, e.g. a music streaming service, and/or (c) has associated metadata indicative that partial head tracking is to be performed.
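  • The gating of partial head tracking on such predetermined conditions might look like the following sketch, in which the parameter and condition values are illustrative stand-ins for examples (a) to (c) above.

```python
def should_apply_partial_tracking(scene_type, data_source, metadata):
    """Enable partial head tracking only when one of the predetermined
    conditions holds; otherwise the default rendering mode is used."""
    if metadata.get("partial_head_tracking"):       # condition (c): metadata flag
        return True
    if scene_type == "musical_performance":         # condition (a): content type
        return True
    if data_source == "music_streaming_service":    # condition (b): data source
        return True
    return False

print(should_apply_partial_tracking("musical_performance", "local_file", {}))
# True
```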
  • Example Apparatus
  • FIG. 10 shows an apparatus according to some example embodiments. The apparatus may be configured to perform the operations described herein, for example operations described with reference to any disclosed process. The apparatus comprises at least one processor 1000 and at least one memory 1001 directly or closely connected to the processor. The memory 1001 includes at least one random access memory (RAM) 1001a and at least one read-only memory (ROM) 1001b. Computer program code (software) 1005 is stored in the ROM 1001b. The apparatus may be connected to a transmitter (TX) and a receiver (RX). The apparatus may, optionally, be connected with a user interface (UI) for instructing the apparatus and/or for outputting data. The at least one processor 1000, with the at least one memory 1001 and the computer program code 1005, is arranged to cause the apparatus at least to perform the method according to any preceding process, for example as disclosed in relation to the flow diagram herein.
  • FIG. 11 shows a non-transitory medium 1100 according to some embodiments. The non-transitory medium 1100 is a computer-readable storage medium. It may be, e.g., a CD, a DVD, a USB stick, a Blu-ray disk, etc. The non-transitory medium 1100 stores computer program code, causing an apparatus to perform the method of any preceding process, for example as disclosed in relation to the flow diagram of FIG. 3 and related features thereof.
  • Names of network elements, protocols, and methods are based on current standards. In other versions or other technologies, the names of these network elements and/or protocols and/or methods may be different, as long as they provide a corresponding functionality. For example, embodiments may be deployed in 2G/3G/4G/5G networks and further generations of 3GPP but also in non-3GPP radio networks such as WiFi.
  • A memory may be volatile or non-volatile. It may be, e.g., a RAM, an SRAM, a flash memory, an FPGA block RAM, a DVD, a CD, a USB stick, or a Blu-ray disk.
  • If not otherwise stated or otherwise made clear from the context, the statement that two entities are different means that they perform different functions. It does not necessarily mean that they are based on different hardware. That is, each of the entities described in the present description may be based on different hardware, or some or all of the entities may be based on the same hardware. It does not necessarily mean that they are based on different software. That is, each of the entities described in the present description may be based on different software, or some or all of the entities may be based on the same software. Each of the entities described in the present description may be embodied in the cloud. Implementations of any of the above-described blocks, apparatuses, systems, techniques or methods include, as non-limiting examples, implementations as hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Some embodiments may be implemented in the cloud.
  • It is to be understood that what is described above is what is presently considered the preferred embodiments. However, it should be noted that the description of the preferred embodiments is given by way of example only and that various modifications may be made without departing from the scope as defined by the appended claims.

Claims (15)

  1. An apparatus comprising means for:
    tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space;
    identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and
    responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
  2. The apparatus of claim 1, wherein the first orientation corresponds to the orientation of the user's head and the first region corresponds to a region to the front of the user's head.
  3. The apparatus of claim 2, wherein the first region comprises a sector to the front of the user's head having a central angle of less than 180 degrees.
  4. The apparatus of claim 3, wherein the central angle is between 30 and 60 degrees.
  5. The apparatus of any preceding claim, wherein the first position corresponds to the first orientation and the second position corresponds to a second orientation, wherein the tracking means is configured to track angular movement between the first and second orientations and the modifying means is configured to modify the respective perceived spatial positions of the first subset of the audio sources if the tracked angular movement is within a predefined angular range of movement.
  6. The apparatus of claim 5 when dependent on claim 3 or claim 4, wherein the predefined angular range corresponds with the central angle of the first region sector.
  7. The apparatus of any of claims 1 to 4, wherein the first and second positions correspond to respective first and second spatial positions of the real-world space, wherein the tracking means is configured to track translational movement between the first and second spatial positions and the modifying means is configured to modify the respective perceived spatial positions of the first subset of the audio sources if the tracked translational movement is within a predefined translational range of movement.
  8. The apparatus of any preceding claim, wherein the modifying means is further configured, responsive to the tracked movement going beyond the predefined range of movement, to disable further modification of the respective perceived spatial positions of the first subset of the audio sources.
  9. The apparatus of any preceding claim, further comprising means for:
    determining that, subsequent to movement of the user position from the first position to the second position within the real-world space, the tracked user movement over a predetermined time period is below a predetermined movement amount; and
    updating, in response to said determination, the first region of the real-world space such that it is defined with respect to the orientation of the user at the second position.
  10. The apparatus of claim 9, further comprising means for updating the respective perceived spatial positions of the first subset of the audio sources such that they are returned to their previous respective spatial positions in the spatial audio scene with respect to the second subset of the audio sources.
  11. The apparatus of any preceding claim, further comprising means for:
    identifying that the first subset of audio sources, comprising a plurality of audio sources, are of interest to the user based on one or more tracked movement characteristics; and
    responsive to said identification, modifying the respective perceived spatial positions of the first subset of audio sources such that they are perceived as more spatially spread apart, at least temporarily.
  12. The apparatus of any preceding claim, wherein limits of the first region and/or the permissible range of movement are dynamically changeable based on the distance of at least one of the first subset of the audio sources from the position of the user.
  13. The apparatus of any preceding claim, wherein the audio output device comprises a set of earphones.
  14. The apparatus of any preceding claim, wherein the modifying means is configured to perform said modification in response to detecting that the data representing the spatial audio scene is representative of a particular type of spatial audio scene and/or has associated metadata indicative that said modification is to be performed for the spatial audio scene.
  15. A method, the method comprising:
    tracking a user movement in a real-world space, wherein the user consumes, via an audio output device, data representing a spatial audio scene comprised of a plurality of audio sources perceived from respective spatial positions with respect to a first orientation of the user in the real-world space;
    identifying a first subset of the audio sources having respective perceived spatial positions within a first region of the real-world space defined with respect to the first orientation and a second subset of the audio sources having respective perceived spatial positions outside of the first region; and
    responsive to a tracked movement of the user position from a first position to a second position within the real-world space, the movement being within a predefined range of movement, modifying the respective perceived spatial positions of the first subset of the audio sources in the spatial audio scene so as to counter the movement of the user position without modifying the respective perceived spatial positions of the second subset of the audio sources in the spatial audio scene.
