WO2020148120A2

WO2020148120A2 - Processing audio signals

Info

Publication number: WO2020148120A2
Application number: PCT/EP2020/050282
Authority: WO
Inventors: Antti Johannes Eronen; Lasse Juhani Laaksonen; Jussi Artturi LEPPÄNEN; Arto Juhani Lehtiniemi
Original assignee: Nokia Technologies Oy
Priority date: 2019-01-18
Filing date: 2020-01-08
Publication date: 2020-07-23
Also published as: EP3684083A1

Description

Processing Audio Signals

Field

This specification relates to processing audio signals and, more specifically, to processing audio signals in a virtual audio scene.

Background

Combining audio signals from multiple sound sources of a virtual audio scene is known. However, there remains a need for alternative arrangements for combining audio signals in a virtual sound source in which a user can move relative to the sound sources.

Summary

In a first aspect, this specification describes an apparatus comprising: means for receiving audio data relating to a plurality of audio sound sources of a virtual audio scene, wherein the plurality of audio sound sources comprises one or more first sound sources and one or more second sound sources, wherein each first sound source has a position within the virtual audio scene and each second sound source has an initial position within the virtual audio scene and one or more virtual positions within the virtual audio scene; means for processing audio data of said first sound sources, comprising means for modifying audio data of said first sound sources by an audio gain that is based, at least in part, on a distance function for the respective first sound source; and means for processing audio data of said second sound sources, comprising means for modifying audio data of said second sound sources by an audio gain that is based on a distance function for the respective second sound source, wherein the distance function for each second sound source is based, at least in part, on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene.

In some embodiments, the distance function for each first sound source is based on the distance from the virtual location of the user within the virtual audio scene to the location of the respective first sound source within the virtual audio scene.

Some embodiments comprise means for selecting said selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene by determining the closest of said initial and said one or more virtual positions to said virtual location of the user. Alternatively, or in addition, some embodiments comprise means for selecting said selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene by determining which of said initial and said one or more virtual positions is in a same sector of said virtual audio scene as said virtual location of said user.

The means for processing audio data of said first sound sources may further comprise processing audio data of one or more of the first sound sources based on an orientation of the user relative to the position of the respective audio sound source within the virtual audio scene.

The means for processing audio data of said second sound sources may further comprise processing audio data of one or more of the second sound sources based on an orientation of the user relative to the initial position of the respective second sound source within the virtual audio scene.

The distance function for the respective first sound source and/or the distance function for the respective second sound source may be definable parameter(s) (e.g. user-definable parameters).

The distance function for the respective second sound source may be selected from a plurality of distance functions based, at least in part, on the selected one of the initial and one or more virtual positions of the respective second sound source.

The initial and/or one or more of the virtual positions of said second sound sources maybe provided as metadata of said audio data. Some embodiments further comprise means for determining the virtual location and/or an orientation of the user within the virtual audio scene.

An input maybe provided for receiving the virtual location and/or an orientation of the user within the virtual audio scene.

Some embodiments further comprise means for generating an audio output, including means for combining the processed audio data of said first and second sound sources.

The said means may comprise: at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured, with the at least one processor, to cause the performance of the apparatus. In a second aspect, this specification describes a method comprising: receiving audio data relating to a plurality of audio sound sources of a virtual audio scene, wherein the plurality of audio sound sources comprises one or more first sound sources and one or more second sound sources, wherein each first sound source has a position within the virtual audio scene and each second sound source has an initial position within the virtual audio scene and one or more virtual positions within the virtual audio scene; processing audio data of said first sound sources, comprising modifying audio data of said first sound sources by an audio gain that is based, at least in part, on a distance function for the respective first sound source; and processing audio data of said second sound sources, comprising modifying audio data of one or more of said second sound sources by an audio gain that is based, at least in part, on a distance function for the respective second sound source, wherein the distance function for each second sound source is based on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene.

The distance function for each first sound source may be based on the distance from the virtual location of the user within the virtual audio scene to the location of the respective first sound source within the virtual audio scene. The method may comprise selecting said selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene by

determining the closest of said initial and said one or more virtual positions to said virtual location of the user. Alternatively, or in addition, the method may comprise selecting said selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene by determining which of said initial and said one or more virtual positions is in a same sector of said virtual audio scene as said virtual location of said user.

Processing audio data of said first sound sources may further comprise processing audio data of one or more of the first sound sources based on an orientation of the user relative to the position of the respective audio sound source within the virtual audio scene.

Processing audio data of said second sound sources may further comprise processing audio data of one or more of the second sound sources based on an orientation of the user relative to the initial position of the respective second sound source within the virtual audio scene. The distance function for the respective first sound source and/or the distance function for the respective second sound source maybe definable parameter(s) (e.g. user-definable parameters). The distance function for the respective second sound source may be selected from a plurality of distance functions based, at least in part, on the selected one of the initial and one or more virtual positions of the respective second sound source.

The initial and/ or one or more of the virtual positions of said second sound sources may be provided as metadata of said audio data.

Some embodiments further comprise determining the virtual location and/or an orientation of the user within the virtual audio scene. Some embodiments further comprise generating an audio output, including means for combining the processed audio data of said first and second sound sources.

In a third aspect, this specification describes any apparatus configured to perform any method as described with reference to the second aspect.

In a fourth aspect, this specification describes computer-readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform any method as described with reference to the second aspect. In a fifth aspect, this specification describes a computer program comprising instructions for causing an apparatus to perform at least the following: receive audio data relating to a plurality of audio sound sources of a virtual audio scene, wherein the plurality of audio sound sources comprises one or more first sound sources and one or more second sound sources, wherein each first sound source has a position within the virtual audio scene and each second sound source has an initial position within the virtual audio scene and one or more virtual positions within the virtual audio scene; process audio data of said first sound sources, comprising modifying audio data of said first sound sources by an audio gain that is based, at least in part, on a distance function for the respective first sound source; and process audio data of said second sound sources, comprising modifying audio data of one or more of said second sound sources by an audio gain that is based, at least in part, on a distance function for the respective second sound source, wherein the distance function for each second sound source is based on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene.

In a sixth aspect, this specification describes a computer-readable medium (such as a non- transitory computer readable medium) comprising program instructions stored thereon for performing at least the following: receiving audio data relating to a plurality of audio sound sources of a virtual audio scene, wherein the plurality of audio sound sources comprises one or more first sound sources and one or more second sound sources, wherein each first sound source has a position within the virtual audio scene and each second sound source has an initial position within the virtual audio scene and one or more virtual positions within the virtual audio scene; processing audio data of said first sound sources, comprising modifying audio data of said first sound sources by an audio gain that is based, at least in part, on a distance function for the respective first sound source; and processing audio data of said second sound sources, comprising modifying audio data of one or more of said second sound sources by an audio gain that is based, at least in part, on a distance function for the respective second sound source, wherein the distance function for each second sound source is based on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene.

In a seventh aspect, this specification describes an apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to: receive audio data relating to a plurality of audio sound sources of a virtual audio scene, wherein the plurality of audio sound sources comprises one or more first sound sources and one or more second sound sources, wherein each first sound source has a position within the virtual audio scene and each second sound source has an initial position within the virtual audio scene and one or more virtual positions within the virtual audio scene; process audio data of said first sound sources, comprising modifying audio data of said first sound sources by an audio gain that is based, at least in part, on a distance function for the respective first sound source; and process audio data of said second sound sources, comprising modifying audio data of one or more of said second sound sources by an audio gain that is based, at least in part, on a distance function for the respective second sound source, wherein the distance function for each second sound source is based on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene. In an eighth aspect, this specification describes an apparatus comprising: a first input for receiving audio data relating to a plurality of audio sound sources of a virtual audio scene, wherein the plurality of audio sound sources comprises one or more first sound sources and one or more second sound sources, wherein each first sound source has a position within the virtual audio scene and each second sound source has an initial position within the virtual audio scene and one or more virtual positions within the virtual audio scene; a first processor for processing audio data of said first sound sources, comprising means for modifying audio data of said first sound sources by an audio gain that is based, at least in part, on a distance function for the respective first sound source; and a second processor for processing audio data of said second sound sources, comprising means for modifying audio data of said second sound sources by an audio gain that is based on a distance function for the respective second sound source, wherein the distance function for each second sound source is based, at least in part, on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene. The first and second processor may be implemented using the same processor.

Brief description of the drawings

So that the invention may be fully understood, embodiments thereof will now be described with reference to the accompanying drawings, in which:

Fig. l is a block diagram of a virtual reality display system in which example embodiments maybe implemented;

Figs. 2 to 4 show virtual environments demonstrating example uses of the system of Fig. l; Fig. 5 shows a virtual environment demonstrating an aspect of an example embodiment;

Fig. 6 shows a virtual environment demonstrating an aspect of an example embodiment;

Fig. 7 is a flow chart showing an algorithm in accordance with an example embodiment;

Fig. 8 shows a virtual environment demonstrating an aspect of an example embodiment;

Fig. 9 shows a virtual environment demonstrating an aspect of an example embodiment; Fig. to is a flow chart showing an algorithm in accordance with an example embodiment;

Fig. li is a flow chart showing an algorithm in accordance with an example embodiment;

Fig. 12 shows a virtual environment demonstrating an aspect of an example embodiment;

Fig. 13 is a plot showing distance functions in accordance with example embodiments;

Figs. 14 to 17 are block diagrams of systems in accordance with example embodiments; and FIGS. 18A and 18B show tangible media, respectively a removable memory unit and a compact disc (CD) storing computer-readable code which when run by a computer perform operations according to example embodiments. Detailed description

In the description and drawings, like reference numerals refer to like elements throughout.

Virtual reality (VR) is a rapidly developing area of technology in which one or both of video and audio content is provided to a user device. The user device may be provided with a live or stored feed from a content source, the feed representing a virtual reality space or world for immersive output through the user device. In VR systems including audio (with our without visual signals), the audio which may be spatial audio representing captured or composed audio from multiple audio objects. A virtual space or virtual world is any computer- generated version of a space, for example a captured real world space, in which a user can be immersed through a user device such as a virtual reality headset. The virtual reality headset maybe configured to provide one or more of virtual reality video and spatial audio content to the user, e.g. through the use of a pair of video screens and/or headphones. Position and/ or movement of a user device within a virtual environment can enhance an immersive experience. Some virtual reality user devices provide a so-called three degrees of freedom (3D0F) system, in which head movement in the yaw, pitch and roll axes are measured and determine what the user sees and/ or hears. This facilitates the scene remaining largely static in a single location as the user rotates their head. A next stage may be referred to as 3D0F+ which may facilitate limited translational movement in Euclidean space in the range of, e.g. tens of centimetres, around a location. A yet further stage is a six degrees-of-freedom (6D0F) system, where the user is able to freely move in the Euclidean space and rotate their head in the yaw, pitch and roll axes. Volumetric virtual reality content comprises data representing spaces and/ or objects in three-dimensions from all angles, enabling the user to move fully around the spaces and/or objects to view and/or hear them from any angle.

For the avoidance of doubt, references to virtual reality (VR) are also intended to cover related technologies such as mixed reality (MR) and augmented reality (AR) that refers to a real-world view that is augmented by computer-generated sensory input.

Fig. 1 is a schematic illustration of a virtual reality display system 10 which represents example user-end equipment. The virtual reality display system 10 includes a user device in the form of a virtual reality headset 14, for displaying visual data and presenting audio data for a virtual reality space, and a virtual reality media player 12 for rendering visual and audio data on the virtual reality headset 14. In some example embodiments, a separate user control (not shown) maybe associated with the virtual reality display system, e.g. a hand-held controller.

In the context of this specification, a virtual space, world or environment is a computer- generated version of a space, for example a captured real world space, in which a user can be immersed. In some example embodiments, the virtual space may be entirely computer generated. The virtual reality headset 14 may be of any suitable type. The virtual reality headset 14 maybe configured to provide virtual reality video and/or audio content data to a user. As such, the user may be immersed in virtual space.

In the example virtual reality display system 10, the virtual reality headset 14 receives the virtual reality content data from a virtual reality media player 12. The virtual reality media player 12 may be part of a separate device that is connected to the virtual reality headset 14 by a wired or wireless connection. For example, the virtual reality media player 12 may include a games console, or a PC (Personal Computer) configured to communicate visual data to the virtual reality headset 14.

Alternatively, the virtual reality media player 12 may form part of the virtual reality headset

14·

The virtual reality media player 12 may comprise a mobile phone, smartphone or tablet computer configured to play content through its display. For example, the virtual reality media player 12 maybe a touchscreen device having a large display over a major surface of the device, through which video content can be displayed. The virtual reality media player 12 may be inserted into a holder of a virtual reality headset 14. With such virtual reality headsets 14, a smart phone or tablet computer may display visual data which is provided to a user’s eyes via respective lenses in the virtual reality headset 14. The virtual reality audio may be presented, e.g., by loudspeakers that are integrated into the virtual reality headset 14 or headphones that are connected to it. The virtual reality display system 10 may also include hardware configured to convert the device to operate as part of virtual reality display system 10. Alternatively, the virtual reality media player 12 maybe integrated into the virtual reality headset 14. The virtual reality media player 12 maybe implemented in software. In some example embodiments, a device comprising virtual reality media player software is referred to as the virtual reality media player 12.

The virtual reality display system 10 may include means for determining the spatial position of the user and/or orientation of the user’s head. This maybe by means of determining the spatial position and/or orientation of the virtual reality headset 14. Over successive time frames, a measure of movement may therefore be calculated and stored. Such means may comprise part of the virtual reality media player 12. Alternatively, the means may comprise part of the virtual reality headset 14. For example, the virtual reality headset 14 may incorporate motion tracking sensors which may include one or more of gyroscopes, accelerometers and structured light systems. These sensors generate position data from which a current visual field-of-view (FOV) is determined and updated as the user, and so the virtual reality headset 14, changes position and/or orientation. The virtual reality headset 14 may comprise two digital screens for displaying stereoscopic video images of the virtual world in front of respective eyes of the user, and also two headphones, earphone or speakers for delivering audio. The example embodiments herein are not limited to a particular type of virtual reality headset 14.

In some example embodiments, the virtual reality display system 10 may determine the spatial position and/or orientation of the user’s head using the above-mentioned six degrees- of-freedom method. These may include measurements of pitch, roll and yaw and also translational movement in Euclidean space along side-to-side, front-to-back and up-and- down axes.

The virtual reality display system 10 may be configured to display virtual reality content data to the virtual reality headset 14 based on spatial position and/ or the orientation of the virtual reality headset. A detected change in spatial position and/or orientation, i.e. a form of movement, may result in a corresponding change in the visual and/or audio data to reflect a position or orientation transformation of the user with reference to the space into which the visual data is projected. This allows virtual reality content data to be consumed with the user experiencing a 3D virtual reality environment.

In the context of volumetric virtual reality spaces or worlds, a user’s position may be detected relative to content provided within the volumetric virtual reality content, e.g. so that the user can move freely within a given virtual reality space or world, around individual objects or groups of objects, and can view and/ or listen to the objects from different angles depending on the rotation of their head.

Audio data may be provided to headphones provided as part of the virtual reality headset 14. The audio data may represent spatial audio source content. Spatial audio may refer to directional rendering of audio in the virtual reality space or world such that a detected change in the user’s spatial position or in the orientation of their head may result in a corresponding change in the spatial audio rendering to reflect a transformation with reference to the space in which the spatial audio data is rendered. The angular extent of the environment observable or hearable through the virtual reality headset 14 is called the visual or audible field of view (FOV). The actual FOV observed by a user in terms of visuals depends on the inter-pupillary distance and on the distance between the lenses of the virtual reality headset 14 and the user’s eyes, but the FOV can be considered to be approximately the same for all users of a given display device when the virtual reality headset is being worn by the user. The audible FOV can be omnidirectional and independent of the visual FOV in terms of the angular direction (yaw, pitch, roll), or it can relate to the visual FOV.

Fig. 2 shows a virtual environment, indicated generally by the reference numeral 20, that maybe implemented using the virtual reality display system 10. The virtual environment 20 shows a user 22 and first to fourth sound sources 24 to 27. The user 22 may be wearing the virtual reality headset 14 described above.

The virtual environment 20 is a virtual audio scene and the user 22 has a position and an orientation within the scene. The audio presented to the user 22 (e.g. using the virtual reality headset 14) is dependent on the position and orientation of the user 22, such that a 6D0F audio scene is provided.

Fig. 3 shows a virtual environment, indicated generally by the reference numeral 30, in which the orientation of the user 22 has changed relative to the orientation shown in Fig. 2. The user position is unchanged. By changing the presentation to the user 22 of the audio from the sound sources 22 to 27 on the basis of the orientation of the user, an immersive experience can be enhanced.

Fig. 4 shows a virtual environment, indicating generally by the reference numeral 40, in which the position of the user 22 has changed relative to the position shown in Fig. 2 (indicated by the translation arrow 42), but the orientation of the user is unchanged relative to the orientation shown in Fig. 2. By changing the presentation to the user 22 of the audio from the sound sources 22 to 27 on the basis of the position of the user (e.g. by making audio sources louder and less reverberant and the user approaches the audio source in the virtual environment), an immersive experience can be enhanced. Clearly, both the position and the orientation of the user 22 could be changed at the same time. It is also noted that a virtual audio environment can include both diegetic and non- diegetic audio elements, i.e., audio elements that are presented to the user from a static direction/position of the virtual environment during change in user orientation and audio elements that are presented from an unchanged direction/position of the virtual environment regardless of any user rotation.

Fig. 5 shows a virtual environment, indicated generally by the reference numeral 50, demonstrating aspect of an example embodiment. The virtual environment 50 includes a first user position 52a and a plurality of sound sources. The sound sources are provided from different sound source zones. In the example virtual environment 50, a first zone 54, a second zone 55 and a third zone 56 are shown. Two sound sources are shown in each sound source zone (such that the first zone 54 includes sound sources 54a and 54b, the second zone 55 includes sound sources 55a and 55b, and the third zone 56 includes sound sources 56a and

56b). Of course, the number of zones and the number of sound sources within each zone can vary in different example embodiments and it is not essential that each zone includes the same number of sound sources. The virtual environment 50 also includes an anchor sound zone 57 comprising a first anchor sound source 57a and a second anchor sound source 57b. Again, the number and location of anchor sound zones within the virtual environment 50 and the number and location of sound sources within an anchor sound zone can be varied.

Further details regarding anchor sound sources are provided below. The sound sources in the first, second and third zones 54 to 56 are sometimes collectively referred to below as“first sound sources” and sound sources in the anchor source zone are sometimes collectively referred to below as“second sound sources”.

In one example embodiment, the virtual environment 50 represents an audio scene with different subsets of instruments being provided in the different zones 54, 55 and 56. For example, the first zone 54 may include guitar sounds, the second zone 55 may include keyboard and backing vocal sounds, and the third zone 56 may include lead vocal sounds. As the user moves around the virtual environment 50, the relative volumes of the different sound sources changes, thereby providing an immersive effect. For example, with the user at the first user position 52a, each of the sound sources may have a similar volume, but when the user is at a second user position 52b, the audio of the third zone 56 (i.e. lead vocals) may be louder such that user appears to move towards the lead singer within the scene. (Other effects, such as reverberation, may also be adjusted.) The audio provided to the user may also be adjusted based on the orientation of the user within the virtual environment.

The sounds of anchor sound zone 57 may define sounds (e.g. important sounds) that are intended to be heard throughout the virtual environment 50. The sounds of anchor sound zone 57 may be provided close to the centre of the virtual environment 50, although this is not essential to all example embodiments. In the example audio output described above, the first anchor sound source 57a may provide drum sounds and the second anchor sound source 57b may provide bass sounds, such that the anchor sound zone 57 provides drum and bass audio for the virtual environment 50.

Thus, as the user moves from the first user position 52a to the second user position 52b, the sounds from the third zone 56 are accentuated, but the sounds from the anchor sound zone 57 remain strong.

Fig. 6 shows a virtual environment, indicated generally by the reference numeral 60, demonstrating aspect of an example embodiment. The virtual environment 60 includes the first zone 54, second zone 55, third zone 56 and anchor sound zone 57 described above with reference to the virtual environment 50. As with the virtual environment 50, two sound sources are shown in each sound source zone (such that the first zone 54 includes sound sources 54a and 54b, the second zone 55 includes sound sources 55a and 55b, the third zone 56 includes sound sources 56a and 56b, and the anchor sound zone 57 includes anchor sound sources 57a and 57b).

The virtual environment 60 is consumed by a user. The user has a first position 62a close to the anchor zone (and similar to the first user position 52a described above). The user moves (as indicated by arrow 64) to a second position 62b that is within the third zone 56. Thus, as the user moves from the first user position 62a to the second user position 62b, the audio of the third zone 56 (e.g. lead vocals) maybe increased and audio from other zone (including the anchor zone) may be reduced. Indeed, in some embodiments, the sounds from the anchor sound zone 57 may be too quiet when the user is at the second user position 62b, unless the anchor sounds are made so loud that they are too loud with the user in other locations (such as the first position 62a).

In some embodiments, suitable distance attenuation (or for example, no distance

attenuation) differing from distance attenuation used for the first sound sources may be used for the second/anchor sound sources. In addition to gain adjustment, the rendering of the second sound sources may be altered using, e.g., spatial extent processing depending on the user distance. For example, even if the gain of the second sound sources is not reduced greatly due to user distance, increasing the spatial extent of the sound sources can still convey to the user a feeling of immersion and the effect of their action (movement in the virtual environment).

Fig. 7 is a flow chart showing an algorithm, indicated generally by the reference numeral 70, in accordance with an example embodiment. The algorithm 70 is described below with reference to the virtual environment 80 shown in Fig. 8. Fig. 8 shows a virtual environment, indicated generally by the reference numeral 80, demonstrating an aspect of an example embodiment. The virtual environment 80 includes the first, second and third zones 54 to 56 described above, and an anchor zone 85, wherein two sound sources are shown in each sound source zone (such that the first zone 54 includes sound sources 54a and 54b, the second zone 55 includes sound sources 55a and 55b, the third zone 56 includes sound sources 56a and 56b, and the anchor zone 85 includes anchor sound sources 85a and 85b). (Again, the number of sound sources per zone could be varied.

Alternatively, or in addition, the number of zones could be varied.) The virtual environment 80 is consumed by a user. The user has a first position 82a close to the anchor zone. The user moves (as indicated by arrow 84) to a second position 82b that is close to the third zone 56.

The algorithm 70 starts at operation 72 where audio data is received relating to a plurality of audio sound sources of a virtual audio scene (such as the virtual environment 80). The plurality of audio sound sources comprises one or more first sound sources (such as one or more of the sound sources 54a, 54b, 55a, 55b, 56a and 56b of the first to third zones) and one or more second sound sources (e.g. one or more of the anchor sound sources 85a and 85b of the anchor zone 85). Each first sound source has a position within the virtual audio scene. Moreover, each second sound source has an initial position (such as the positions 85a and/or 85b) within the virtual audio scene and one or more virtual positions (such as the positions

86, 87 and/or 88 shown in Fig. 8) within the virtual audio scene.

At operation 73, the first audio sound sources are processed, including modifying audio data of one or more of said first sound sources by an audio gain (e.g. an attenuation) that is based on a distance function for the respective first sound source. For example, a degree of attenuation of each of the first audio sound sources may increase as the user moves away from the respective sound source. As discussed further below, the attenuation may be based on a l/distance function, although many alternative arrangements (including user-definable arrangements) are possible, for example to allow for artistic intent by the creator of the virtual environment to be implemented.

At operation 74, the first audio sound sources are processed, including processing audio data of at least some of the first sound sources based on an orientation of the user relative to the respective audio sound source within the virtual audio scene.

At operation 75, the second sound sources are processed, including modifying audio data of one or more of said second sound sources by an audio gain (e.g. an attenuation) that is based on a distance function for the respective second sound source. For example, a degree of attenuation of each of the second audio sound sources may increase as the user moves away from the respective sound source. As discussed further below, the attenuation may be based on a l/distance function, although many alternative arrangements (including user-definable arrangements) are possible.

As described further below, the distance function for each first sound source (used in the operation 73 described above) may be based on the distance from the virtual location of the user within the virtual audio scene to the location of the respective first sound source within the virtual audio scene. The distance function for each second sound source (referred to in the operation 75 described above) may be based on a distance from a virtual location of the user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene.

At operation 76, the second sound sources are processed, including processing audio data of said second sound sources based on an orientation of the user relative to a position of the respective second sound source within the virtual audio scene. In one embodiment, the operation 76 processes the second sound source based on the orientation of the user relative to the initial position of that second sound source, regardless of whether the initial or the virtual anchor sound source is used in the operation 75. Thus, directionality (operation 76) may be dependent on the position of the relevant initial second/ anchor sound source and the attenuation (operation 75) maybe based on the position of a selected one of the

second/anchor sound source and a virtual second/anchor sound source.

Many variants of the algorithms 70 are possible. For example, the operations 73 to 76 may be performed in a different order and/or some of the operations may be combined. Thus, for example, the operations 73 and 74 may be merged into a single operation and/ or the operations 75 and 76 may be merged into a single operation. In some embodiments, some operations of the algorithm 70 may be omitted. For example, the operations 74 and 76 may be omitted in the event that orientation processing is not provided. For example, a mono or stereo mix of a multi-track audio content dependent on user location or distance from at least one reference point can be thus be implemented without explicit head orientation tracking or processing.

One such application is described as follows, where a distance tracking that corresponds to a route is provided, rather than a volumetric virtual environment. In this example application, a user is provided a multi-track audio soundtrack for inspiring background music during a jogging exercise. The audio is presented to the user using a mobile device application and a traditional Bluetooth headphone device that provides no headtracking capability. A GPS tracking of the user along the jogging route is utilized to control the balance of the mixing of the multi-track audio content. In the beginning of the jog, for example, a relaxed acoustic instrumentation (first sound sources) over a drum and base anchor content (second sound sources) is provided. Towards the middle of the jogging route, the acoustic instrumentation (first sound sources) fades out and it is faded in a more aggressive electric instrumentation (first sound sources) while maintaining the audibility of the second sound sources. As the user approaches the end of the jogging route, vocal tracks (first sound sources) pushing the user towards the finish are added. In the example virtual environment 80 described above, the one or more second sound sources are mapped onto a single virtual position within each zone of the virtual audio scene (the positions 86, 87 and 88). This is not essential to all embodiments. For example, at least some of the one or more second sound sources may be mapped to a subset of said zones. Moreover, multiple virtual positions may be provided within each audio zone, with different second sound sources being mapped to different virtual positions within each audio zone.

By way of example, Fig. 9 shows a virtual environment, indicated generally by the reference numeral 90, demonstrating an aspect of an example embodiment. The virtual environment 90 includes the first zone 54, second zone 55 and third zone 56 described above, and an anchor zone 93, wherein two sound sources are shown in each sound source zone (such that the first zone 54 includes sound sources 54a and 54b, the second zone 55 includes sound sources 55a and 55b, the third zone 56 includes sound sources 56a and 56b, and the anchor zone 93 includes anchor sound sources 93a and 93b). The virtual environment 90 has a user 92. Two virtual anchor sound sources are provided in each of said zones (one for each of the anchor sound sources 93a and 93b). Thus, the first zone 54 includes first and second virtual anchor sound sources 94a and 94b, the second zone 55 includes first and second virtual anchor sound sources 95a and 95b, and the third zone 56 includes first and second virtual anchor sound sources 96a and 96b. Fig. 10 is a flow chart showing an algorithm, indicated generally by the reference numeral too, in accordance with an example embodiment. The algorithm too shows an example implementation of the operation 75 described above in which the second audio is processed based on a selected distance function. Consider the virtual environment 80 described above in which the user is in the first position

82a. As discussed above, the audio data of the anchor sound sources (the anchor sound sources 85a and 85b) are modified based on a distance function for the respective anchor sound source, wherein the distance function is based on a distance from the first position 82a of the user within the virtual environment 80 to the location of a selected one of the initial and one or more virtual positions of the anchor sound sources. Consider, by way of example, the first anchor sound source 85a. The algorithm 100 starts at operation 102, where a distance from the user (the first position 82a) to the initial position of the second sound source (the position 85a) is determined.

At operation 104, the distances from the user (the first position 82a) to the virtual positions of the second sound source (such as the positions 86, 87 and 88) are determined.

At operation 106, the minimum of the distances determined in the operations 102 and 104 above is selected and, at operation 108, the distance function for the operation 75 is set based on the distance selected in operation 106. Thus, in operation 106, said selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene is selected by determining the closest of said initial and said one or more virtual positions to said user position (in this case, the initial position 85a is closest to the first position 82a). The distance function is therefore based on the distance from the first position 82a to the initial position of the first anchor sound source 85a.

Now, consider the virtual environment 80 described above in which the user 82 has moved to the second position 82b. As discussed above, the audio data of the anchor sound sources (the anchor sound sources 85a and 85b) are modified based on a distance function for the respective anchor sound source, wherein the distance function is based on a distance from the second position 82b of the user within the virtual environment 80 to the location of a selected one of the initial and one or more virtual positions of the anchor sound sources. Consider, again, the first anchor sound source 85a. The algorithm too starts at operation 102, where a distance from the user (the second position 82b) to the initial position of the second sound source (the position 85a) is determined.

At operation 104, the distances from the user (the second position 82b) to the virtual positions of the second sound source (such as the positions 86, 87 and 88) are determined. At operation 106, the minimum of the distances determined in the operations 102 and 104 above is selected and, at operation 108, the distance function for the operation 75 is set based on the distance selected in operation 106. Thus, in operation 106, said selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene is selected by determining the closest of said initial and said one or more virtual positions to said user position (in this case, the third virtual position 88 is closest to the second position 82b). The distance function in this instance is therefore based on the distance from the second position 82b to the third virtual position 88 of the first anchor sound source 85a.

The algorithm 100 is not the only mechanism by which the distance function may be set. An alternative arrangement is described below, although yet further alternatives are possible. Fig. 11 is a flow chart showing an algorithm, indicated generally by the reference numeral 110, in accordance with an example embodiment. The algorithm 110 is described with reference to the virtual environment 120 shown in Fig. 12.

Fig. 12 shows a virtual environment, indicated generally by the reference numeral 120, demonstrating an aspect of an example embodiment. The virtual environment 120 includes the first, second and third zones 54 to 56 described above, and an anchor zone 125, wherein the first zone 54 includes first sound sources 54a and 54b and a first virtual second (or anchor) sound source 126, the second zone 55 includes first sound sources 55a and 55b and a second virtual second (or anchor) sound source 127, the third zone 56 includes first sound sources 56a and 56b and a third virtual second (or anchor) sound source 128, and the anchor zone 125 includes second (or anchor) sound sources 125a and 125b. (Again, the number of sound sources per zone could be varied. Alternatively, or in addition, the number of zones could be varied.) The virtual environment 120 is consumed by a user. The user has a first position 122a close to the anchor zone. The user moves (as indicated by arrow 124) to a second position 122b that is close to the third zone 56.

The virtual environment 120 is divided into sectors. A first sector 129a includes the first zone 54, a second sector 129b includes the second zone 55, a third sector 129c includes the third zone 56 and a fourth sector i29d includes the anchor zone 125.

Assume that the user is initially in the position 122a. - l8 -

The algorithm no starts at operation 112, where the user position (the position 122a) is determined. At operation 114, the sector in which the user position is located is determined (the sector i29d). At operation 116, the distance function is set based on which of the initial and virtual positions of the second sound source fall within the same sector. In the virtual environment 120, it can be seen that the initial position of the second sound source in the same sector (the sector i29d) as the first position 122a and so the distance function is determined based on the initial position.

Now, assume that the user moves to the position 122b.

The algorithm 110 starts at operation 112, where the user position (the position 122b) is determined. At operation 114, the sector in which the user position is located is determined (the sector i29d). At operation 116, the distance function is set based on which of the initial and virtual positions of the second sound source fall within the same sector. In the virtual environment 120, it can be seen that the initial position of the second sound source in the same sector (the sector i29d) as the second position 122b and so the distance function is determined based on the initial position.

Thus, the algorithm 110 (based on the virtual environment 120) comes to a different conclusion to the algorithm 100 (based on the virtual environment 80) when the user is in the second position 82b (in Fig. 8) and 122b (in Fig. 12).

As suggested above, the processing of the audio signals in the operations 73 and 75 may be based on a i/distance function.

Fig. 13 is a plot, indicated generally by the reference numeral 130, showing distance functions in accordance with example embodiments. The plot 130 includes distance from a sound source plotted on the x-axis and gain plotted on the y-axis. An initial position 131 of a second (or anchor) sound source is plotted, together with a virtual position 132 of the second (or anchor) sound source. By way of example, the position 131 may indicate the position of the second sound source 85a described above and the position 132 may indicate the virtual position 88 of the second sound source.

A first curve 133 plots gain as a function of the distance of a user from the position 131 of the second sound source. When the user is close to the position 131, the gain is high (e.g. 4 in the example of plot 130), but when the user is far from the position 131, the gain is low (below 0.5 in the example of plot 130). A second curve 134 plots gain as a function of the distance of a user from the virtual sound source 132. When the user is close to the position 132, the gain is high (e.g. 4 in the example of plot 130), but when the user is far from the position 132, the gain is low (below 0.5 in the example of plot 130).

In operation 73 of the algorithm 70 described above, a first audio signal is processed based on a distance function. The distance function may, for example, have the form of the first curve 133, such that gain is reduced with distance between the user and the first audio signal. In operation 75 of the algorithm 70 described above, a second audio signal is processed based on a selected distance function. The distance function may, for example, be selected from the first and second curves 133 and 134.

In the plot 130, the curves 133 and 134 have the same maximum gain (4 in the example plot 130). This is not essential. For example, the second sound source at an initial position may have a higher gain than the second sound source at a virtual position. For example, Fig. 13 shows a virtual position 135 of the second sound source. A third curve 136 plots gain as a function of the distance of a user from the position 135. When the user is close to the position 135, the gain is high (e.g. 2.5 in the example of plot 130), but not as high as when the user is close to the position 132.

The distance functions described above are i/distance function; this is not essential to all embodiments. Other distance functions maybe provided. (Thus, for example, the slopes of some or all of the curves shown in Fig. 13 could be different.) Moreover, one or more of said distance functions for the respective first sound source and/ or the distance function for the respective second sound source may be definable parameter(s). Thus, for example, at least some of the curves of the plot 130 may be user-definable (e.g. using a user interface).

Fig. 14 is a block diagram of a system, indicated generally by the reference numeral 140, in accordance with an example embodiment.

The system 140 comprises a parameter selection module 141, a delay line 142, a distance gain function module 143, a filtering module 144, a reverberator 145, a first summing module 146 and a second summing module 147.

The parameter selection module 141 receives user position and orientation data and sound gain limits based on virtual positions and uses these data to provide control signals to the distance gain function module 143 and the filtering module 144. The delay line 142 receives audio input to the system 140 and generates a plurality of outputs having successively greater time delay. The distance gain function module 143 comprises multiple instances of modules implementing a distance function (as discussed, for example, with reference to Fig. 13). Each instance of the gain function module 143 operates on a differently delayed version of the audio input signal.

The filtering module 144 may modifies the outputs of the gain function module 143 based on user head orientation. The filtering module 144 may, for example implement a head related transfer function (HRTF).

The reverberator 145 receives the output of the undelayed instance of the distance gain function module and generates a reverberated version of that signal based on one or more reverberation parameters. The reverberator 145 may, for example, seek to recreate different sound spaces.

Finally, the summing modules 146 and 147 sum the output of the reverberator and the outputs of the filtering module 144 to provide separate left and right audio outputs to a user. The system 140 is provided by way of example only. The skilled person will be aware of many alternative arrangements. For example, at least some of the modules (such as the delay line 142 and / or the reverberator 145 may be omitted and one or more of the modules may be reconfigured). Fig. 15 is a block diagram of a system, indicated generally by the reference numeral 150, in accordance with an example embodiment. The system 150 comprises a head mounted display 152 and a rendering module 154. The rendering module comprises one or more of: memory; one or more processors and/or application logic modules; a graphics rendering module; input and output modules, such as a camera, a display, one or more sensors, audio output and audio playback; an orientation and position sensing module; and a radio module. The rendering module 154 may be implemented using a mobile communication device, such as a mobile phone. The rendering module 154 is provided byway of example only; many modifications, such as the provision of different combinations of modules, could be made. Fig. 16 is a block diagram of a system, indicated generally by the reference numeral 160, in accordance with an example embodiment. The system 160 comprises a decoder 172, a position and orientation module 174 and a rendering module 176. As indicated in Fig. 16, the decoder receives a bitstream. The decoder 172 converts the received bitstream into audio data and audio metadata. The metadata may, for example, include audio position information.

The position and orientation module 174 provides information relating to a position and an orientation of a user within a virtual environment. The rendering module receives the user position and orientation information, the audio data and the audio metadata and renders the audio accordingly. The rendering module 176 may, for example, be implemented using the system 140 described above.

By way of example, the position information for initial and virtual positions of an audio sound source (e.g. as output by the decoder 172) maybe provided in the following format:

<Audio_object_metadata>

</Position>

<Virtual_positions>

< /Virtual_positions>

For completeness, FIG. 17 is a schematic diagram of components of one or more of the example embodiments described previously, which hereafter are referred to generically as processing systems 300. A processing system 300 may have a processor 302, a memory 304 closely coupled to the processor and comprised of a RAM 314 and ROM 312, and, optionally, user input 310 and a display 318. The processing system 300 may comprise one or more network/apparatus interfaces 308 for connection to a network/apparatus, e.g. a modem which may be wired or wireless. Interface 308 may also operate as a connection to other apparatus such as device/apparatus which is not network side apparatus. Thus, direct connection between devices/apparatus without network participation is possible.

The processor 302 is connected to each of the other components in order to control operation thereof. The memory 304 may comprise a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD). The ROM 312 of the memory 314 stores, amongst other things, an operating system 315 and may store software applications 316. The RAM 314 of the memory 304 is used by the processor 302 for the temporary storage of data. The operating system 315 may contain code which, when executed by the processor implements aspects of the algorithms 70, 100 and 110 described above. Note that in the case of small device/apparatus the memory can be most suitable for small size usage i.e. not always hard disk drive (HDD) or solid-state drive (SSD) is used. The processor 302 may take any suitable form. For instance, it may be a microcontroller, a plurality of microcontrollers, a processor, or a plurality of processors.

The processing system 300 may be a standalone computer, a server, a console, or a network thereof. The processing system 300 and needed structural parts may be all inside

device/apparatus such as IoT device/apparatus i.e. embedded to very small size

In some example embodiments, the processing system 300 may also be associated with external software applications. These may be applications stored on a remote server device/apparatus and may run partly or exclusively on the remote server device/apparatus. These applications may be termed cloud-hosted applications. The processing system 300 maybe in communication with the remote server device/apparatus in order to utilize the software application stored there.

FIGS. 18A and 18B show tangible media, respectively a removable memory unit 365 and a compact disc (CD) 368, storing computer-readable code which when run by a computer may perform methods according to example embodiments described above. The removable memory unit 365 may be a memory stick, e.g. a USB memory stick, having internal memory 366 storing the computer- readable code. The memory 366 may be accessed by a computer system via a connector 367. The CD 368 may be a CD-ROM or a DVD or similar. Other forms of tangible storage media may be used. Tangible media can be any device/ apparatus capable of storing data/information which data/information can be exchanged between devices/apparatus/network.

Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a “memory” or“computer-readable medium” may be any non-transitory media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer. Reference to, where relevant,“computer-readable storage medium”,“computer program product”,“tangibly embodied computer program” etc., or a“processor” or“processing circuitry” etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays FPGA, application specify circuits ASIC, signal processing devices/apparatus and other

devices/apparatus. References to computer program, instructions, code etc. should be understood to express software for a programmable processor firmware such as the programmable content of a hardware device/apparatus as instructions for a processor or configured or configuration settings for a fixed function device/apparatus, gate array, programmable logic device/ apparatus, etc.

As used in this application, the term“circuitiy” refers to all of the following: (a) hardware- only circuit implementations (such as implementations in only analogue and/ or digital circuitry) and (b) to combinations of circuits and software (and/ or firmware), such as (as applicable): (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a server, to perform various functions) and (c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above- described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagrams of Figures 7, 10 and 11 are examples only and that various operations depicted therein may be omitted, reordered and/ or combined.

It will be appreciated that the above described example embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present specification.

Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.

Claims

1. An apparatus comprising:

means for receiving audio data relating to a plurality of audio sound sources of a virtual audio scene, wherein the plurality of audio sound sources comprises one or more first sound sources and one or more second sound sources, wherein each first sound source has a position within the virtual audio scene and each second sound source has an initial position within the virtual audio scene and one or more virtual positions within the virtual audio scene;

means for processing audio data of said first sound sources, comprising means for modifying audio data of said first sound sources by an audio gain that is based, at least in part, on a distance function for the respective first sound source; and

means for processing audio data of said second sound sources, comprising means for modifying audio data of said second sound sources by an audio gain that is based on a distance function for the respective second sound source, wherein the distance function for each second sound source is based, at least in part, on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene.

2. An apparatus as claimed in claim 1, wherein the distance function for each first sound source is based on the distance from the virtual location of the user within the virtual audio scene to the location of the respective first sound source within the virtual audio scene.

3. An apparatus as claimed in claim 1 or claim 2, further comprising means for selecting said selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene by determining the closest of said initial and said one or more virtual positions to said virtual location of the user.

4. An apparatus as claimed in any one of claims 1 to 3, further comprising means for selecting said selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene by determining which of said initial and said one or more virtual positions is in a same sector of said virtual audio scene as said virtual location of said user.

5. An apparatus as claimed in any one of the preceding claims, wherein the means for processing audio data of said first sound sources further comprises processing audio data of one or more of the first sound sources based on an orientation of the user relative to the position of the respective audio sound source within the virtual audio scene.

6. An apparatus as claimed in any one of the preceding claims, wherein the means for processing audio data of said second sound sources further comprises processing audio data of one or more of the second sound sources based on an orientation of the user relative to the initial position of the respective second sound source within the virtual audio scene.

7. An apparatus as claimed in any one of the preceding claims, wherein the distance function for the respective first sound source and/ or the distance function for the respective second sound source is/are definable parameter(s).

8. An apparatus as claimed in any one of the preceding claims, wherein the distance function for the respective second sound source is selected from a plurality of distance functions based, at least in part, on the selected one of the initial and one or more virtual positions of the respective second sound source.

9. An apparatus as claimed in any one of the preceding claims, wherein the initial and/or one or more of the virtual positions of said second sound sources are provided as metadata of said audio data.

10. An apparatus as claimed in any one of the preceding claims, further comprising means for determining the virtual location and/or an orientation of the user within the virtual audio scene.

11. An apparatus as claimed in any one of the preceding claims, further comprising an input for receiving the virtual location and/ or an orientation of the user within the virtual audio scene.

12. An apparatus as claimed in any one of the preceding claims, further comprising means for generating an audio output, including means for combining the processed audio data of said first and second sound sources.

13. An apparatus as claimed in any one of the preceding claims, wherein the means comprise:

at least one processor; and

at least one memory including computer program code, the at least one memory and the computer program code configured, with the at least one processor, to cause the performance of the apparatus.

14. A method comprising:

receiving audio data relating to a plurality of audio sound sources of a virtual audio scene, wherein the plurality of audio sound sources comprises one or more first sound sources and one or more second sound sources, wherein each first sound source has a position within the virtual audio scene and each second sound source has an initial position within the virtual audio scene and one or more virtual positions within the virtual audio scene;

processing audio data of said first sound sources, comprising modifying audio data of said first sound sources by an audio gain that is based, at least in part, on a distance function for the respective first sound source; and

processing audio data of said second sound sources, comprising modifying audio data of one or more of said second sound sources by an audio gain that is based, at least in part, on a distance function for the respective second sound source, wherein the distance function for each second sound source is based on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene.

15. Computer-readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform a method of:

processing audio data of said second sound sources, comprising modifying audio data of one or more of said second sound sources by an audio gain that is based on a distance function for the respective second sound source, wherein the distance function for each second sound source is based, at least in part, on a distance from a virtual location of a user within the virtual audio scene to the location of a selected one of said initial and one or more virtual positions for the respective second sound source within the virtual audio scene.