GB2568726A - Object prioritisation of virtual content - Google Patents
- Publication number
- GB2568726A (application GB1719560.3A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- objects
- interaction
- data
- data stream
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/14—Digital output to display device ; Cooperation and interconnection of the display device with other functional units
- G06F3/147—Digital output to display device ; Cooperation and interconnection of the display device with other functional units using display panels
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G5/00—Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
- G09G5/02—Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the way in which colour is displayed
- G09G5/026—Control of mixing and/or overlay of colours in general
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
- H04N21/4728—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/816—Monomedia components thereof involving special video data, e.g 3D video
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2360/00—Aspects of the architecture of display systems
- G09G2360/18—Use of a frame buffer in a display terminal, inclusive of the display panel
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2370/00—Aspects of data communication
- G09G2370/02—Networking aspects
- G09G2370/022—Centralised management of display operation, e.g. in a server instead of locally
Abstract
Method and apparatus, comprising: providing virtual content for transmission to one or more remote user devices, the virtual content comprising a plurality of objects; receiving interaction data from a first remote user device indicative of a user interaction with virtual content and determining, based on the interaction data, a subset of the objects as being one or more interaction objects associated with the user of the remote user device; generating a first data stream including the one or more interaction objects; transmitting the first data stream to a remote user device; generating a second data stream including the remainder of the plurality of objects; and transmitting the second data stream to a remote user device. Also, receiving the first and second data streams and rendering objects from the data streams based on the interaction data.
Description
Object Prioritisation of Virtual Content
Field
This disclosure relates to methods and systems for object prioritisation of virtual content, including, but not necessarily limited to virtual reality, augmented reality and mixed reality content.
Background
Virtual reality (VR) is a rapidly developing area of technology in which audio and/or video content is provided to a user device, such as a headset. As is known, the user device may be provided with a live or stored feed from an audio and/or video content source, the feed representing a virtual reality space or world for immersive output through the user device. In some example embodiments, the audio may be spatial audio. A virtual space or virtual world is any computer-generated version of a space, for example a captured real world space, in which a user can be immersed through the user device. For example, a virtual reality headset may be configured to provide virtual reality video and/or audio content to the user, e.g. through the use of a pair of video screens and/or headphones incorporated within the headset.
Position and/or movement of the user device can enhance the immersive experience. Currently, most virtual reality headsets use so-called three degrees of freedom (3DoF), which means that head movement in the yaw, pitch and roll axes is measured and determines what the user sees and/or hears. This facilitates the scene remaining largely static in a single location as the user rotates their head. A next stage may be referred to as 3DoF+, which may facilitate limited translational movement in Euclidean space in the range of, e.g., tens of centimetres around a location. A yet further stage is a six degrees-of-freedom (6DoF) virtual reality system, where the user is able to freely move in the Euclidean space and rotate their head in the yaw, pitch and roll axes. Six degrees-of-freedom virtual reality systems and methods will enable the provision and consumption of volumetric virtual reality content.
Volumetric virtual reality content comprises data representing spaces and/or objects in three dimensions from all angles, enabling the user to move fully around the spaces and/or objects to view them from any angle. For example, a person or object may be fully scanned and reproduced within a real-world space. When rendered to a virtual reality headset, the user may ‘walk around’ the person or object and view and/or hear them from the front, the sides and from behind. Users may also be able to interact with other objects, for example virtual objects (e.g. a computer-generated person, object or service) or real objects (e.g. another person involved in the same virtual scene).
Generally speaking, immersive systems such as those described are bandwidth intensive, and providing interaction between users and objects (whether captured or artificial) requires low latency to produce a realistic user experience.
For the avoidance of doubt, references to virtual reality (VR) are also intended to cover related technologies such as augmented reality (AR) and mixed reality (MR).
Summary
A first aspect provides a method comprising:
providing virtual content for transmission to one or more remote user devices, the virtual content comprising a plurality of objects;
receiving interaction data from a first remote user device indicative of a user interaction with one or more objects in the virtual content;
determining based on the interaction data a subset of the objects as being one or more interaction objects associated with the user of the remote user device;
generating a first data stream including the one or more interaction objects; transmitting the first data stream to a remote user device;
generating a second data stream including the remainder of the plurality of objects; and transmitting the second data stream to a remote user device.
The first and second data streams may be transmitted using different, first and second transmission channels.
Determining the subset of the objects as being one or more interaction objects may be based on the distance between the user and the subset of the objects, as indicated by the interaction data.
The subset of objects may comprise one or more of the objects indicated by the interaction data as being within a predetermined distance of the user.
Determining the subset of the objects as being one or more interaction objects may be based on a viewing direction of the user relative to the subset of the objects, indicated by the interaction data.
Determining the subset of the objects as being one or more interaction objects may be based on a direction of movement of the user, indicated by the interaction data.
Determining the subset of the objects as being one or more interaction objects may be based on using the interaction data to predict a future interaction between the user of the first remote user device and the subset of the objects.
The first data stream may be transmitted at a higher bit-rate than the second data stream.
The first and second data streams may be transmitted at variable bit-rates based on available bit-rate.
The method may further comprise performing position filtering of one or more of the interaction objects in the first data stream prior to transmitting.
The method may further comprise performing gain cross-fading of one or more of the interaction objects in the first data stream prior to transmitting.
The first and second data streams may comprise audio data, and wherein generating the second data stream may comprise substantially removing audio components corresponding to the one or more interaction objects from the first data stream.
The first and second data streams may be transmitted to the first remote user device.
The first and second data streams may be transmitted to one or more remote user devices different from the first remote user device from which the interaction data was received.
The interaction data from the first remote user device may comprise position data associated with the one or more interaction objects, and the method may further comprise transmitting the position data to one or more remote user devices different from the first user device from which the interaction data was received.
The subset of interaction objects may comprise one or more of an audio object, a visual object, a haptic object and a smell object.
The method may further comprise, prior to transmitting the first and second data streams, transmitting transition streams over both the first and second transmission channels, both of which comprise the subset of interaction objects.
Another aspect provides a method comprising:
receiving a first data stream comprising one or more interaction objects, being a subset of a plurality of objects in a set of virtual content;
receiving a second data stream comprising the remainder of the plurality of objects in the set of virtual content; and rendering objects from both the first and second received data streams based on interaction data from a first user device indicative of a user interaction with the set of virtual content.
The first and second data streams may be received from different, first and second transmission channels.
Rendering objects may be based on interaction data indicative of the distance between the user and the subset of the objects.
The subset of objects may comprise one or more of the objects indicated by the interaction data as being within a predetermined distance of the user.
Rendering objects may be based on interaction data indicative of a viewing direction of the user relative to the subset of the objects.
Rendering objects may be based on interaction data indicative of a direction of movement of the user.
The method may further comprise buffering data from the first data stream for the subsequent rendering operation based on the interaction data being indicative of a predicted future interaction with the one or more interaction objects.
The first data stream may be received at a higher bit-rate than the second data stream.
The method may be performed at the first remote user device.
The method may be performed at one or more remote user devices different from the first user device.
The interaction data from the first user device may comprise position data associated with the one or more interaction objects, and the method may further comprise receiving the position data at one or more remote user devices different from the first user device.
The subset of interaction objects may comprise one or more of an audio object, a visual object, a haptic object and a smell object.
The method may further comprise, prior to receiving the first and second data streams, receiving transition streams which both comprise the subset of interaction objects.
Another aspect provides a computer program comprising instructions that, when executed by a computer, control it to perform the method of any preceding definition.
Another aspect provides an apparatus configured to perform the method of any preceding definition.
Another aspect provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
providing virtual content for transmission to one or more remote user devices, the virtual content comprising a plurality of objects;
receiving interaction data from a first remote user device indicative of a user interaction with virtual content;
determining based on the interaction data a subset of the objects as being one or more interaction objects associated with the user of the remote user device;
generating a first data stream including the one or more interaction objects; transmitting the first data stream to a remote user device;
generating a second data stream including the remainder of the plurality of objects; and transmitting the second data stream to a remote user device.
Another aspect provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
to provide virtual content for transmission to one or more remote user devices, the virtual content comprising a plurality of objects;
to receive interaction data from a first remote user device indicative of a user interaction with virtual content;
to determine based on the interaction data a subset of the objects as being one or more interaction objects associated with the user of the remote user device;
to generate a first data stream including the one or more interaction objects;
to transmit the first data stream to a remote user device;
to generate a second data stream including the remainder of the plurality of objects; and to transmit the second data stream to a remote user device.
Another aspect provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
receiving a first data stream comprising one or more interaction objects, being a subset of a plurality of objects in a set of virtual content;
receiving a second data stream comprising the remainder of the plurality of objects in the set of virtual content;
rendering objects from both the first and second received data streams based on interaction data from a first user device indicative of a user interaction with the set of virtual content.
Another aspect provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
to receive a first data stream comprising one or more interaction objects, being a subset of a plurality of objects in a set of virtual content;
to receive a second data stream comprising the remainder of the plurality of objects in the set of virtual content; and to render objects from both the first and second received data streams based on interaction data from a first user device indicative of a user interaction with the set of virtual content.
Brief Description of the Drawings
The invention will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Figure 1 is an example of an audio capture system which may be used in order to capture audio and/or video signals for processing in accordance with various examples described herein;
Figure 2 is a schematic diagram of a virtual reality processing apparatus in relation to one or more user devices and a communications network in accordance with various examples described herein;
Figures 3a - 3c are schematic views of a virtual space in which a plurality of users and other objects are presented, in different stages of interaction, in accordance with various examples described herein;
Figure 4 is a block diagram of a server and a user rendering device, for operating in accordance with various examples described herein;
Figure 5 is a flow diagram showing processing operations of a method performed at a provider of virtual content, in accordance with various examples described herein;
Figure 6 is a flow diagram showing more detailed processing operations of a method performed at a provider of virtual content, in accordance with various examples described herein;
Figures 7a and 7b are graphical representations of filter characteristics for modifying the orientation of objects in accordance with various examples described herein;
Figure 8a and 8b are schematic representations of down mixing gain characteristics for use in accordance with various examples described herein;
Figures 9a and 9b are schematic representations of processing pipelines for audio content in, respectively, a conventional method and a method in accordance with various examples described herein;
Figures 10a and 10b are schematic representations of processing pipelines for visual content in, respectively, a conventional method and a method in accordance with various examples described herein;
Figure 11 is a flow diagram showing processing operations of a method performed at a receiver of virtual content, in accordance with various examples described herein; and Figure 12 is a schematic view of components of either or both of the server and the user rendering device of Figure 4.
Detailed Description
In the description and drawings, like reference numerals refer to like elements throughout.
Figure 1 is an example of a capture system 1 which may be used in order to capture audio and video signals for processing in accordance with various examples described herein.
In this example, the capture system 1 comprises a spatial capture apparatus 10 configured to capture a spatial audio signal, and one or more additional audio capture devices 12A, 12B, 12C.
The spatial capture apparatus 10 comprises a plurality of audio capture devices 101A, B (e.g. directional or non-directional microphones) which are arranged to capture audio signals which may subsequently be spatially rendered into an audio stream in such a way that the reproduced sound is perceived by a listener as originating from at least one virtual spatial position. Typically, the sound captured by the spatial audio capture apparatus 10 is derived from plural different sound sources which may be at one or more different locations relative to the spatial capture apparatus 10. As the captured spatial audio signal includes components derived from plural different sound sources, it may be referred to as a composite audio signal. Although only two audio capture devices 101A, B are visible in Figure 1, the spatial capture apparatus 10 may comprise more than two devices 101A, B. For instance, in some specific examples, the spatial capture apparatus 10 may comprise eight audio capture devices.
In the example of Figure 1, the spatial capture apparatus 10 is also configured to capture visual content (e.g. video) by way of a plurality of visual content capture devices 102A-G (e.g. cameras). The plurality of visual content capture devices 102A-G of the spatial audio capture apparatus 10 may be configured to capture visual content from various different directions around the apparatus, thereby to provide immersive (or virtual reality) content for consumption by users. In the example of Figure 1, the spatial capture apparatus 10 is a presence-capture device, such as Nokia’s OZO camera, being an array of cameras and microphones. However, as will be appreciated, the spatial capture apparatus 10 may be another type of device and/or may be made up of plural physically separate devices. As will also be appreciated, although the content captured may be suitable for provision as immersive content, it may also be provided in a regular non-VR format, for instance via a smart phone or tablet computer.
As mentioned previously, in the example of Figure 1, the spatial capture system 1 further comprises one or more additional audio capture devices 12A-C. Each of the additional audio capture devices 12A-C may comprise at least one microphone and, in the example of Figure 1, the additional audio capture devices 12A-C are lavalier microphones configured for capture of audio signals derived from an associated user 13A-C. For instance, in Figure 1, each of the additional audio capture devices 12A-C is associated with a different user by being affixed to the user in some way. However, it will be appreciated that, in other examples, the additional audio capture devices 12A-C may take a different form and/or may be located at fixed, predetermined locations within an audio capture environment. The additional audio capture devices 12A-C may be referred to as close-up microphones.
The locations of the additional audio capture devices 12A-C and/or the spatial capture apparatus 10 within the audio capture environment may be known by, or may be determinable by, the capture system 1 (for instance, a virtual reality processing apparatus 14). For instance, in the case of mobile capture devices/apparatuses, the devices/apparatuses may include location determination component for enabling the location of the devices/apparatuses to be determined. In some specific examples, a radio frequency location determination system such as Nokia’s High Accuracy Indoor Positioning may be employed, whereby the additional audio capture devices 12A-C (and in some examples the spatial capture apparatus 10) transmit messages for enabling a location server to determine the location of the additional audio capture devices within the audio capture environment. In other examples, for instance when the additional audio capture devices 12A-C are static, the locations may be pre-stored by an entity which forms part of the capture system 1 (for instance, the virtual reality processing apparatus 14). In some embodiments, the spatial capture system 1 may not include additional audio capture devices 12A-C.
In the example of Figure 1, the capture system 1 further comprises a virtual reality processing apparatus 14. The virtual reality processing apparatus 14 may be a server, or associated with a server, that generates the rendered virtual content and transmits it to users wishing to consume the virtual content through a user device such as a virtual reality headset. The virtual reality processing apparatus 14 may be configured to receive and store signals captured by the spatial capture apparatus 10 and/or the one or more additional audio capture devices 12A-C. The signals may be received at the virtual reality processing apparatus 14 in real-time during capture of the audio and video signals or may be received subsequently, for instance via an intermediary storage device. In such examples, the virtual reality processing apparatus 14 may be local to the audio capture environment or may be geographically remote from the audio capture environment in which the capture apparatus 10 and devices 12A-C are provided. In some examples, the virtual reality processing apparatus 14 may even form part of the spatial capture apparatus 10.
The audio signals received by the virtual reality processing apparatus 14 may comprise a multichannel audio input in a loudspeaker format. Such formats may include, but are not limited to, a stereo signal format, a 4.0 signal format, a 5.1 signal format and a 7.1 signal format. In such examples, the signals captured by the system of Figure 1 may have been pre-processed from their original raw format into the loudspeaker format. Alternatively, in other examples, audio signals received by the virtual reality processing apparatus 14 may be in a multi-microphone signal format, such as a raw eight channel input signal. The raw multi-microphone signals may, in some examples, be pre-processed by the virtual reality processing apparatus 14 using spatial audio processing techniques thereby to convert the received signals to loudspeaker format or binaural format. Down mixing may also be performed to limit the audio signal to, for example, four channel loudspeaker format.
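As a rough illustration of such a down mix, the following sketch maps a raw eight-channel capture to a four-channel loudspeaker format using a fixed gain matrix; the matrix values and channel layout are assumptions for illustration only and are not taken from this disclosure.

```python
import numpy as np

# Hypothetical down-mix: 8 raw microphone channels -> 4 loudspeaker channels.
# In practice the gains would be derived from the microphone array geometry
# and the target loudspeaker layout; the values below are illustrative.
DOWNMIX_MATRIX = np.array([
    # mic:  0    1    2    3    4    5    6    7
    [0.7, 0.5, 0.0, 0.0, 0.3, 0.0, 0.0, 0.2],  # front-left
    [0.0, 0.5, 0.7, 0.0, 0.0, 0.3, 0.2, 0.0],  # front-right
    [0.3, 0.0, 0.0, 0.7, 0.5, 0.0, 0.0, 0.2],  # rear-left
    [0.0, 0.0, 0.3, 0.0, 0.0, 0.7, 0.5, 0.2],  # rear-right
])

def downmix(raw_capture: np.ndarray) -> np.ndarray:
    """Down-mix an (8, n_samples) raw capture to a (4, n_samples) loudspeaker mix."""
    return DOWNMIX_MATRIX @ raw_capture
```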
In some examples, the virtual reality processing apparatus 14 may be configured to mix the signals derived from the one or more additional audio capture devices 12A-C with the signals derived from the spatial capture apparatus 10. For instance, the locations of the additional audio capture devices 12A-C may be utilized to mix the signals derived from the additional audio capture devices 12A-C to the correct spatial positions within the spatial audio derived from the spatial capture apparatus 10. The mixing of the signals by the virtual reality processing apparatus 14 may be partially or fully automated.
The virtual reality processing apparatus 14 may be further configured to perform (or allow performance of) spatial repositioning, within the spatial audio captured by the spatial capture apparatus 10, of the sound sources captured by the additional audio capture devices 12A-C.
Spatial repositioning of sound sources may be performed to enable future rendering in three-dimensional space with free-viewpoint audio, in which a user may choose a new listening position freely. Also, spatial repositioning may be used to separate sound sources, thereby to make them more individually distinct. Similarly, spatial repositioning may be used to emphasize/de-emphasize certain sources in an audio mix by modifying their spatial position. Other uses of spatial repositioning may include, but are certainly not limited to, placing certain sound sources at a desired spatial location, thereby to get the listener’s attention (these may be referred to as audio cues), limiting movement of sound sources to match a certain threshold, and widening the mixed audio signal by widening the spatial locations of the various sound sources. Various techniques for performance of spatial repositioning are known in the art and so will not be described in detail herein. One example of a technique which may be used involves calculating the desired gains for a sound source using Vector Base Amplitude Panning (VBAP) when mixing the audio signals in the loudspeaker signal domain.
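For illustration only, a minimal sketch of a pairwise (two-dimensional) VBAP gain calculation is given below; the loudspeaker angles are assumed and the sketch is not intended as a definitive implementation of the repositioning described above.

```python
import numpy as np

def vbap_pair_gains(source_azimuth_deg, spk_azimuths_deg=(-30.0, 30.0)):
    """Compute 2-D VBAP gains for a source panned between a pair of loudspeakers.

    Returns energy-normalised gains (g1, g2) so that the amplitude-panned source
    is perceived at source_azimuth_deg.
    """
    def unit(az_deg):
        az = np.radians(az_deg)
        return np.array([np.cos(az), np.sin(az)])

    # Columns are the loudspeaker unit vectors; solve L g = p for the gains.
    L = np.column_stack([unit(a) for a in spk_azimuths_deg])
    p = unit(source_azimuth_deg)
    g = np.linalg.solve(L, p)
    g = np.clip(g, 0.0, None)        # negative gains mean the source lies outside the pair
    return g / np.linalg.norm(g)     # energy normalisation

# Example: pan a source 10 degrees off centre within a +/-30 degree stereo pair.
print(vbap_pair_gains(10.0))
```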
Figure 2 is a schematic view of the virtual reality processing apparatus 14 in relation to a network 16, which may be an IP network such as the Internet, and a plurality of remote users 20A - 20C having respective virtual reality user headsets 22A - 22C for consuming virtual reality content. The virtual reality processing apparatus 14 may stream the virtual reality content over multiple transmission channels via the network 16. The remote users 20A - 20C may be co-located or located in separate real-world locations, possibly in different countries. What each remote user 20A - 20C sees and/or hears through the video screens and/or headphones of their respective headsets 22A - 22C is part of a virtual space.
In some embodiments, the users 20A - 20C may interact with objects within the virtual space that is presented to them, whether audio, video or a combination of both. For example, the first user 20A may interact with the second user 20B, which may involve approaching the second user and speaking with them, making contact with them or exchanging an object with them. For example, the first user 20A may interact with one or more virtual objects within the virtual space. A virtual object may be a computer-generated sound or video object, and may represent in virtual form a tangible object that can be touched or moved, or it may represent something non-tangible such as a web-page or service.
In the context of this specification, a virtual space is any computer-generated version of a space, for example the real world space captured using the capture system 1 shown in Figure 1, in which one or more users 20A - 20C can be immersed and may interact. In some example embodiments, the virtual space may be entirely computer-generated, i.e. not captured. The headsets 22A - 22C may be of any suitable type. The headsets 22A - 22C may be configured to provide virtual reality video and/or audio content data to the respective users 20A - 20C. As such, the users may be immersed in the virtual space.
The headsets 22A - 22C may receive the virtual reality content directly from the virtual reality processing apparatus 14, or, in some embodiments, from a separate media player 24 to which the headset is connected. For example, the media player 24 may include a games console, or a personal computer (PC) configured to receive visual and/or audio data from the virtual reality processing apparatus, via the network 16, and communicate this to the headset 22A shown in Figure 2. Alternatively, the media player 24 may form part of the headset 22A. In some examples, the media player 24 may comprise a mobile phone, smartphone or tablet computer configured to play content through its display.
The headsets 22A - 22C may include means for determining the spatial position of the respective users 20A - 20C and/or the orientation of the respective user’s head. In some embodiments, therefore, the headsets 22A - 22C may track movement in six degrees of freedom. Over successive time frames, a measure of movement may be calculated and stored. For example, the headsets 22A - 22C may incorporate motion tracking sensors which may include one or more of gyroscopes, accelerometers and structured light systems. These sensors generate position data from which a current position and visual field-of-view (FOV), in other words a viewport, is determined and updated as the user, and so the headset 22A - 22C, changes position and/or orientation. The headsets 22A - 22C may comprise gaze tracking means used to determine a direction of the user’s gaze, which can be used to determine an object of interest the user is looking at. The headsets 22A - 22C may comprise, or be associated with, other limb tracking means to determine the position or orientation of a limb of the user.
The headsets 22A - 22C will typically comprise two digital screens for displaying stereoscopic video images of the virtual space in front of respective eyes of the user, and also two speakers for delivering audio. The headsets 22A - 22C may comprise one or more cameras. Images from the one or more cameras may be presented to the user through the screens of the headsets 22A - 22C such that the real world environment is displayed to the user in a “see-through mode”, or an augmented reality mode.
The example embodiments herein, which primarily relate to the delivery of virtual reality content, are not limited to a particular type of virtual reality headset 22A - 22C.
In some example embodiments, the headsets 22A - 22C, or one or more systems connected to the headsets, such as the media player 24, may determine the spatial position and/or orientation of the respective users 20A - 20C within the virtual space. These may include measurements of pitch, roll and yaw and also translational movement in Euclidean space along side-to-side, front-to-back and up-and-down axes (i.e. six degrees of freedom).
The headsets 22A - 22C, or one or more systems connected to the headsets, such as the media player 24, may be configured to display virtual reality content data to the headsets based on spatial position and/or the orientation of the respective headset. A detected change in spatial position and/or orientation, i.e. a form of movement, may result in a corresponding change in the visual and/or audio data to reflect a position or orientation transformation of the user 20A - 20C with reference to the space into which the visual and/or audio data is projected. This allows virtual reality content data to be consumed with the user 20A - 20C experiencing a three-dimensional (3D) virtual reality environment.
In the context of volumetric virtual reality spaces, this means that the user’s position can be detected relative to content provided within the volumetric virtual reality content, e.g. so that the user can move freely within a given virtual reality space, around individual objects or groups of objects, and can view the objects from different angles depending on the movement (e.g. rotation and location) of their head in the real world. In some examples, the user may also view and explore a plurality of different virtual reality spaces and move from one virtual reality space to another one.
Audio data may be provided, and may represent spatial audio source content. Spatial audio may refer to directional rendering of audio in the virtual reality space such that a detected change in the user’s spatial position or in the orientation of their head may result in a corresponding change in the spatial audio rendering to reflect a transformation with reference to the space in which the spatial audio data is rendered.
The angular extent of the environment observable or hearable through the respective headsets 22A - 22C is called the visual field of view (FOV). The actual FOV observed or heard by a user depends on the inter-pupillary distance and on the distance between the lenses of the virtual reality headset 22A - 22C and the user’s eyes, but the FOV can be considered to be approximately the same for all users of a given display device when the virtual reality headset is being worn by the user. The portion of virtual reality content that is visible at a given time instant may be called a viewport.
In some embodiments, it may be possible for individual headsets 22A - 22C, or one or more systems connected to the headsets, such as the media player 24, to communicate certain types of data, such as interaction data to be described below, with one another over a point-to-point link (rather than via the virtual reality processing apparatus 14). This may be applicable where two or more users 20A - 20C are located in the same cell of a cellular network or the same local area network (LAN). It may therefore reduce latency, as will be appreciated from the description below.
Embodiments herein are concerned with interaction between one or more users, when immersed in a virtual space, and other virtual content, which can be one or both of visual and audio content. The virtual content may comprise a plurality of objects, which may represent captured items such as other users or physical objects and/or virtual items such as virtual users, computer-generated objects, web-pages etc. The objects may even be haptic objects or smell objects. Each object may be defined in the virtual content data at the virtual reality processing apparatus 14 in terms of its geometry (size, shape, colour, audio, haptic feedback, smell, other properties etc.) and its current, and potentially future, position within the virtual space. Accordingly, the virtual content may comprise virtual reality content.
Generally speaking, an interaction will occur when one user is within a predetermined distance of another object, when the user’s FOV covers another object, when the user’s movement is in the direction of another object, or a combination thereof. An interaction may also occur when the user makes a predetermined gesture in the direction of another object or touches (in the virtual sense) another object. An interaction may also occur when the user speaks or generates sound in response to audio from another object. An interaction may also occur based on predicted future movement, for example if it is predicted that the user will be within a predetermined distance of, or have their FOV cover, another object based on current movement trajectory.
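By way of illustration, one possible way of evaluating these interaction conditions from received interaction data is sketched below; the proximity, field-of-view and prediction thresholds, and the simple linear motion prediction, are assumptions for illustration rather than features required by this disclosure.

```python
import numpy as np

def is_interaction_object(user_pos, user_view_dir, user_velocity, obj_pos,
                          proximity_m=2.0, fov_half_angle_deg=45.0,
                          prediction_horizon_s=1.0):
    """Return True if the object qualifies as an interaction object (OOI) for the user."""
    to_obj = np.asarray(obj_pos, float) - np.asarray(user_pos, float)
    dist = np.linalg.norm(to_obj)
    if dist <= proximity_m:
        return True                                   # within the proximity zone

    # Object covered by the user's field of view?
    direction = to_obj / dist
    view = np.asarray(user_view_dir, float)
    view = view / np.linalg.norm(view)
    angle_deg = np.degrees(np.arccos(np.clip(direction @ view, -1.0, 1.0)))
    if angle_deg <= fov_half_angle_deg:
        return True

    # Predicted future interaction, assuming the current movement continues linearly.
    predicted_pos = np.asarray(user_pos, float) + prediction_horizon_s * np.asarray(user_velocity, float)
    return np.linalg.norm(np.asarray(obj_pos, float) - predicted_pos) <= proximity_m
```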
The nature of interaction is that low latency is preferable to make the virtual experience appear or sound realistic. Thus, as one or more users 20A - 20C move within a virtual space, the rendering of objects based on the virtual content data, including new positions and interactions, will generally introduce latency if all mixing or rendering is network-based, that is, if it is performed entirely at the virtual reality processing apparatus 14. This is because of the round-trip latency, which may introduce significant delay. There will be a finite period of time for the virtual reality processing apparatus 14 to receive the current position and/or FOV and/or interaction data from a user, to generate an updated stream of all re-positioned and/or interacted objects based on the change, and to transmit the generated stream to the relevant user headset. On the other hand, if all object data is streamed to the headsets 22A - 22C, for local mixing or rendering, there is a high bandwidth requirement.
Embodiments herein therefore involve, at a provider of virtual content, receiving interaction data from a first remote user device, e.g. one of the headsets 22A - 22C, indicative of a user interaction with one or more objects of a set of virtual content. Then, based on the interaction data, a subset of the objects is determined as being one or more interaction objects associated with the user of the remote user device. A first data stream including the one or more interaction objects is generated and transmitted to a remote user device using, for example, a first transmission channel. A second data stream including the remainder of the plurality of objects is also generated and transmitted to the remote user device using, for example, a second, different, transmission channel. The use of different channels is, however, not essential.
In this way, a subset of one or more objects that are being interacted with are transmitted separately from the remainder; the subset may be considered prioritised in that the one or more objects therein relate to a signalled interaction and can therefore be rendered locally at the relevant user device, e.g. one of the headsets 22A - 22C, separately from the remainder of the objects. For example, the first stream including the one or more interaction objects may be transmitted at a higher bit-rate than the second stream including the remainder of the objects. The bit-rates may be adjusted to remain within an allowable overall bandwidth. The format of the one or more interaction objects may also be modified to ensure low latency at the relevant headset 22A - 22C.
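A minimal sketch of this object partitioning and bit-rate prioritisation is shown below; the data structures, budget figures and priority share are hypothetical and only illustrate the split into a prioritised first stream and a remainder second stream.

```python
def partition_streams(objects, interaction_ids, total_kbps=10_000, priority_share=0.6):
    """Split objects into a prioritised first stream and a remainder second stream.

    objects:         list of dicts, each with at least an "id" key (hypothetical schema).
    interaction_ids: ids of the objects determined to be interaction objects (OOIs).
    The first stream receives the larger share of the available bit-rate so the
    prioritised content can be delivered and rendered with low latency.
    """
    first = [o for o in objects if o["id"] in interaction_ids]
    second = [o for o in objects if o["id"] not in interaction_ids]
    first_stream = {"objects": first, "bitrate_kbps": int(total_kbps * priority_share)}
    second_stream = {"objects": second, "bitrate_kbps": total_kbps - first_stream["bitrate_kbps"]}
    return first_stream, second_stream
```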
Referring to Figures 3a - 3c, an overview of exemplary virtual reality interaction scenarios 31 is shown. For each, the left-hand view is a perspective view and the right-hand view is a top-plan view.
The examples of Figures 3a - 3c show a plurality of objects. Any of the objects 34, 36, 38, 40 may have one or more representations in one or more modalities, for example one or more audio representations, one or more video representations, one or more haptic representations, and/or one or more olfactory representations. For simplicity, in the disclosed embodiments, a representation of an object is understood as one or more representations, which may also have different modalities. A representation refers to data that enables the headset 22 or other user device to provide the desired experience to the user. For example, a video representation may comprise volumetric video data that allows a headset 22 to visualize an object regardless of the viewing direction and location of the user in the virtual reality space. As an example, a video representation may comprise scalable vector graphics (SVG), point cloud, and/or video frame data. An audio representation may comprise spatial audio data corresponding to a temporal duration. The representation may be delivered with compression coding and/or spatial coding. An audio representation may comprise a spatial audio representation, for example first-order or higher-order Ambisonics, object-based audio, channel-based audio, or parametric immersive audio (e.g. a downmix to stereo plus spatial metadata).
A representation may be transmitted by the virtual reality processing apparatus 14 in a continuous manner (a representation corresponding to a temporal duration) or a discrete manner (a representation corresponding to a longer duration, where the audio representation is played repeatedly until a new representation arrives). Continuous transmission may for example include transmission of a substantially continuous stream of one or more representations to the headset 22. The headset 22 may therefore render at least part of the received representation while receiving further data associated with the representation. A discrete transmission may comprise a representation corresponding to a temporal duration which is smaller than the period for which the said representation is valid, for example a sound or a limited-duration speech segment which is repeated every time a user interacts with the interaction object. A representation may have an associated level of quality, which may be reflected in the amount of information, e.g. the number of bits, needed for transferring the representation from the virtual reality processing apparatus 14 to the headset 22. Therefore, a bit-rate and/or a total number of bits of a continuous or a discrete transmission of a representation may be lower or higher depending on quality or other parameters of the representation. This enables selecting a more bit-rate efficient representation for an interaction object to reduce latency.
Figure 3a shows a virtual space 29 comprising a representation of a first user 30, a second user 32, and third to sixth virtual objects 34, 36, 38, 40, here comprising musicians/performers. They may take any form, however. The first and second users 30, 32 may be captured using, for example, the Figure 1 capture system 1, or they may be captured from separate locations. The third to sixth virtual objects 34, 36, 38, 40 may comprise any type of object, whether visual, audio, or a combination of both, and may be computer-generated.
In some embodiments, a predetermined proximity zone may be associated with each object, including the first and second users 30, 32 and the virtual objects 34, 36, 38, 40. The proximity zones may be two-dimensional or three-dimensional volumetric zones, or a combination thereof, and have a predetermined area or volume. For example, one or more of the proximity zones may extend 2 metres around the position of the relevant object. In Figures 3a - 3c, proximity zones are shown for the second user 32 and the virtual objects 34, 36, 38, 40 for ease of reference.
Figure 3b shows the virtual space 29 at a subsequent time (t+1) in which the first user 30 has moved towards the fifth virtual object 38, and is positioned within its respective proximity zone 38A.
In some embodiments, such positioning within said proximity zone 38A may be indicative of interaction between the first user 30 and the fifth virtual object 38. In respect of the first user 30, the fifth virtual object 38 is therefore determined as an interaction object, or object of interest (OOI).
In other embodiments, a FOV based on orientation or gaze direction which covers said fifth virtual object 38 may be indicative of interaction between the first user 30 and the fifth virtual object 38.
In other embodiments, a movement direction which is towards said fifth virtual object 38 may be indicative of interaction between the first user 30 and the fifth virtual object 38.
In other embodiments, a prediction based on prior movement or a prior FOV may be indicative of a predicted interaction between the first user 30 and the fifth virtual object 38.
A combination of the above may be used to determine interaction between the first user 30 and the fifth virtual object 38, or indeed any other object or combination of objects.
Therefore, in accordance with embodiments, the virtual reality processing apparatus 14 may determine the above one or more OOI, being a subset of all of the objects, for the first user 30 based on, in this case, the positional data received from the user’s associated headset, or other position determining apparatus. Rather than the virtual reality processing apparatus 14 generating an updated stream of data comprising each relative position and the video and/or audio representation of the objects 30, 32, 34, 36, 38, 40 from the first user’s new perspective, a first stream is created including the relative position and the video and/or audio representation of the prioritised content, i.e. the fifth virtual object 38. Another, second stream may comprise the remainder of the objects 30, 32, 34, 36, 40. The first and second streams may be processed and transmitted separately in different channels. In practice, this may involve taking an existing stream and substantially removing the interaction object (OOI) components therefrom, which are placed in the first stream.
The first stream may be transmitted at a higher bit-rate than the second stream. For example, the first stream may be transmitted with different transmission parameters such as, for example, a higher amount of allocated data resources, a higher symbol rate, a higher order of modulation, and/or a lower forward error correction code rate. This avoids latency issues when the OOI (in this case the fifth virtual object 38) is rendered at the user end, i.e. at the headset of the first user 30. Other processing, e.g. position filtering, gain cross-fading, distance rendering modification etc., may be performed by the virtual reality processing apparatus 14 in order to reduce jumping artefacts that may result from treating the prioritised object(s) differently from the remaining objects when combined at the headset or media player associated with the first user 30, in this example. It is also appreciated that the methods described herein are beneficial even if the first and second streams are transmitted at the same bit-rate. This is because processing the interaction objects separately causes a shorter delay before transmission compared to down mixing of the remaining objects.
At the user end, such as at the headset or media player associated with the first user 30, both the first and second streams are rendered to produce combined audio and/or visual content.
In some embodiments, for example when the one or more OOI are predicted based on a prior movement or prior FOV, the first stream may be pre-buffered at the headset or media player associated with the first user 30 to enable a real-time object interaction response. The pre-buffering stores the OOI representation so that when the predicted interaction actually happens, or is detected, the pre-buffered representation is utilised and low latency information is used to render the OOI. This may be especially useful for visual models where more data is involved. The pre-buffered representation may comprise necessary information for rendering the OOI in response to determining one or more parameters associated with an interaction event. For example, the pre-buffered representation may comprise visual and/or auditory representation from different directions towards the OOI such that the object can be locally rendered at the media player in response to detecting a moving rotating interaction by user 30.
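A minimal sketch of such pre-buffering at the user device is given below; the class, method names and renderer interface are hypothetical and serve only to illustrate caching a predicted OOI representation so that only small low-latency updates are needed when the interaction actually occurs.

```python
class InteractionPrebuffer:
    """Cache representations of predicted interaction objects for instant rendering."""

    def __init__(self):
        self._buffered = {}   # object id -> pre-buffered representation data

    def prebuffer(self, obj_id, representation):
        # Store e.g. multi-directional visual/audio data received over the first stream
        # while the interaction is still only predicted.
        self._buffered[obj_id] = representation

    def on_interaction(self, obj_id, low_latency_update, renderer):
        # When the predicted interaction actually occurs, render from the cached
        # representation and apply only the small low-latency update (e.g. pose).
        representation = self._buffered.get(obj_id)
        if representation is None:
            return False            # nothing buffered; fall back to streamed data
        renderer.render(representation, low_latency_update)
        return True
```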
Figure 3c shows the virtual space 29 at a subsequent time (t+2) in which the first user 30 has moved towards the second user 32, and is positioned within its respective proximity zone 32A.
Similar to Figure 3b, such positioning within said proximity zone 32A may be indicative of interaction between the first user 30 and the second user 32. In respect of the first user 30, the second user 32 is therefore an object of interest (OOI) and the same considerations outlined above for the virtual reality processing apparatus 14 and the user-end headset or media player apply. Indeed, corresponding processing may occur in respect of the second user 32 interacting with the first user 30.
In the context of virtual audio content, the virtual reality processing apparatus 14 may store a six degrees-of-freedom volumetric audio representation based on, for example, the audio capture devices 101A, 101B of the spatial capture apparatus 10 shown in Figure 1, as well as the additional audio capture apparatuses 12A, 12B, 12C shown in Figure 1. Computer-generated audio may also be provided. The stored volumetric audio may be down mixed to loudspeaker format. As will be appreciated, the loudspeaker format comprises four audio channels, irrespective of the number of objects represented. This therefore limits the overall bandwidth, but nevertheless latency issues remain due to the round-trip time delays in receiving relative position information of all objects at the virtual reality processing apparatus 14 and sending back to the relevant user device the full down mix. In embodiments herein, the second stream may be considered a partial scene down mix, in that the one or more interaction OOI may be substantially separated or excluded from the down mix using known techniques, and instead included in the first stream for low latency.
Referring back to Figure 1, sound separation may for example involve using the close-up audio signal from an individual audio capture apparatus 12A, 12B, 12C and the spatial audio from the spatial capture apparatus 10 to calculate a room impulse response (RIR) from each source to the spatial capture apparatus 10. The RIR may be applied to each close-up audio signal to create a so-called wet signal which represents the audio as it would sound from the position and listening direction of the spatial capture apparatus 10. These wet versions can be selectively subtracted from the spatial audio from the spatial capture apparatus 10 to generate the partial scene down mix as mentioned above.
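A rough sketch of this wet-signal subtraction is shown below, assuming the room impulse responses have already been estimated; plain numpy convolution stands in for whatever convolution engine a real pipeline would use, so this is illustrative rather than a definitive implementation.

```python
import numpy as np

def partial_scene_downmix(spatial_mix, closeup_signals, rirs):
    """Subtract 'wet' versions of the interaction sources from the spatial mix.

    spatial_mix:     (n_channels, n_samples) spatial audio from the capture apparatus.
    closeup_signals: list of (n_samples,) close-up microphone signals for the OOIs.
    rirs:            per-source list of per-channel impulse responses,
                     rirs[src][ch] is a 1-D array.
    Returns the partial scene down mix with the OOI components substantially removed.
    """
    residual = spatial_mix.astype(float)
    n_samples = residual.shape[1]
    for src, dry in enumerate(closeup_signals):
        for ch in range(residual.shape[0]):
            wet = np.convolve(dry, rirs[src][ch])[:n_samples]   # wet version at the array
            residual[ch, :wet.shape[0]] -= wet                  # selective subtraction
    return residual
```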
Figure 4 is a generalised block diagram of a server 40 and a user rendering device 42, which may respectively represent the virtual reality processing apparatus 14 and the media player or headset associated with a remote user such as the first user 30 shown in Figures 3a - 3c. The server 40 stores, or has access to, the virtual content data representing a virtual space comprised of a plurality of objects, whether captured real objects or virtual objects. The user rendering device 42 comprises one or more sensors 44 for indicating one or more of the user’s position, orientation, gaze direction, movement, gestures and other interaction data indicative of how the user is interacting when consuming the represented virtual space. The interaction data is transmitted by the user rendering device 42 to the server 40 over an uplink channel 45. At the server 40, a mixing module 46 performs six degrees-of-freedom mixing of the objects based on the received interaction data in accordance with methods described herein. As mentioned above, this involves determining a subset of the objects as interaction objects, or OOIs, and generating a first data stream representing the OOIs and a second data stream representing the remainder of the objects. The first data stream is transmitted in a first downlink channel 47 and the second data stream is transmitted in a second downlink channel 48. The first data stream may be transmitted at a higher data rate than the second data stream. At the user rendering device 42, a rendering module 49 may decode and render the first data stream and the second data stream separately and the rendered objects from each stream are output to either or both of a visual display system 50 and an audio output system 51 forming part of the user rendering device 42. Decoding may be performed separately for the separate streams, or a jointly coded stream may be provided. Rendering may comprise separate rendering of the streams, which are then summed, or the streams can be rendered using the same renderer.
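On the rendering side, the separate decode-and-combine step might look like the following sketch; decode and render_audio are placeholders for the device's actual codec and renderer, so this illustrates only the separate handling and summation of the two streams rather than any specific implementation.

```python
def render_frame(first_packet, second_packet, interaction_data, decode, render_audio):
    """Decode both downlink streams separately and sum their rendered audio output.

    The prioritised interaction objects arrive in the first stream and are rendered
    with the freshest interaction data; the remainder of the scene comes from the
    second stream. decode and render_audio are injected placeholder callables.
    """
    ooi_objects = decode(first_packet)
    scene_objects = decode(second_packet)
    return (render_audio(ooi_objects, interaction_data)
            + render_audio(scene_objects, interaction_data))
```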
Figure 5 is a flow diagram illustrating, in accordance with one embodiment, processing operations that may be performed by the server 40, for example by software, hardware or a combination thereof, when run by a processor of the server. Certain operations may be omitted, added to, or changed in order. Numbering of operations is not necessarily indicative of processing order.
A first operation 5.1 may comprise providing virtual content for transmission to one or more remote user devices, the virtual content comprising a plurality of objects.
A further operation 5.2 may comprise receiving interaction data from a first remote user device indicative of a user interaction with virtual content.
A further operation 5.3 may comprise determining based on the interaction data a subset of the objects as being one or more interaction object(s) associated with the user of the remote user device.
A further operation 5.4 may comprise generating a first data stream including the at least one interaction object(s).
A further operation 5.5 may comprise transmitting the first data stream to a remote user device, e.g. using a first transmission channel.
A further operation 5.6 may comprise generating a second data stream including the remainder of the plurality of objects.
A further operation 5.7 may comprise transmitting the second data stream to a remote user device, e.g. using a second, different, transmission channel.
Operations 5.4 and 5.5 may be performed in parallel to operations 5.6 and 5.7, but this is not essential.
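By way of a non-limiting illustration of operations 5.1 - 5.7, the sketch below partitions a scene's objects into an interaction subset and a remainder and returns them as two candidate streams. The class and function names, and the use of a simple distance threshold as the interaction criterion, are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    obj_id: str
    position: tuple[float, float, float]
    payload: bytes  # encoded audio/visual data for the object

def split_streams(objects: list[SceneObject],
                  user_position: tuple[float, float, float],
                  interaction_radius: float = 2.0):
    """Roughly operations 5.3, 5.4 and 5.6: classify objects within the
    interaction radius as OOIs and return (first, second) stream contents."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    first = [o for o in objects
             if dist(o.position, user_position) <= interaction_radius]
    second = [o for o in objects if o not in first]
    return first, second

scene = [SceneObject("drummer", (1.0, 0.0, 0.0), b"..."),
         SceneObject("choir", (8.0, 2.0, 0.0), b"...")]
ooi_stream, downmix_stream = split_streams(scene, user_position=(0.0, 0.0, 0.0))
# ooi_stream would be carried on the first (low-latency) downlink channel,
# downmix_stream on the second.
```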
Figure 6 is a flow diagram illustrating, in accordance with another embodiment, processing operations that may be performed by the server 40, for example by software, hardware or a combination thereof, when run by a processor of the server. Certain operations may be omitted, added to, or changed in order. Numbering of operations is not necessarily indicative of processing order.
The operations 6.1 - 6.5 generally correspond to operations 5.1 - 5.4 and 5.6 respectively. Operations 6.6 - 6.8 are further operations which may be employed additionally, individually or in combination.
In this embodiment, a further operation 6.6 of bit-rate equalisation based on the available bit-rate may be performed. The aim of this operation is to vary the encoding of the first and second downlink streams based on the available bit-rate or bandwidth, to ensure that sufficient bandwidth is available to the first downlink stream carrying the prioritised OOI data. The bit-rate of the second downlink stream may be arranged to be lower than that of the full down mix downlink stream. This may be needed to ensure sufficient quality of the prioritised interaction content without increasing the overall bit-rate (in the case of a bandwidth-constrained network).
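A minimal sketch of the bit-rate equalisation idea of operation 6.6 is given below; the fixed priority share and the floor for the second stream are illustrative assumptions rather than prescribed values.

```python
def allocate_bitrates(available_kbps: float,
                      priority_share: float = 0.6,
                      min_second_kbps: float = 64.0):
    """Give the first (OOI) stream a guaranteed share of the available
    bandwidth and let the second (down mix) stream absorb the cut,
    without exceeding the overall budget."""
    first = available_kbps * priority_share
    second = max(available_kbps - first, min_second_kbps)
    # If the floor for the second stream was hit, trim the first stream
    # so the total still fits the constrained network.
    first = min(first, available_kbps - second)
    return first, second

print(allocate_bitrates(512.0))   # (307.2, 204.8)
print(allocate_bitrates(128.0))   # second stream pinned at its floor
```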
A further operation 6.7 of position filtering and/or gain filtering may be performed. The aim here is to avoid jumping artefacts appearing when the first and second data streams are rendered at the user rendering device 42. In this respect, position filtering may involve using position filter parameters such as a filter angle and orientation. Figure 7a shows an example position filter for smoothing the amount of orientation change between an object and its corresponding change in the down mix. Figure 7b shows an example position filter for smoothing the amount of orientation change between an object in the down mix and its corresponding change when not in the down mix. Figures 7a and 7b illustrate the curve for transforming the orientation of the interaction object while merging it into, and de-merging it from, the down mix. The filter parameters are delivered to the user rendering device to perform the position filtering.
This position filtering may effectively smooth the impact on rendering when an object switches from one stream to another, for example from an audio scene down mix in the second stream to the prioritised object first stream. The saturation point and the filter angle decide the orientation after filtering. This may be useful because the position of the first stream is expected to have less latency compared to the second stream.
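One possible realisation of such a position filter, with a filter angle and a saturation point, is sketched below; the raised-cosine curve and the parameter values are assumptions for illustration and are not intended to reproduce the curves of Figures 7a and 7b.

```python
import math

def filter_orientation(delta_deg: float,
                       filter_angle_deg: float = 30.0,
                       saturation_deg: float = 90.0) -> float:
    """Smooth the orientation change applied to an object while it is
    merged into, or de-merged from, the down mix.  Small changes are
    attenuated; beyond the saturation point the full change is applied."""
    magnitude = abs(delta_deg)
    if magnitude >= saturation_deg:
        return delta_deg                      # saturated: no smoothing
    # Smoothly ramp from 0 at the filter angle up to 1 at saturation.
    t = max(0.0, (magnitude - filter_angle_deg) /
                 (saturation_deg - filter_angle_deg))
    gain = 0.5 - 0.5 * math.cos(math.pi * t)  # raised-cosine ramp
    return delta_deg * gain

for d in (10.0, 45.0, 120.0):
    print(d, "->", round(filter_orientation(d), 1))
```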
In some embodiments, an object orientation information track may also be delivered for representing a visual object (appropriately facing towards the other objects) during the switch from the second stream to the first stream.
Position filtering for audio and/or visual rendering may also be useful when a user moves away from an OOI. The server 40 may detect that an interaction OOI can be included back in the second stream down mix and therefore signals the information about the object/channel to the mixing module 46. Subsequently, the interaction OOI may be merged back into the scene down mix rendering. The distance rendering effect for audio is often more pronounced when the listener is nearer the OOI, and consequently small changes in object position (in the scene down mix versus individual object rendering) can be more noticeable. Accordingly, the distance rendering audio gain may be transitioned with a smoothing filter while moving from the full down mix in the second stream to interaction OOI rendering in the first stream, or vice versa.
For some object types, modification of the timeline at which objects in the respective first and second streams are rendered may be permissible to exploit lower latency interaction, for example by leveraging low latency positioning as well as content from the interaction object. This may introduce the aforementioned jumping effect between the content of the second stream, e.g. the down mix, and the one or more interaction object(s) in the first stream, when received at the user rendering device 42. This may be a more critical issue for certain types of audio content than for visual content (due to the greater sampling frequency of audio compared to video). For example, any sample skip in a concert or singing scenario would have an adverse impact on user experience, but may not be critical for synthetic objects (triggered only during interaction) or for conversational audio between users.
In one embodiment, we propose the following method to make the system cater for such scenarios.
A flag, referred to here as a timeline-modifiability flag, may be signalled for certain objects, e.g. audio objects, in the virtual scene. The timeline-modifiability flag indicates to the user rendering device 42 the possibility of skipping one or more samples in order to leverage full latency reduction in terms of positioning information, as well as that of the one or more interaction objects.
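By way of illustration, the per-object signalling implied by the timeline-modifiability flag might be represented as in the following sketch; the record layout and field names are hypothetical, and the optional de-merge timing fields anticipate the cross-fade case described below.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectSignalling:
    object_id: str
    timeline_modifiable: bool                  # True: renderer may skip samples
    de_merge_timestamp: float | None = None    # T_de-merge, in seconds
    cross_fade_duration: float | None = None   # delta T_de-merge, in seconds

# A live vocal track should keep its timeline intact, so the server would
# also signal when and how long to cross-fade it between streams.
vocal = ObjectSignalling("lead_vocal", timeline_modifiable=False,
                         de_merge_timestamp=12.48, cross_fade_duration=0.5)
# A synthetic interaction sound can simply jump to the low-latency stream.
chime = ObjectSignalling("ui_chime", timeline_modifiable=True)
```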
Two cases relating to use of the timeline-modifiability flag will now be described, by way of example.
Case 1: timeline-modifiability flag is set to TRUE
The server 40 may operate as described above and the content switches from the down mix to the partial down mix, plus the interaction object rendering. As has been explained, this will result in latency reduction for the position of the interaction object as well as the audio from the interaction object.
Case 2: timeline-modifiability flag is set to FALSE
The server 40 may signal a timestamp (T_de-merge) at which to start de-merging the interaction object, as well as the duration of a cross-fade transition (ΔT_de-merge), as shown in Figure 8a. During the ΔT_de-merge period, the gain of the de-merged object goes gradually to zero in the second, down mix stream and, at the same time, the gain of the interaction object in the first stream goes from zero to its final desired gain. According to an embodiment, the gradual increment or decrement is non-linear. The user rendering device 42 may need to wait for the T_de-merge timestamp in the content, and may subsequently perform the gain change for the interaction object as described earlier. This approach results in a constant timeline, but an improved user interaction experience by leveraging low latency positioning information. The merging of the interaction object is implemented in a similar manner, as shown in Figure 8b.
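A minimal sketch of the cross-fade over ΔT_de-merge is given below, using an equal-power (and hence non-linear) gain curve as one possible example of the gradual gain change described above; the exact curves of Figures 8a and 8b are not reproduced.

```python
import math

def cross_fade_gains(t: float, t_de_merge: float, dt_de_merge: float):
    """Return (gain_in_downmix, gain_in_interaction_stream) for an object
    being de-merged, as a function of content time t in seconds."""
    if t <= t_de_merge:
        return 1.0, 0.0                      # still fully in the down mix
    if t >= t_de_merge + dt_de_merge:
        return 0.0, 1.0                      # fully in the interaction stream
    x = (t - t_de_merge) / dt_de_merge       # 0..1 progress through the fade
    # Equal-power curves: non-linear, and their squared sum stays at 1.
    return math.cos(0.5 * math.pi * x), math.sin(0.5 * math.pi * x)

for t in (12.4, 12.6, 12.9, 13.1):
    down, inter = cross_fade_gains(t, t_de_merge=12.5, dt_de_merge=0.5)
    print(f"t={t}: downmix={down:.2f}, interaction={inter:.2f}")
```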
A further operation 6.8 may comprise performing modification of the distance rendering to compensate for positioning latency between a single audio object and the position in the down mix. Distance rendering refers to modifying the perceived distance in accordance with the change in the listener position with respect to the audio object positions. Due to positioning latency, an object that is further away in the down mix might in reality have come up close to the listening point. Operation 6.8 modifies the perceived distance impact to smooth the difference between the distance perceived for an audio source in the down mix (second downlink stream) and in the interaction stream (first downlink stream).
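The distance-gain smoothing of operation 6.8 might be approximated as in the following sketch, in which a simple inverse-distance law and a one-pole smoother are illustrative assumptions.

```python
def distance_gain(distance_m: float, ref_m: float = 1.0) -> float:
    """Simple inverse-distance attenuation, clamped at the reference."""
    return ref_m / max(distance_m, ref_m)

def smoothed_gain(previous_gain: float, target_distance_m: float,
                  alpha: float = 0.1) -> float:
    """Per rendering frame, move the applied gain a fraction alpha of the
    way towards the gain implied by the low-latency object position."""
    target = distance_gain(target_distance_m)
    return previous_gain + alpha * (target - previous_gain)

# The down mix still believes the source is 5 m away, but the listener
# has actually walked up to 1.2 m; ramp the gain rather than jumping.
gain = distance_gain(5.0)
for _ in range(10):
    gain = smoothed_gain(gain, target_distance_m=1.2)
print(round(gain, 3))   # creeps towards 1/1.2, i.e. about 0.833
```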
Figure 9a summarises a conventional audio scene delivery pipeline. The period D_HS1 represents the time required for the server 40 to receive from the user rendering device 42 the position information P_Downmix for the full down mix, i.e. the complete set of data representing the scene from the new position of the user rendering device 42. The period D_DM represents the time required at the server 40 to generate the full down mix. The period D_HS2 represents the time required for the user rendering device 42 to receive the full down mix from the server 40. The period D_BI represents the time required to render, e.g. binauralize, the audio scene from the full down mix.
The pipeline time or delay may be summarised as:
D_Total-Downmix = D_HS1 + D_DM + D_HS2 + D_BI.
Figure 9b summarises an audio scene delivery pipeline according to an embodiment. Based on the interaction objects signalled to the user rendering device 42, the server relatively quickly generates the one or more interaction objects (as one or more streams), which take a shorter time to render at the user rendering device and are transmitted during period D_HS4. A second data stream comprising the remaining objects is also generated and transmitted during period D_BI. Thus, the one or more interaction objects can be rendered with less latency (since they can leverage the interaction object positions directly from the user rendering device).
Here, the pipeline time or delay may be summarised as:
D_Total-Prioritised = D_HS4 + D_BI.
Overall, it will be appreciated that D_Total-Downmix >> D_Total-Prioritised.
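To make the comparison concrete, the short calculation below uses assumed, purely illustrative delay values in milliseconds; it merely demonstrates how removing the server round trip for the interaction objects shortens the prioritised pipeline.

```python
# Assumed delay budget in milliseconds (illustrative values only): uplink of
# position data, down-mix generation, downlink of the full down mix, and
# binauralisation at the user rendering device.
D_HS1, D_DM, D_HS2, D_BI = 40, 60, 80, 10
D_HS4 = 25   # assumed downlink time for the small interaction-object stream

d_total_downmix = D_HS1 + D_DM + D_HS2 + D_BI   # 190 ms
d_total_prioritised = D_HS4 + D_BI              # 35 ms
print(d_total_downmix, d_total_prioritised)
```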
Figures 10a and 10b show similar delivery pipelines for visual scene delivery, the former relating to a conventional method and the latter to methods according to an embodiment. The same or similar principles apply. Individual visual object information, which may comprise the representation as well as the position, is transmitted to the server 40, which creates a 3D volumetric model of the volume of interest. The scene voxelization delay (voxelization being the creation of 3D models consisting of voxels) may be affected by the positioning system delay if the voxelization process utilizes OOI positioning information. The relevant voxels are delivered to the user rendering device 42 depending on the user’s viewpoint. Delay components are dominated by the full scene delivery, due to high bandwidth requirements, and by the scene stitching and voxelization latency. In Figure 10a, D_PL refers to the positioning latency delay, D_SV to the scene voxelization delay, D_FSD to the full scene delivery delay, and D_VR to the viewpoint rendering delay.
The pipeline time or delay may be summarised as:
D_Total-Scene = D_PL + D_SV + D_FSD + D_VR.
In Figure 10b, D_IO refers to the interaction object delay. The visual object rendering is performed based on the visual representation received in the interaction stream, while the position is obtained from the rendering device. This reduces the positioning latency for the interaction object, so the one or more interaction objects can be rendered with less latency.
The pipeline time or delay may be summarised as:
D_Total-Prioritised = D_IO + D_VR.
Overall, it will be appreciated that D_Total-Scene >> D_Total-Prioritised.
In Figures 9b and 10b, the interaction prioritisation signal may comprise information regarding one or more of the interaction objects demerged from the partial down mix, the time stamp relative to the full down mix, and the timeline-modifiability flag. In some embodiments, if a visual object or audio object is a 3D model or finite-time audio content, this information may be sent in the signal.
As mentioned above, in some embodiments, the first stream, i.e. the low-latency interaction stream comprising the subset of interaction objects, may be transmitted over a point-to-point link with the intended receiver object, i.e. the object with which an interaction is taking place, rather than via a network, such as the network 16 shown in Figure 2. This offers particular advantages in terms of reducing latency further by avoiding delays associated with the network 16, in situations where two interacting objects are perhaps real-world objects located in the same cell of a cellular network or in a local network, e.g. a LAN or WLAN.
In such an embodiment, the method outlined in Figure 5 may be modified such that step 5.2 further comprises, or a further step comprises, receiving connectivity information for real objects in the interaction data, comprising one or more of the recipient address, port, time stamp and object identifier. A further step may then comprise signalling the interaction object delivery information to the sender object, e.g. containing one or more of the recipient address, port, time stamp and object identifier. This is to enable the sending object, e.g. the first user 30 shown in Figure 3, to transmit data directly to the second user 32 over a point-to-point link. A similar modification may be applied to the more detailed method outlined in Figure 6.
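The connectivity information mentioned above could be carried in a small record such as the hypothetical one sketched below; the field names and example values are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InteractionDeliveryInfo:
    recipient_address: str   # e.g. an address on the same LAN/WLAN
    port: int
    timestamp: float         # reference time for aligning the streams
    object_id: str           # which interaction object this applies to

info = InteractionDeliveryInfo("192.168.1.42", 5004, 103.25, "user_32_voice")
# The server signals `info` to the sending object, which can then stream the
# interaction object point-to-point instead of via the network 16.
```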
Figure 11 is a flow diagram illustrating, in accordance with one embodiment, processing operations that may be performed by the user rendering device 42 shown in Figure 4, for example by software, hardware or a combination thereof, when run by a processor of the user rendering device. Certain operations may be omitted, added to, or changed in order. Numbering of operations is not necessarily indicative of processing order.
A first operation 11.1 comprises receiving from a first data channel a first data stream comprising one or more interaction object(s), being a subset of a plurality of objects in a set of virtual content.
Another operation 11.2 comprises receiving from a second data channel a second data stream comprising the remainder of the plurality of objects in the set of virtual content.
Another operation 11.3 comprises rendering objects from both the first and second received data streams based on interaction data from a first user device indicative of a user interaction with the set of virtual content.
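A minimal sketch of operations 11.1 - 11.3 at the user rendering device is given below, assuming the option described with reference to Figure 4 in which the two streams are rendered separately and the results summed; the renderer itself is a trivial stand-in and all names are hypothetical.

```python
import numpy as np

def render_objects(objects: list[np.ndarray], gains: list[float]) -> np.ndarray:
    """Stand-in renderer: scale and sum per-object audio buffers."""
    frame = np.zeros_like(objects[0])
    for obj, gain in zip(objects, gains):
        frame += gain * obj
    return frame

def render_frame(first_stream: list[np.ndarray],
                 second_stream: list[np.ndarray],
                 interaction_gains: list[float],
                 downmix_gain: float = 1.0) -> np.ndarray:
    """Operation 11.3: render each stream separately, then sum, so the
    low-latency interaction objects can use the freshest position data."""
    interaction = render_objects(first_stream, interaction_gains)
    remainder = render_objects(second_stream,
                               [downmix_gain] * len(second_stream))
    return interaction + remainder

n = 1024  # samples per rendering frame
out = render_frame([np.random.randn(n)], [np.random.randn(n)], [0.8])
```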
Figure 12 is a schematic diagram of components of either of the server 40 or the user rendering device 42 shown in Figure 4. For ease of explanation, we will assume that the components are those in the server 40, but it will be appreciated that the following is applicable to the user rendering device 42.
The server 40 may have a processor 100, a memory 104 closely coupled to the processor and comprised of a RAM 102 and a ROM 103, and, optionally, hardware keys 106 and a display 108. The server 40 may comprise one or more network interfaces 110 for connection to a network, e.g. a modem which may be wired or wireless.
The processor 100 is connected to each of the other components in order to control operation thereof.
The memory 104 may comprise a non-volatile memory, a hard disk drive (HDD) or a solid state drive (SSD). The ROM 103 of the memory 104 stores, amongst other things, an operating system 112 and may store software applications 114. The RAM 102 of the memory 104 may be used by the processor 100 for the temporary storage of data. The operating system 112 may contain code which, when executed by the processor 100, controls operation of the hardware components of the server 40.
The processor 100 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors and it may comprise processor circuitry.
The server 40 may be a standalone computer, a server, a console, or a network thereof.
In some embodiments, the server 40 may also be associated with external software applications. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications. The server 40 may be in communication with the remote server device in order to utilize the software application stored there.
For the avoidance of doubt, references to virtual reality (VR) are also intended to cover related technologies such as augmented reality (AR) and mixed reality (MR).
Thus, in some embodiments, there are described methods and systems focussed on delivery rather than capture, in which a partial scene down mix and one or more interaction objects are delivered in separate streams, which enables faster rendering than the situation where a full down mix is provided.
It will be appreciated that the above described embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present application.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof, and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.
Claims (36)
1. A method comprising:
providing virtual content for transmission to one or more remote user devices, the virtual content comprising a plurality of objects;
receiving interaction data from a first remote user device indicative of a user interaction with one or more objects in the virtual content;
determining based on the interaction data a subset of the objects as being one or more interaction objects associated with the user of the remote user device;
generating a first data stream including the one or more interaction objects;
transmitting the first data stream to a remote user device;
generating a second data stream including the remainder of the plurality of objects; and
transmitting the second data stream to a remote user device.
2. The method of claim 1, wherein the first and second data streams are transmitted using different, first and second transmission channels.
3. The method of claim 1 or claim 2, wherein determining the subset of the objects as being one or more interaction objects is based on the distance between the user and the subset of the objects, as indicated by the interaction data.
4. The method of claim 3, wherein the subset of objects comprises one or more of the objects indicated by the interaction data as being within a predetermined distance of the user.
5. The method of any preceding claim, wherein determining the subset of the objects as being one or more interaction objects is based on a viewing direction of the user relative to the subset of the objects, indicated by the interaction data.
6. The method of any preceding claim, wherein determining the subset of the objects as being one or more interaction objects is based on a direction of movement of the user, indicated by the interaction data.
7. The method of any preceding claim, wherein determining the subset of the objects as being one or more interaction objects is based on using the interaction data to predict a future interaction between the user of the first remote user device and the subset of the objects.
8. The method of any preceding claim, wherein the first data stream is transmitted at a higher bit-rate than the second data stream.
9. The method of claim 8, wherein the first and second data streams are transmitted at variable bit-rates based on available bit-rate.
10. The method of any preceding claim, further comprising performing position filtering of one or more of the interaction objects in the first data stream prior to transmitting.
11. The method of any preceding claim, further comprising performing gain cross-fading of one or more of the interaction objects in the first data stream prior to transmitting.
12. The method of any preceding claim, wherein the first and second data streams comprise audio data, and wherein generating the second data stream comprises substantially removing audio components corresponding to the one or more interaction objects from the first data stream.
13. The method of any preceding claim, wherein the first and second data streams are transmitted to the first remote user device.
14. The method of any preceding claim, wherein the first and second data streams are transmitted to one or more different remote user devices as from the first remote user device where the interaction data was received.
15. The method of any preceding claim, wherein the interaction data from the first remote user device comprises position data associated with the one or more interaction objects, and the method further comprises transmitting the position data to one or more different remote user devices as from the first user device from where the interaction data was received.
16. The method of any preceding claim, wherein the subset of interaction objects comprises one or more of an audio object, a visual object, a haptic object and a smell object.
17. The method of any preceding claim, further comprising, prior to transmitting the first and second data streams, transmitting transition streams over both the first and second transmission channels, which both comprise the subset of interaction objects.
18. A method comprising:
receiving a first data stream comprising one or more interaction objects, being a subset of a plurality of objects in a set of virtual content;
receiving a second data stream comprising the remainder of the plurality of objects in the set of virtual content; and
rendering objects from both the first and second received data streams based on interaction data from a first user device indicative of a user interaction with the set of virtual content.
19. The method of claim 18, wherein the first and second data streams are received from different, first and second transmission channels.
20. The method of claim 18 or claim 19, wherein rendering objects is based on interaction data indicative of the distance between the user and the subset of the objects.
21. The method of claim 20, wherein the subset of objects comprises one or more of the objects indicated by the interaction data as being within a predetermined distance of the user.
22. The method of any of claims 18 to 21, wherein rendering objects is based on interaction data indicative of a viewing direction of the user relative to the subset of the objects.
23. The method of any of claims 18 to 22, wherein rendering objects is based on interaction data indicative of a direction of movement of the user.
24. The method of any of claims 18 to 23, further comprising buffering data from the first data stream for the subsequent rendering operation based on the interaction data being indicative of a predicted future interaction with the one or more interaction objects.
25. The method of any of claims 18 to 24, wherein the first data stream is received at a higher bit-rate than the second data stream is received.
26. The method of any of claims 18 to 25, performed at the first remote user device.
27. The method of any of claims 18 to 25, performed at one or more different remote user devices as from the first user device.
28. The method of any of claims 18 to 27, wherein the interaction data from the first user device comprises position data associated with the one or more interaction object(s), and the method further comprises receiving the position data at one or more different remote user devices as from the first user device.
29. The method of any of claims 18 to 28, wherein the subset of interaction objects comprises one or more of an audio object, a visual object, a haptic object and a smell object.
30. The method of any of claims 18 to 29, further comprising, prior to receiving the first and second data streams, receiving transition streams which both comprise the subset of interaction object(s).
31. A computer program comprising instructions that, when executed by a computer, control it to perform the method of any preceding claim.
32. An apparatus configured to perform the method of any of claims 1 to 30.
33. A non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
providing virtual content for transmission to one or more remote user devices, the virtual content comprising a plurality of objects;
receiving interaction data from a first remote user device indicative of a user interaction with virtual content;
determining based on the interaction data a subset of the objects as being one or more interaction objects associated with the user of the remote user device;
generating a first data stream including the one or more interaction objects;
transmitting the first data stream to a remote user device;
generating a second data stream including the remainder of the plurality of objects; and
transmitting the second data stream to a remote user device.
34. An apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
to provide virtual content for transmission to one or more remote user devices, the virtual content comprising a plurality of objects;
to receive interaction data from a first remote user device indicative of a user interaction with virtual content;
to determine based on the interaction data a subset of the objects as being one or more interaction objects associated with the user of the remote user device;
to generate a first data stream including the one or more interaction objects;
to transmit the first data stream to a remote user device;
to generate a second data stream including the remainder of the plurality of objects; and
to transmit the second data stream to a remote user device.
35. A non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
receiving a first data stream comprising one or more interaction objects, being a subset of a plurality of objects in a set of virtual content;
receiving a second data stream comprising the remainder of the plurality of objects in the set of virtual content;
rendering objects from both the first and second received data streams based on interaction data from a first user device indicative of a user interaction with the set of virtual content.
36. An apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
to receive a first data stream comprising one or more interaction objects, being a subset of a plurality of objects in a set of virtual content;
to receive a second data stream comprising the remainder of the plurality of objects in the set of virtual content; and
to render objects from both the first and second received data streams based on interaction data from a first user device indicative of a user interaction with the set of virtual content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1719560.3A GB2568726A (en) | 2017-11-24 | 2017-11-24 | Object prioritisation of virtual content |
Publications (3)
Publication Number | Publication Date |
---|---|
GB201719560D0 GB201719560D0 (en) | 2018-01-10 |
GB2568726A true GB2568726A (en) | 2019-05-29 |
GB2568726A8 GB2568726A8 (en) | 2019-06-19 |
Family
ID=60950595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1719560.3A Withdrawn GB2568726A (en) | 2017-11-24 | 2017-11-24 | Object prioritisation of virtual content |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2568726A (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110413109A (en) * | 2019-06-28 | 2019-11-05 | 广东虚拟现实科技有限公司 | Generation method, device, system, electronic equipment and the storage medium of virtual content |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140267564A1 (en) * | 2011-07-07 | 2014-09-18 | Smart Internet Technology Crc Pty Ltd | System and method for managing multimedia data |
WO2017030985A1 (en) * | 2015-08-14 | 2017-02-23 | Pcms Holdings, Inc. | System and method for augmented reality multi-view telepresence |
Also Published As
Publication number | Publication date |
---|---|
GB2568726A8 (en) | 2019-06-19 |
GB201719560D0 (en) | 2018-01-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111466124B (en) | Method, processor system and computer readable medium for rendering an audiovisual recording of a user | |
CN110832883B (en) | Mixed Order Ambisonics (MOA) audio data for computer mediated reality systems | |
US11055057B2 (en) | Apparatus and associated methods in the field of virtual reality | |
TWI595785B (en) | Apparatus and method for screen related audio object remapping | |
KR102517906B1 (en) | Method, apparatus and system for optimizing communication between transmitter and receiver in computer-assisted reality applications | |
CN111466122A (en) | Audio delivery optimization for virtual reality applications | |
JP6622388B2 (en) | Method and apparatus for processing an audio signal associated with a video image | |
CN112673649B (en) | Spatial audio enhancement | |
US20210112361A1 (en) | Methods and Systems for Simulating Acoustics of an Extended Reality World | |
KR102566276B1 (en) | Parameters for overlay processing for immersive teleconferencing and telepresence for remote terminals | |
CN111492342B (en) | Audio scene processing | |
JP7457525B2 (en) | Receiving device, content transmission system, and program | |
WO2019034804A2 (en) | Three-dimensional video processing | |
GB2568726A (en) | Object prioritisation of virtual content | |
KR20210056414A (en) | System for controlling audio-enabled connected devices in mixed reality environments | |
US20220386060A1 (en) | Signalling of audio effect metadata in a bitstream | |
US11696085B2 (en) | Apparatus, method and computer program for providing notifications | |
CN113632496A (en) | Associated spatial audio playback | |
JP7505029B2 (en) | Adaptive Audio Delivery and Rendering | |
JP2021129129A (en) | Information processing device, information processing method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |