CN117397243A - Streaming scene prioritizer for immersive media - Google Patents

Streaming scene prioritizer for immersive media

Info

Publication number
CN117397243A
Authority
CN
China
Prior art keywords
scene
media
asset
assets
network
Legal status
Pending
Application number
CN202380011136.XA
Other languages
Chinese (zh)
Inventor
保罗·斯潘塞·道金斯
阿芮亚娜·汉斯
史蒂芬·文格尔
Current Assignee
Tencent America LLC
Original Assignee
Tencent America LLC
Priority claimed from US 18/137,849 (published as US 2023/0370666 A1)
Application filed by Tencent America LLC
Publication of CN117397243A

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Aspects of the present disclosure provide methods and apparatus for immersive media processing. In some examples, a method of media processing includes receiving scene-based immersive media for play on a light field-based display. Scene-based immersive media includes a plurality of scenes. The method includes assigning priority values to a plurality of scenes in the scene-based immersive media, respectively, and determining an ordering of streaming the plurality of scenes to the terminal device based on the priority values. In some examples, a method of media processing includes receiving scene-based immersive media for play on a light field-based display. A scene in the scene-based immersive media includes a first ordered plurality of assets. The method includes determining a second ordering for streaming the plurality of assets to the terminal device, the second ordering being different from the first ordering.

Description

Streaming scene prioritizer for immersive media
Cross Reference to Related Applications
The present application claims the priority benefit of U.S. patent application Ser. No. 18/137,849, entitled "STREAMING SCENE PRIORITIZER FOR IMMERSIVE MEDIA," filed in 2023, which claims the priority benefit of the following U.S. provisional patent applications: U.S. provisional application 63/341,191, entitled "STREAMING SCENE PRIORITIZER FOR LIGHTFIELD HOLOGRAPHIC MEDIA," filed in May 2022; U.S. provisional application 63/342,532, entitled "STREAMING SCENE PRIORITIZER FOR LIGHTFIELD, HOLOGRAPHIC AND/OR IMMERSIVE MEDIA," filed in May 2022; U.S. provisional application 63/342,532, entitled "STREAMING ASSET PRIORITIZER FOR LIGHTFIELD, HOLOGRAPHIC AND/OR IMMERSIVE MEDIA BASED ON ASSET SIZE," filed in May 2022; U.S. provisional application 63/344,907, entitled "STREAMING ASSET PRIORITIZER FOR LIGHTFIELD, HOLOGRAPHIC, OR IMMERSIVE MEDIA BASED ON ASSET VISIBILITY," filed on May 23, 2022; U.S. provisional application 63/354,071, entitled "PRIORITIZING STREAMING ASSETS FOR LIGHTFIELD, HOLOGRAPHIC AND/OR IMMERSIVE MEDIA BASED ON MULTIPLE POLICIES," filed in 2022; U.S. provisional application 63/355,768, entitled "SCENE ANALYZER FOR PRIORITIZATION OF ASSET RENDERING BASED ON ASSET VISIBILITY IN SCENE DEFAULT VIEWPORT," filed in 2022; U.S. provisional application 63/416,390, entitled "STREAMING ASSET PRIORITIZER FOR LIGHTFIELD, HOLOGRAPHIC OR IMMERSIVE MEDIA," filed on October 14, 2022; a U.S. provisional application entitled "STREAMING ASSET PRIORITIZER FOR LIGHTFIELD, HOLOGRAPHIC OR IMMERSIVE MEDIA BASED ON ASSET COMPLEXITY," filed in 2022; U.S. provisional application 63/416,175, entitled "STREAMING ASSET PRIORITIZER FOR LIGHTFIELD, HOLOGRAPHIC, OR IMMERSIVE MEDIA BASED ON MULTIPLE METADATA ATTRIBUTES," filed in 2022; and U.S. provisional application 63/416,395, entitled "ADDITIONAL TYPES OF PRIORITY FOR STREAMING ASSET PRIORITIZER FOR LIGHTFIELD, HOLOGRAPHIC OR IMMERSIVE MEDIA," filed in 2022. The entire disclosures of these prior applications are incorporated herein by reference in their entirety.
Technical Field
The present disclosure describes embodiments generally related to media processing and distribution, including adaptive streaming of immersive media, such as for light field displays or holographic immersive displays.
Background
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, is neither expressly nor impliedly admitted as prior art against the present disclosure.
Immersive media generally refers to media that stimulates any or all of the human sensory systems (visual, auditory, somatosensory, olfactory, and possibly gustatory) to create or enhance the perception that a user is physically present in the media experience, that is, media beyond what is distributed over existing commercial networks for timed two-dimensional (2D) video and corresponding audio (also referred to as "traditional media"). Both immersive and legacy media can be classified as timed and non-timed.
Timed media refers to media that is organized and presented according to time. Examples include movie features, news stories, episode content, etc., organized by time period. Conventional video and audio are generally considered to be timed media.
Non-timed media refers to media that is not organized by time, but rather by logical, spatial, and/or temporal relationships. One example includes a video game in which a user may control an experience created by a gaming device. Another example of non-timed media is a still image photograph taken by a camera. Non-timed media may incorporate timed media, for example a continuously looping audio or video clip in a scene of a video game. Conversely, timed media may incorporate non-timed media, such as a video with a fixed still image as a background.
A device with immersive media capabilities may be a device that is equipped with the ability to access, interpret, and render the immersive media. Such media and devices are heterogeneous in terms of the amount and format of the media and the amount and type of network resources required to distribute such media on a large scale (i.e., to achieve a distribution equivalent to conventional video and audio media over a network). In contrast, traditional devices such as laptop displays, televisions, and mobile handset displays are functionally homogeneous in that all of these devices consist of rectangular display screens and use 2D rectangular video or still images as their primary media formats.
Disclosure of Invention
The present disclosure provides methods and apparatus for media processing. According to some aspects of the present disclosure, a method of media processing includes receiving, by a network device, scene-based immersive media for play on a light field-based display. The scene-based immersive media includes a plurality of scenes. The method includes assigning priority values to the plurality of scenes in the scene-based immersive media, respectively; and determining an ordering of streaming the plurality of scenes to the terminal device according to the priority value.
In some examples, the method includes reordering the plurality of scenes according to the priority value; and transmitting the reordered plurality of scenes to the terminal device.
In some examples, the method includes selecting, by a priority-aware network device, a highest priority scene having a highest priority value from a subset of untransmitted scenes of the plurality of scenes; and transmitting the highest priority scene to the terminal device.
In some examples, the method includes determining a priority value for a scene based on a likelihood that the scene needs to be rendered.
In some examples, the method includes determining that available network bandwidth is limited; selecting a highest priority scene having a highest priority value from a subset of untransmitted scenes in the plurality of scenes; and transmitting the highest priority scene in response to the available network bandwidth being limited.
In some examples, the method includes determining that available network bandwidth is limited; identifying, based on the priority values, a subset of the plurality of scenes that is unlikely to be needed for a next rendering; and avoiding streaming the subset of the plurality of scenes in response to the available network bandwidth being limited.
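For illustration only, the following Python sketch shows one way the scene-ordering behavior described above could be realized by a priority-aware network device: untransmitted scenes are ordered by their priority values, and when bandwidth is limited, scenes whose priority indicates they are unlikely to be rendered next are withheld. The Scene class, the skip threshold, and the field names are assumptions made for this sketch and are not defined by the present disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Scene:
    scene_id: str
    priority: float          # e.g., likelihood that the scene is rendered next
    transmitted: bool = False

def next_scene_to_stream(scenes: List[Scene]) -> Optional[Scene]:
    """Select the highest-priority scene among the untransmitted scenes."""
    pending = [s for s in scenes if not s.transmitted]
    return max(pending, key=lambda s: s.priority, default=None)

def streaming_order(scenes: List[Scene], bandwidth_limited: bool,
                    skip_threshold: float = 0.1) -> List[Scene]:
    """Reorder untransmitted scenes by priority; when bandwidth is limited,
    withhold scenes that are unlikely to be needed for a next rendering."""
    pending = [s for s in scenes if not s.transmitted]
    if bandwidth_limited:
        pending = [s for s in pending if s.priority >= skip_threshold]
    return sorted(pending, key=lambda s: s.priority, reverse=True)
```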
In some examples, the method includes assigning a first priority value to a first scene based on a second priority value of a second scene in response to a relationship between the first scene and the second scene.
In some examples, the method includes receiving a feedback signal from the terminal device and adjusting at least a priority value of a scene of the plurality of scenes based on the feedback signal. In one example, the method includes assigning a highest priority to a first scene in response to a feedback signal indicating that a current scene is a second scene associated with the first scene. In another example, the feedback signal indicates a priority adjustment determined by the terminal device.
According to some aspects of the present disclosure, a method of media processing includes receiving, by a terminal device having a light field-based display, a media presentation description (MPD) of scene-based immersive media for playback on the light field-based display. The scene-based immersive media includes a plurality of scenes of immersive media, and the MPD indicates an ordering in which the plurality of scenes are streamed to the terminal device. The method further includes detecting bandwidth availability; determining a change in the ordering for at least one scene based on the bandwidth availability; and transmitting a feedback signal indicating the change in the ordering of the at least one scene.
In some examples, the feedback signal indicates a next scene to be rendered. In some examples, the feedback signal indicates an adjustment to a priority value of the at least one scene. In some examples, the feedback signal indicates a current scene.
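A minimal sketch of the client-side feedback described above is given below, assuming a simple dictionary-based message; the field names (current_scene, next_scene, priority_adjustments) are illustrative only and do not correspond to any standardized MPD or signaling extension.

```python
def build_feedback(current_scene_id: str, next_scene_id: str,
                   priority_adjustments: dict) -> dict:
    """Assemble a feedback message reporting the current scene, the scene
    expected to be rendered next, and any priority adjustments the terminal
    device has determined (e.g., after detecting a bandwidth change)."""
    return {
        "current_scene": current_scene_id,
        "next_scene": next_scene_id,
        "priority_adjustments": priority_adjustments,   # scene_id -> new priority
    }

# Example: the viewer has entered scene "S6", so the terminal device asks the
# network to promote the adjoining scene "S7" and to demote scene "S9".
feedback = build_feedback("S6", "S7", {"S7": 1.0, "S9": 0.2})
```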
According to some aspects of the present disclosure, a method of media processing includes receiving, by a network device, scene-based immersive media for play on a light field-based display. The scene in the scene-based immersive media includes a first ordered plurality of assets. The method includes determining a second ordering for streaming the plurality of assets to a terminal device, the second ordering being different from the first ordering.
In some examples, the method includes assigning priority values to the plurality of assets, respectively, according to one or more attributes of the plurality of assets in the scene, and determining the second ranking for streaming the plurality of assets according to the priority values.
In some examples, the method includes assigning a priority value to an asset in the scene based on a size of the asset.
In some examples, the method includes assigning a priority value to an asset in the scene associated with a default entry location of the scene based on a visibility of the asset.
In some examples, the method includes assigning a first set of priority values to the plurality of assets based on a first attribute of the plurality of assets. The first set of priority values is used to rank the plurality of assets in a first prioritization scheme. The method also includes assigning a second set of priority values to the plurality of assets based on a second attribute of the plurality of assets. The second set of priority values is used to rank the plurality of assets in a second prioritization scheme. The method further comprises selecting a prioritization scheme for the terminal device from the first and second prioritization schemes according to the information of the terminal device; and ordering the plurality of assets for streaming according to the selected prioritization scheme.
In some examples, the method includes assigning a first priority value to a first asset and assigning a second priority value to a second asset in response to the first asset having a higher computational complexity than the second asset, the first priority value being higher than the second priority value.
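The following sketch illustrates, under assumed attribute names, how several prioritization schemes (by asset size, by visibility from the default entry location, and by computational complexity) could be computed for the same set of assets, with one scheme selected per terminal device. It is an illustrative example, not a normative algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Asset:
    asset_id: str
    size_bytes: int
    visible_from_default_entry: bool
    complexity: float        # e.g., normalized polygon count

def by_size(asset: Asset) -> float:
    # Smaller assets first, so the client can start rendering sooner.
    return -float(asset.size_bytes)

def by_visibility(asset: Asset) -> float:
    # Assets visible from the scene's default entry location come first.
    return 1.0 if asset.visible_from_default_entry else 0.0

def by_complexity(asset: Asset) -> float:
    # More complex assets first, since they take the longest to render.
    return asset.complexity

SCHEMES: Dict[str, Callable[[Asset], float]] = {
    "size": by_size, "visibility": by_visibility, "complexity": by_complexity,
}

def second_ordering(assets: List[Asset], client_info: dict) -> List[Asset]:
    """Select a prioritization scheme from the terminal device information and
    return the (second) ordering used for streaming the assets."""
    scheme = SCHEMES.get(client_info.get("preferred_policy", ""), by_visibility)
    return sorted(assets, key=scheme, reverse=True)
```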
According to some aspects of the present disclosure, a media processing method includes determining a position and a size of a viewport associated with a camera for viewing a scene in scene-based immersive media. The scene includes a plurality of assets. The method further includes, for each asset: determining, based on the position and the size of the viewport and a geometry of the asset, whether the asset at least partially intersects the viewport; storing an identifier of the asset in a list of visible assets in response to the asset at least partially intersecting the viewport; and including the list of visible assets in metadata associated with the scene.
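A simplified sketch of such a visibility check is shown below, with each asset's geometry and the viewport both approximated by axis-aligned bounding boxes; real implementations would typically test asset geometry against the camera frustum, so the box representation here is purely an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Box:
    """Axis-aligned bounding box standing in for the viewport volume and
    for each asset's geometry in this simplified sketch."""
    min_xyz: Tuple[float, float, float]
    max_xyz: Tuple[float, float, float]

def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes overlap at least in part along every axis."""
    return all(a.min_xyz[i] <= b.max_xyz[i] and b.min_xyz[i] <= a.max_xyz[i]
               for i in range(3))

def visible_asset_list(viewport: Box, assets: Dict[str, Box]) -> List[str]:
    """Return the identifiers of assets that at least partially intersect the
    viewport; this list can then be stored in the scene's metadata."""
    return [asset_id for asset_id, bounds in assets.items()
            if intersects(viewport, bounds)]
```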
According to some aspects of the present disclosure, a method of media processing includes receiving profile information (e.g., the type of each client device and the number of client devices of each type) for client devices attached to a network, for adapting media to one or more media requirements of the client devices in order to distribute the media to the client devices; and prioritizing a first adaptation according to the profile information of the client devices, the first adaptation adapting media to first media requirements for distribution to a first subset of the client devices.
Aspects of the present disclosure also provide a non-transitory computer-readable medium having instructions stored thereon, which when executed by a computer, cause the computer to perform a method for media processing.
Drawings
Further features, properties and various advantages of the disclosed subject matter will become more apparent from the following detailed description and drawings in which:
fig. 1 illustrates a media streaming process in some examples.
Fig. 2 illustrates a media conversion decision process in some examples.
Fig. 3 illustrates a representation of a format of timed heterogeneous immersive media in one example.
Fig. 4 illustrates a representation of a streamable format of non-timed heterogeneous immersive media in one example.
Fig. 5 shows a schematic diagram of a process of synthesizing media from natural content into an ingest format in some examples.
FIG. 6 illustrates a schematic diagram of a process for creating an ingestion format for composite media in some examples.
FIG. 7 is a schematic diagram of a computer system according to an embodiment.
Fig. 8 illustrates a network media distribution system that supports various legacy displays and displays with heterogeneous immersive media capabilities (heterogeneous immersive-media-capable displays) as client endpoints in some examples.
Fig. 9 shows a schematic diagram of an immersive media distribution module capable of serving legacy displays and displays with heterogeneous immersive media capabilities in some examples.
Fig. 10 shows a schematic diagram of a media adaptation process in some examples.
Fig. 11 depicts a distribution format creation process in some examples.
Fig. 12 illustrates a packetizer processing system in some examples.
FIG. 13 illustrates a sequence diagram of a network adapting particular immersive media in an ingestion format into a streamable and suitable distribution format for a particular immersive media client endpoint in some examples.
Fig. 14 shows a schematic diagram of a media system with hypothetical networks and client devices for scene-based media processing in some examples.
Fig. 15 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples.
Fig. 16 shows a schematic diagram for adding complexity values and priority values to scenes in a scene manifest (scene manifest) in some examples.
Fig. 17 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples.
Fig. 18 shows a schematic diagram of a virtual museum in some examples to illustrate scene priorities.
Fig. 19 shows a schematic diagram of a scene for scene-based immersive media in some examples.
Fig. 20 shows a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples.
Fig. 21 shows a flowchart outlining a process according to an embodiment of the present disclosure.
Fig. 22 shows a flowchart outlining a process according to an embodiment of the present disclosure.
FIG. 23 shows a schematic diagram of a scene manifest with a mapping of scenes to assets in one example.
FIG. 24 shows a schematic diagram of streaming assets for a scene in some examples.
FIG. 25 shows a schematic diagram for reordering assets in a scene in some examples.
Fig. 26 shows a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples.
Fig. 27 shows a schematic diagram of a scenario in some examples.
FIG. 28 shows a schematic diagram for reordering assets in a scene in some examples.
Fig. 29 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples.
Fig. 30 shows a schematic diagram of a scenario in some examples.
FIG. 31 shows a schematic diagram for assigning priority values to assets in a scene in some examples.
Fig. 32 shows a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples.
Fig. 33 shows a schematic diagram of a scenario in some examples.
FIG. 34 shows a schematic diagram for reordering assets in a scene in some examples.
Fig. 35 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples.
Fig. 36 shows a schematic diagram of a display scene in some examples.
Fig. 37 shows a schematic diagram of a scene 3701 in a scene-based immersive media in one example.
FIG. 38 shows a schematic diagram for prioritizing assets in a scene in some examples.
Fig. 39 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples.
FIG. 40 shows a schematic diagram illustrating metadata attributes of an asset in one example.
FIG. 41 shows a schematic diagram for prioritizing assets in a scene in some examples.
Fig. 42 shows a flowchart outlining a process according to an embodiment of the present disclosure.
Fig. 43 shows a flowchart outlining a process according to an embodiment of the present disclosure.
FIG. 44 shows a schematic diagram of a timed media representation that signals assets in a default camera view in some examples.
FIG. 45 illustrates a schematic diagram of non-timed media representations that in some examples signal assets in a default viewport.
FIG. 46 illustrates a process flow for a scene analyzer to analyze assets in a scene according to a default viewport.
Fig. 47 shows a flowchart outlining a process according to an embodiment of the present disclosure.
Fig. 48 illustrates a schematic diagram of an immersive media distribution module in some examples.
Fig. 49 shows a schematic diagram of a media adaptation process in some examples.
Fig. 50 shows a flowchart outlining a process 5000 according to an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Aspects of the present disclosure provide architectures, structures, components, techniques, systems, and/or networks for distributing media, including video, audio, geometric (3D) objects, haptics, associated metadata, or other content, to a client device. In some examples, the architecture, structure, component, technique, system, and/or network is configured to distribute media content to heterogeneous immersive and interactive client devices, such as game engine client devices. In some examples, the client device may include a light field display or a holographic immersive display. It should be noted that while some embodiments of the present disclosure use immersive media for a light field display or a holographic immersive display as an example, the disclosed techniques may be used with any suitable immersive media.
Immersive media is defined by immersive techniques that attempt to create or mimic the physical world through digital simulation, thereby simulating any or all of the human sensory systems to create a perception that the user is physically present in the scene.
Several different types of immersive media technologies are currently being developed and used: virtual reality (VR), augmented reality (AR), mixed reality (MR), light field, holography, etc. VR refers to a digital environment that replaces the user's physical environment by using a headset to place the user in a computer-generated world. AR, on the other hand, takes digital media and layers the digital media onto the real world around the user by using clear vision or a smartphone. MR refers to the blending of the real world with the digital world, creating an environment in which technology and the physical world can coexist.
Holographic and light field technologies can create a virtual environment with accurate depth and a three-dimensional sensation without the use of any headset, thereby avoiding side effects such as motion sickness. Light field and holographic technologies may use light rays in 3D space, in some examples rays from every point and in every direction. They are based on the concept that everything that is seen is illuminated by light from some source that propagates through space and impinges on the surface of an object, part of which is absorbed and part of which is reflected to another surface before reaching the eyes of a person. In some examples, light field and holographic techniques can reproduce a light field that provides 3D effects (e.g., binocular vision and continuous motion parallax) to the user. For example, a light field display may include a large array of projection modules that project light rays onto a holographic screen to reproduce an approximation of the light field by displaying different but consistent information in slightly different directions. In another example, a light field display may emit rays according to a plenoptic function, and the rays converge in front of the light field display to create a real 3D aerial image (real image). In the present disclosure, a light field based display refers to a display that uses light rays in 3D space, such as a light field display, a holographic display, and the like.
In some examples, the rays may be defined by a five-dimensional plenoptic function, where each ray is defined by three coordinates in 3D space and by two angles specifying its direction in 3D space.
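Written out explicitly, the five-dimensional plenoptic function referenced above can be expressed as follows (fuller forms of the plenoptic function add wavelength and time as further dimensions, but the five spatial and angular dimensions are the ones relevant here):

```latex
L = L(x, y, z, \theta, \phi)
```

where (x, y, z) is the position of a point in 3D space and (θ, φ) are the two angles specifying the direction of the ray through that point.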
Typically, to capture the content of a 360-degree video, a 360-degree camera is required. However, to collect content for a light field based or holographic display, in some examples, depending on the field of view (FoV) of the scene to be rendered, an expensive setup including multiple depth cameras or camera arrays may be required.
In one example, a conventional camera may collect a 2D representation of light reaching a camera lens at a given location. The image sensor records the sum of the brightness and color of all light reaching each pixel.
In some examples, to capture the content of a light field based display, a light field camera (also referred to as a plenoptic camera) is used, which can capture not only brightness and color, but also the direction of all rays reaching the camera sensor. Using the information acquired by the light field camera, a digital scene can be reconstructed with an accurate representation of the origin of each ray, making it possible to digitally reconstruct a precisely acquired scene in 3D.
Two techniques have been developed to acquire such a volumetric scene (volume). The first technique uses a camera or an array of camera modules to collect different light rays/views from each direction. The second technique uses a plenoptic camera (or an array of plenoptic cameras) that can likewise collect light rays from different directions.
In some examples, the volumetric scene may also be synthesized using computer-generated imagery (CGI) techniques and then rendered using the same techniques used to render a volumetric scene acquired by a camera.
Whether captured by plenoptic cameras or synthesized by CGI, multimedia content for light field based displays is captured and stored in a media server (also referred to as a media server device). Finally, the content is converted into a geometric representation that includes explicit information about the geometry of each object in the visual scene, as well as metadata describing the surface properties (e.g., smoothness, roughness, ability to reflect, refract, or absorb light). Each object in a visual scene is referred to as an "asset" of the scene. In order to transmit these data to the client device, a large amount of bandwidth is required even after compressing the data. Thus, in bandwidth limited situations, the client device may experience buffering or disruption, making the experience unpleasant.
In some examples, the use of a content delivery network (CDN) and edge network elements may reduce the latency between a client device's request for a scene and the delivery of the scene (including each visual object of the scene, the set of visual objects being referred to as the "assets" of the scene) to the client device. Furthermore, in contrast to video that is simply decoded and placed into a series of linear frame buffers for presentation to the user, an immersive media scene is composed of objects and rich metadata that need to be rendered by a real-time renderer, similar to the renderers provided with game engines (e.g., the Unity engine of Unity Technologies or the Unreal Engine of Epic Games), which may be computationally intensive and require a large amount of resources to fully construct the scene. For such scene-based media, cloud/edge network elements may be used to offload the computational load of rendering from the client device to a more powerful compute engine in the network, such as one or more server devices.
Multimedia content for a light field based display is collected and stored in a media server. The multimedia content may be real-world content or synthetic content. In order to transmit the data of the multimedia content to a terminal device of a client (also referred to as a client device), a large amount of bandwidth is required even after the data is compressed. Thus, in bandwidth-limited situations, the client device may experience buffering or disruption, which may make the experience unpleasant.
As previously mentioned, immersive media generally refers to media that stimulates any or all of the human sensory systems (visual, auditory, somatosensory, olfactory, and possibly gustatory) to create or enhance the perception that the user is physically present in the media experience, i.e., media other than media distributed over existing commercial networks for timed two-dimensional (2D) video and corresponding audio (referred to as "traditional media"). In some examples, immersive media refers to media that attempts to create or mimic the physical world through digital simulation of dynamics and the laws of physics, thereby stimulating any or all of the human sensory systems so as to create for the user the perception of being physically present in a scene depicting a real or virtual world. Both immersive and legacy media can be classified as timed and non-timed.
Timed media refers to media organized and presented according to time. Examples include movie features, news stories, episode content, all of which are organized according to time periods. Conventional video and audio are generally considered to be timed media.
Non-timed media refers to media that is not organized by time, but rather by logical, spatial, and/or temporal relationships. One example includes a video game in which a user may control an experience created by a gaming device. Another example of non-timed media is a still image photograph taken by a camera. Non-timed media may incorporate timed media, for example a continuously looping audio or video clip in a scene of a video game. Conversely, timed media may incorporate non-timed media, such as a video with a fixed still image as a background.
A device with immersive media capabilities may be a device that is provided with sufficient resources and capabilities to access, interpret, and render the immersive media. Such media and devices are heterogeneous in terms of the amount and format of the media and the amount and type of network resources required to distribute such media at large scale (i.e., to achieve a distribution equivalent to that of conventional video and audio media over a network). Similarly, media is heterogeneous in terms of the amount and type of network resources required to distribute such media at large scale. "At large scale" may refer to media distribution by service providers that achieve distribution equivalent to that of traditional video and audio media over networks, e.g., Netflix, Hulu, Comcast subscription networks, and Spectrum subscription networks.
In general, conventional devices such as laptop displays, televisions, and mobile handset displays are functionally homogeneous in that all of these devices consist of rectangular display screens and use 2D rectangular video or still images as their primary media formats. Similarly, the number of audio formats supported by conventional devices is limited to a relatively small set.
The term "frame-based" media refers to the feature that visual media consists of one or more consecutive rectangular image frames. In contrast, "scene-based" media refers to visual media organized by "scenes," where each scene refers to a plurality of individual assets (individual asserts) that collectively describe the visual characteristics of the scene.
A comparative example of frame-based visual media and scene-based visual media may be described using visual media depicting a forest. In a frame-based representation, a camera device (e.g., a mobile phone with a camera) is used to capture the forest. The user points the camera device at the forest, and the frame-based media captured by the camera device is the same as what the user sees through the camera viewport provided on the camera device, including any movement of the camera device initiated by the user. The resulting frame-based representation of the forest is a series of 2D images recorded by the camera device, typically at a standard rate of 30 frames per second or 60 frames per second. Each image is a collection of pixels, where the information stored in each pixel is congruent from one pixel to the next.
In contrast, a scene-based representation of the forest is composed of a plurality of individual assets describing each object in the forest and a human-readable scene graph description that presents a large amount of metadata describing how the asset or assets are rendered. For example, a scene-based representation may include multiple individual objects called "trees," where each tree is made up of a collection of smaller assets called "trunks," "branches," and "leaves." Each trunk may be further described by two parts: a mesh describing the complete 3D geometry of the trunk (trunk mesh), and textures applied to the trunk mesh to capture the color and radiance properties of the trunk. Furthermore, the trunk may be accompanied by additional information describing the surface of the trunk in terms of its smoothness or roughness or its ability to reflect light. The corresponding human-readable scene graph description may provide information about where to place the trunk relative to the viewport of a virtual camera focused into the forest scene. Furthermore, the human-readable description may include information about how many branches are generated from a single branch asset called a "branch" and where to place those branches in the scene. Similarly, the description may include how many leaves are generated and the location of the leaves relative to the branches and trunks. Furthermore, a transformation matrix may provide information on how to scale or rotate the leaves so that the leaves look different. In general, the individual assets that make up a scene differ in the type and amount of information stored in each asset. Each asset is typically stored in its own file, but an asset is often used to create multiple instances of the object it is designed to create, e.g., the branches and leaves of each tree.
In some examples, the human-readable portion of the scene graph is rich in metadata to describe not only the relationship of the asset to its location in the scene, but also instructions on how to render the object, e.g., using various types of light sources, or using surface properties (to indicate whether the object has a metallic or matte surface) or other materials (porous or smooth textures). Other information typically stored in the human-readable portion of the scene graph is the relationship of the asset to other assets, e.g., forming a group of assets that are rendered or processed as a single entity, e.g., a trunk with branches and leaves.
Examples of a scene graph with human-readable components include the Graphics Language Transmission Format 2.0 (glTF 2.0), where the node tree component is provided in the form of JavaScript Object Notation (JSON), a human-readable notation, to describe objects. Another example of a scene graph with human-readable components is the Immersive Technologies Media Format (ITMF), where OCS files are generated using XML (another human-readable notation format).
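Purely for illustration, the fragment below sketches a heavily simplified, glTF-2.0-flavored node tree (expressed here as a Python dictionary rather than a complete, valid glTF file) for the forest example above, in which a single trunk mesh and a single branch mesh are reused across several nodes via per-node transforms; the names and values are invented for this sketch.

```python
scene_graph = {
    "scenes": [{"nodes": [0, 1]}],                 # the root scene contains two trees
    "nodes": [
        {"name": "tree_0", "mesh": 0, "translation": [0.0, 0.0, 0.0],
         "children": [2, 3]},                      # trunk with two branch instances
        {"name": "tree_1", "mesh": 0, "translation": [4.0, 0.0, -2.0],
         "scale": [1.2, 1.2, 1.2]},                # same trunk mesh, scaled differently
        {"name": "branch_a", "mesh": 1,
         "rotation": [0.0, 0.0, 0.383, 0.924]},    # quaternion: rotated about z
        {"name": "branch_b", "mesh": 1,
         "rotation": [0.0, 0.0, -0.383, 0.924]},
    ],
    "meshes": [
        {"name": "trunk_mesh"},                    # geometry and texture references omitted
        {"name": "branch_mesh"},
    ],
}
```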
Another difference between scene-based media and frame-based media is that in frame-based media, the view created for the scene is the same as the view the user acquired via the camera, i.e., at the time the media is created. When frame-based media is presented by a client, the view of the presented media is the same as the view acquired in the media by, for example, a camera used to record video. However, for scene-based media, there may be multiple ways for a user to view a scene using various virtual cameras (e.g., a thin-lens camera or a panoramic camera). In some examples, visual information of a scene viewable by a viewer may vary according to the viewing angle and position of the viewer.
A client device supporting scene-based media may be equipped with a renderer and/or resources (e.g., a graphics processing unit (GPU), a central processing unit (CPU), and local media cache memory) whose capabilities and supported functionality together define an upper bound that characterizes the overall ability of the client device to ingest various scene-based media formats. For example, a mobile handset client device may be limited by the complexity of the geometric assets that it can render (particularly to support real-time applications), such as the number of polygons describing a geometric asset. Such limitations may be established based on the fact that the mobile client is battery powered and, thus, the amount of computing resources available to perform real-time rendering is likewise limited. In this case, the client device may wish to inform the network of its preference to receive geometric assets having a number of polygons no greater than a client-specified upper limit. Furthermore, information transferred from the client to the network may be optimally transferred using a well-defined protocol that leverages a dictionary of well-defined attributes.
Similarly, a media distribution network may have computing resources that facilitate the distribution of immersive media in various formats to various clients having various capabilities. In such networks, it may be desirable to inform the network of client-specific capabilities according to a well-defined profile protocol, for example via a dictionary of attributes conveyed by a well-defined protocol. Such an attribute dictionary may include information describing the minimum computing resources required for the media, or for real-time rendering of the media, so that the network can better establish how to prioritize serving media to its heterogeneous clients. In addition, a centralized data store that gathers client-provided profile information across client domains helps provide a summary of which types and formats of assets are in high demand. By providing information about which types of assets are in higher or lower demand, an optimized network can prioritize tasks that respond to requests for assets in higher demand.
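As an illustration of such an attribute dictionary, the following sketch shows a hypothetical client capability profile together with a check the network might apply before streaming a geometric asset; the attribute names are assumptions made for this example and do not correspond to any defined profile protocol.

```python
# Hypothetical client capability profile expressed as an attribute dictionary.
client_profile = {
    "client_id": "handset-001",
    "display_type": "mobile_2d",          # could also be "lightfield" or "holographic"
    "max_polygons_per_asset": 100_000,    # real-time rendering limit for this client
    "gpu_available": True,
    "local_cache_bytes": 2 * 1024 ** 3,
    "battery_powered": True,
    "preferred_ingest_formats": ["glTF2", "frame_based_hd"],
}

def needs_network_assistance(asset_polygon_count: int, profile: dict) -> bool:
    """Decide whether an asset exceeds the client's stated limits and should be
    simplified or split-rendered by the network before streaming."""
    return asset_polygon_count > profile["max_polygons_per_asset"]
```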
In some examples, media distribution over a network may use media delivery systems and architectures that reformat media from an input or network "ingest" media format into a distribution media format. In one example, the distribution media format is not only suitable for ingestion by the target client device and its applications, but also facilitates "streaming" over the network. In some examples, there may be two processes performed by the network on the ingested media: 1) converting the media from a format A into a format B suitable for ingestion by the target client device, based on the client device's ability to ingest certain media formats, and 2) preparing the media to be streamed.
In some examples, "streaming" of media broadly refers to the segmentation and/or grouping of media such that processed media may be transmitted over a network in successively smaller sized "chunks" logically organized and ordered according to one or both of a temporal or spatial structure of the media. In some examples, the "conversion" of media from format a to format B, which may sometimes be referred to as "transcoding," may be a process that is typically performed by a network or service provider prior to distributing the media to a target client device. Such transcoding may include converting the media from format a to format B based on a priori knowledge of: format B is to some extent the preferred or unique format that can be ingested by the target client device or format B is more suitable for distribution over limited resources such as a business network. One example of media conversion is converting media from a scene-based representation to a frame-based representation. In some examples, both the steps of converting the media and preparing the media to be streamed are necessary before the target client device can receive and process the media from the network. This a priori knowledge about the client's preferred format may be obtained through the use of a profile protocol that utilizes a agreed-upon attribute dictionary that summarizes the characteristics of the scene-based media preferred across the various client devices.
In some examples, the one- or two-step process described above, in which the network acts on the ingested media, creates a media format referred to as a "distribution media format," or simply the "distribution format," prior to distributing the media to the target client device. In general, if these steps are needed for a given media data object, they may be performed only once, provided the network has access to information indicating that the target client device will need to convert and/or stream the media object on multiple occasions; otherwise, such conversion and streaming of the media would be triggered multiple times. That is, the processing and transmission of data for media conversion and streaming is generally regarded as a source of latency and as requiring the consumption of potentially large amounts of network and/or computing resources. Thus, a network design that does not have access to information indicating when a client device may already have a particular media data object stored in its cache, or stored locally at the client device, will perform suboptimally compared to a network that does have access to such information.
In some examples, for legacy presentation devices, the distribution format may be equivalent or substantially equivalent to the "presentation format" that a client device (e.g., a client presentation device) ultimately uses to create a presentation. For example, a presentation media format is one whose properties (resolution, frame rate, bit depth, color gamut, etc.) are closely matched to the capabilities of the client presentation device. Some examples of distribution formats and presentation formats include: a high definition (HD) video signal (1920 columns x 1080 rows of pixels) distributed by the network to an ultra high definition (UHD) client device having a resolution of 3840 columns x 2160 rows of pixels. For example, a UHD client device may apply a process called "super-resolution" to the HD distribution format to increase the resolution of the video signal from HD to UHD. Thus, the final signal format presented by the UHD client device is the "presentation format," which in this example is the UHD signal, whereas the HD signal is the distribution format. In this example, the HD signal distribution format is very similar to the UHD signal presentation format, because both signals are in a rectilinear video format, and the process of converting the HD format to the UHD format is a relatively simple and easy process to perform on most conventional client devices.
In some examples, the preferred presentation format of the target client device may be significantly different from the ingestion format received by the network. However, the target client device may have access to sufficient computing, storage, and bandwidth resources to convert the media from the ingest format into the necessary presentation format suitable for presentation by the target client device. In this case, the network may bypass the step of reformatting the ingested media, e.g., "transcoding" the media from format A to format B, simply because the client device has access to sufficient resources to perform all media conversions without the network having to perform this step. However, the network may still perform the steps of segmenting and packetizing the ingested media so that the media can be streamed to the target client device.
In some examples, the ingested media received by the network is significantly different from the preferred presentation format of the target client device, and the target client device cannot access sufficient computing resources, storage resources, and/or bandwidth resources to convert the media to the preferred presentation format. In this case, the network may assist the target client device by performing some or all of the conversion from the ingest format to a format equivalent or nearly equivalent to the target client device's preferred presentation format on behalf of the target client device. In some architectural designs, such assistance provided by the network on behalf of the target client device is referred to as "split rendering" or "adaptation" of the media.
Fig. 1 illustrates a media streaming process 100 (also referred to as process 100) in some examples. The media streaming process 100 includes a first step that may be performed by the network cloud (or edge device) 104 and a second step that may be performed by the client device 108. In some examples, at step 101, the network receives media in an ingest media format A from a content provider. Step 102, which is a network processing step, may prepare the media for distribution to the client device 108 by formatting the media into format B and/or by preparing the media to be streamed to the client device 108. In step 103, the media is streamed from the network cloud 104 to the client device 108 via a network connection, for example using a network protocol such as the Transmission Control Protocol (TCP) or the User Datagram Protocol (UDP). In some examples, such streamable media, depicted as media 105, may be streamed to a media store 110. The client device 108 accesses the media store 110 via a retrieval mechanism 111, such as Dynamic Adaptive Streaming over HTTP (ISO/IEC 23009). The client device 108 receives or acquires the distributed media from the network and may prepare the media for presentation via a rendering process, as shown at 106. The output of the rendering process 106 is rendered media in another, potentially different, format C, as shown at 107.
Fig. 2 illustrates a media conversion decision process 200 (also referred to as process 200) that shows a network logic flow for processing ingested media within a network (also referred to as a network cloud), for example by one or more devices in the network. At 201, the network cloud ingests media from a content provider. At 202, attributes of the target client device are obtained (if the attributes of the target client device are not yet known). A decision 203 then determines whether the network should assist in the conversion of the media. When the decision 203 determines that the network should assist in the conversion, the ingested media is converted from format A to format B by processing 204, resulting in converted media 205. At 206, the converted or original form of the media is prepared for streaming. At 207, the prepared media is streamed to a target client device, such as a game engine client device, or to the media store 110 in FIG. 1.
An important aspect of the logic in Fig. 2 is the decision process 203, which may be performed by an automated process. This decision step may determine whether the media may be streamed in its original ingestion format A or whether the media needs to be converted to a different format B to facilitate presentation of the media by the target client device.
In some examples, the decision process step 203 may require access to information describing aspects or characteristics of the ingested media in order to make the best choice, i.e., to determine whether the ingested media needs to be converted prior to streaming the media to the target client device, or whether the media may be streamed directly to the target client device in its original ingestion format A.
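A minimal sketch of this kind of automated decision is shown below, assuming the network has access both to descriptive attributes of the ingested media and to the client's preferred formats (for example via the profile dictionary sketched earlier); the helper names and fields are hypothetical.

```python
def prepare_for_distribution(ingest_media: dict, client_profile: dict) -> dict:
    """Decide whether the ingested media (format A) must be converted to a
    client-suitable format B, then mark it as prepared for streaming."""
    ingest_format = ingest_media["format"]
    preferred = client_profile.get("preferred_ingest_formats", [])

    if preferred and ingest_format not in preferred:
        media = convert(ingest_media, preferred[0])    # network-assisted conversion
    else:
        media = dict(ingest_media)                     # stream in the ingest format

    media["segmented_for_streaming"] = True            # segmentation / packetization step
    return media

def convert(media: dict, target_format: str) -> dict:
    """Placeholder for transcoding/adaptation from the ingest format."""
    converted = dict(media)
    converted["format"] = target_format
    return converted
```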
According to one aspect of the disclosure, streaming of scene-based immersive media may be different from streaming of frame-based media. For example, streaming of frame-based media may be equivalent to streaming of video frames, where each frame captures a complete image of the entire scene or of the entire object to be presented by the client device. When a sequence of frames is reconstructed from its compressed form and presented to a viewer by the client device, a video sequence is created that comprises the entire immersive presentation or a portion of the presentation. For frame-based media streaming, the order in which frames are streamed from the network to the client device may be consistent with a predefined specification, such as ITU-T Recommendation H.264, Advanced Video Coding for Generic Audiovisual Services, of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T).
However, streaming of scene-based media differs from frame-based streaming in that a scene may be made up of multiple individual assets that may themselves be independent of each other. A given scene-based asset may be used multiple times within a particular scene or across a series of scenes. The amount of time required for a client device or any given renderer to create a proper rendering of a particular asset may depend on many factors, including but not limited to: the size of the asset, the availability of computing resources to perform the rendering, and other attributes describing the overall complexity of the asset. A client device supporting scene-based streaming may require that some or all of the rendering of each asset within a scene be completed before any rendering of the scene can begin. Thus, the order in which assets are streamed from the network to the client device may affect overall performance.
In accordance with an aspect of the present disclosure, in each of the above scenarios the conversion of media from format A to another format may be completed entirely by the network, entirely by the client device, or jointly by the network and the client device, e.g., for split rendering. An attribute dictionary describing the media formats may therefore be required so that both the client device and the network have complete information characterizing the conversion work. Further, a dictionary of attributes describing client device capabilities, for example in terms of available computing resources, available storage resources, and access to bandwidth, may also be needed. Even further, a mechanism to characterize the level of computational, storage, or bandwidth complexity of the ingestion format may be required so that the network and the client device may jointly, or separately, determine whether or when the network should employ a split rendering step to distribute the media to the client device. Furthermore, the network can skip the steps of converting and streaming particular ingested media objects if the client device can access or obtain the media objects that may be required to complete its media presentation (e.g., objects previously streamed to the client device). In some examples, if the conversion from format A to another format is determined to be a necessary step performed by or on behalf of the client device, a prioritization scheme for ordering the conversion of the multiple individual assets within a scene may be employed and may be beneficial to an intelligent and efficient network architecture.
Regarding the ordering of scene-based assets streamed from a network to a client device (which serves to facilitate the ability of the client device to perform to its full potential), it may be desirable for the network to be equipped with sufficient information so that the network can determine such an ordering to improve the performance of the client device. For example, a network that has sufficient information to avoid repeating the conversion and/or streaming steps for assets that are used more than once in a particular presentation may perform better than a network without such a design. Similarly, a network that is able to "intelligently" order asset delivery to clients may facilitate the ability of client devices to perform to their full potential, i.e., to create an experience that may be more enjoyable to the end user. Further, the interface between the client device and the network (e.g., a server device in the network) may be implemented using one or more communication channels over which basic information is communicated regarding: characteristics of the operating state of the client device, the availability of resources at or local to the client device, the type of media to be streamed, and the frequency with which assets will be used within or across multiple scenes. Thus, implementing a network architecture that streams scene-based media to heterogeneous clients may require access to a client interface that can use information related to the processing of each scene (including current conditions related to the client device's ability to access computing and storage resources) to supply and update the network server processes. Such a client interface may also interact closely with other processes executing on the client device, particularly with game engines that play a necessary role in delivering an immersive experience to the end user on behalf of the client device. Examples of where game engines may play a necessary role include providing application program interfaces (APIs) to enable the delivery of interactive experiences. Another role that the game engine may provide on behalf of the client device is to render the precise visual signals required by the client device to deliver a visual experience consistent with the capabilities of the client device.
Definitions of some terms used in the present disclosure are provided in the following paragraphs.
Scene graph (Scene graph): vector-based graphics editing applications and the general data structures commonly used in modern computer games arrange logical and often (but not necessarily) spatial representations of a graphics scene; the collection of nodes and vertices in the graph structure.
Scene (Scene): in the context of computer graphics, a scene is a collection of objects (e.g., 3D assets), object attributes, and other metadata, including visual, auditory, and physical-based features that describe a particular setting whose interaction with respect to objects in the setting is limited in space or time.
Node (Node): the basic elements of a scene graph consist of information related to the logical or spatial or temporal representation of visual, auditory, tactile, olfactory, gustatory or related process information; each node has at most one output edge, zero or more input edges, and at least one edge (input or output) connected thereto.
Base Layer (Base Layer): nominal representations of assets are typically used to minimize the computational resources or time required to render the asset, or the time to transmit the asset over a network.
Reinforcing layer (Enhancement Layer): a set of information that, when applied to a base layer representation of an asset, enhances the base layer to include characteristics or functions not supported by the base layer.
Attribute (Attribute): metadata associated with a node for describing a particular feature or characteristic of the node in a canonical or more complex form (e.g., from another node).
Binding Look-Up Table (LUT)): a logic structure associates metadata from an information management system (Information Management System, IMS) of ISO/IEC 23090 section 28 with metadata or other mechanisms for describing characteristics or functions of a particular scene graph format (e.g., immersive technology media format (Immersive Technologies Media Format, ITMF), glTF, universal scene description (Universal Scene Description)).
Container (Container): a serialized format used to store and exchange information representing a scene that is entirely natural, entirely synthetic, or a mixture of synthetic and natural content, including the scene graph and all of the media assets required to render the scene.
Serialization (Serialization): a process of converting a data structure or object state into a format that can be stored (e.g., in a file or memory buffer) or transmitted (e.g., over a network connection link) and later reconstructed (possibly in a different computer environment). When the resulting bit sequence is reread according to a serialization format, serialization can be used to create semantically identical clones of the original object.
Renderer (Renderer): a (typically software-based) application or process, based on a selective mixture of disciplines related to acoustic physics, light physics, visual perception, audio perception, mathematics, and software development, that, given an input scene graph and asset container, emits a typically visual and/or audio signal suitable for presentation on a target device or conforming to the desired properties as specified by the attributes of a render-target node in the scene graph. For visual-based media assets, a renderer may emit a visual signal suitable for the target display, or for storage as an intermediate asset (e.g., repackaged into another container, i.e., used in a series of rendering processes in a graphics pipeline); for audio-based media assets, a renderer may emit an audio signal for presentation over multichannel loudspeakers and/or binaural headphones, or for repackaging into another (output) container. Popular examples of renderers include the real-time rendering features of the game engines Unity and Unreal Engine.
Evaluation (Evaluation): produces a result (e.g., similar to the evaluation of the document object model (Document Object Model) of a web page) that causes the output to move from an abstract result to a concrete result.
Scripting language (Scripting language): an interpreted programming language, which may be executed at run-time by a renderer, to handle dynamic inputs and variable state changes made to scene graph nodes, which can affect the rendering and evaluation of spatial and temporal object topologies (including physical forces, constraints, inverse kinematics, deformations, collisions) and energy propagation and transmission (light, sound).
Shader (Shader): a type of computer program originally used for shading (producing appropriate levels of light, darkness, and color within an image) but which now performs a variety of specialized functions in various fields of computer graphics special effects, performs video post-processing unrelated to shading, or even performs functions entirely unrelated to graphics.
Path Tracing (Path Tracing): a computer graphics method for rendering three-dimensional scenes, which renders illumination of the scene faithful to reality.
Timed media (Timed media): time-ordered media; for example, having a start time and an end time according to a particular clock.
Non-timed media (un-timed media): media organized in spatial, logical, or temporal relation, e.g., as in an interactive experience implemented according to action taken by the user(s).
Neural network model (Neural Network Model): a set of parameters and tensors (e.g., matrices) defining weights (i.e., values) for use in well-defined mathematical operations applied to the visual signal to obtain an improved visual output, which may include interpolation of a new view of the visual signal that the original signal did not explicitly provide.
Translation (Translation): the process of converting a media format or type into another media format or type.
Adaptation (Adaptation): the process of translating and/or converting media into multiple bitrate representations.
Frame-based media (Frame-based media): 2D video with or without associated audio.
Scene-based media (Scene-based media): audio, video, haptics, and other primary types of media and media-related information organized logically and spatially using a scene graph.
Over the past decade, many immersive-media-capable devices have been introduced into the consumer market, including head-mounted displays, augmented reality glasses, hand-held controllers, multi-view displays, haptic gloves, and game consoles. Similarly, holographic displays and other forms of volumetric displays are poised to enter the consumer market within the next three to five years. Despite the availability, or imminent availability, of these devices, a consistent end-to-end ecosystem for distributing immersive media over commercial networks has not been achieved, for several reasons.
One of the obstacles to achieving a consistent end-to-end ecosystem for distributing immersive media over commercial networks is that the client devices serving as endpoints of such a distribution network for immersive displays are very diverse. Some of these devices support certain immersive media formats while others do not. Some of them are capable of creating an immersive experience from a traditional raster-based format, while others are not. Unlike a network designed only for the distribution of traditional media, a network that must support multiple display clients requires a significant amount of information about the specifics of each client's capabilities and about the format of the media to be distributed; such a network can then employ an adaptation process to convert the media into a format suitable for each target display and corresponding application. Such a network requires access to at least information describing the characteristics of each target display and the complexity of the ingested media, so that the network can determine how to meaningfully adapt an input media source into a format suitable for the target display and application. Similarly, a network optimized for efficiency may want to maintain a database of the media types, and their corresponding attributes, supported by the client devices attached to the network.
Similarly, an ideal network supporting heterogeneous client devices may take advantage of the fact that some assets that are adapted from an input media format to a particular target format may be reused across a set of similar display targets. That is, some assets, once converted into a format suitable for the target display, can be reused on multiple such displays with similar adaptation requirements. Thus, such an ideal network would employ a caching mechanism to store the adapted assets into a relatively immutable region, i.e., similar to the use of Content Delivery Networks (CDNs) used in conventional networks.
Furthermore, immersive media may be organized into "scenes" described by scene graphs, i.e., "scene-based media"; a scene graph is also referred to as a scene description. The scope of a scene graph is to describe the visual, audio, and other forms of immersive assets that comprise a particular setting that is part of a presentation, e.g., the actors and events taking place in a particular location in a building that is part of the presentation (e.g., a movie). The list of all scenes that make up a single presentation may be formulated as a list of scenes (also referred to as a scene list).
Another benefit of this approach is that, for content that is prepared in advance of having to distribute such content, a "bill of materials" can be created that identifies all of the assets that will be used for the entire presentation, as well as how frequently each asset is used across the various scenes in the presentation. An ideal network should have knowledge of the existence of cached resources that can be used to satisfy the asset requirements of a particular presentation. Similarly, a client that is presenting a series of scenes may wish to know the frequency with which any given asset will be used across multiple scenes. For example, if a media asset (also referred to as an "object") is referenced multiple times across multiple scenes that the client is or will be processing, the client should avoid discarding the asset from its cache resources until the client has presented the last scene that requires that particular asset.
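The caching behavior described above can be reduced to a simple rule: an asset may be evicted only after the last scene that references it has been presented. A minimal sketch follows, assuming the client holds a bill-of-materials-style mapping from each asset to the index of the last scene that uses it; the function and variable names are illustrative only.

```python
def evict_safely(asset_cache, last_scene_using, current_scene_index):
    """Drop cached assets only when no later scene still references them.
    'last_scene_using' maps asset id -> index of the last scene that needs it,
    information that could be derived from a bill of materials."""
    for asset_id in list(asset_cache):
        if last_scene_using.get(asset_id, -1) <= current_scene_index:
            del asset_cache[asset_id]      # safe: no upcoming scene needs it
    return asset_cache

cache = {"tree.glb": b"...", "hero.glb": b"..."}
evict_safely(cache, {"tree.glb": 2, "hero.glb": 7}, current_scene_index=3)
# only "hero.glb" remains, because it is still needed in scene 7
```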
Furthermore, such a process that may generate a "bill of materials" for a given scene or set of scenes may also annotate the scene(s) with standardized metadata (e.g., IMS from ISO/IEC 23090 section 28) to facilitate adaptation of the scene from one format to another.
Finally, many emerging advanced imaging displays, including but not limited to the Oculus Rift, Samsung Gear VR, Magic Leap goggles, all Looking Glass Factory displays, SolidLight displays from Light Field Lab, Avalon Holographics displays, and Dimenco displays, utilize a game engine as the mechanism for ingesting content to be rendered and presented on the display. Currently, the most popular game engines used in the above set of displays include the Unreal Engine from Epic Games and Unity from Unity Technologies. That is, advanced imaging displays are currently designed and shipped with one or both of these game engines serving as the mechanism by which the display obtains the media to be rendered and presented by such advanced imaging displays. Both the Unreal Engine and Unity are optimized to ingest scene-based media rather than frame-based media. However, existing media distribution ecosystems can only stream frame-based media. There is therefore a significant "gap" in current media distribution ecosystems, including standards (de jure or de facto) and best practices, to enable scene-based content to be distributed to emerging advanced imaging displays so that media can be delivered "at scale", e.g., at the same scale at which frame-based media is distributed.
In some examples, a mechanism or process responsive to the web server process (es) and participating in the combined web and immersive client architecture may be used on behalf of the client device on which the game engine is utilized to ingest the scene-based media. Such a "smart client" mechanism is particularly interesting in networks designed to stream scene-based media to immersive heterogeneous and interactive client devices, such that the distribution of the media is performed efficiently and within the constraints of the capabilities of the various components comprising the entire network. A "smart client" is associated with a particular client device and responds to a request by the network for information related to the current state of the client device with which it is associated, including the availability of resources on the client device for rendering and creating a presentation of scene-based media. The "Smart client" also acts as an "intermediary" between the client device using the game engine and the network itself.
It should be noted that without loss of generality, the remainder of the disclosed subject matter assumes that a smart client that is able to respond on behalf of a particular client device is also able to respond on behalf of a client device on which one or more other applications (i.e., not game engine applications) are active. That is, the problem of responding on behalf of the client device is equivalent to the problem of responding on behalf of the client device on which one or more other applications are active.
It should further be noted that the terms "media object" and "media asset" may be used interchangeably, both referring to a specific instance of media in a specific format. The term client device or client (without any limitation) refers to the device, and its constituent components, on which the media presentation is ultimately performed. The term "game engine" refers to the Unity engine or the Unreal Engine, or to any game engine that functions in a distributed network architecture.
In some examples, the network or the client device may use a mechanism or process that analyzes an immersive media scene in order to obtain sufficient information to support a decision-making process. When employed by a network or a client, the mechanism or process may provide an indication of whether the conversion of a media object (or media asset) from format A to format B should be performed entirely by the network, entirely by the client, or via a hybrid of both the network and the client (together with an indication of which assets should be converted by the client or by the network). Such an "immersive media data complexity analyzer" (also referred to as a media analyzer in some examples) may be used by a client device or a network device in an automated context.
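One way to express the analyzer's outcome is as a per-asset work assignment. The sketch below is only a toy illustration of that idea under assumed inputs: estimated_convert_ms is a made-up complexity metric, and the budgeting heuristic is not taken from this disclosure.

```python
def assign_conversion_work(assets, client_budget_ms):
    """Illustrative split of format-A-to-format-B conversions between network
    and client: the client receives the cheapest conversions until its compute
    budget is exhausted; the network converts everything else before streaming."""
    plan = {"client": [], "network": []}
    remaining = client_budget_ms
    for asset in sorted(assets, key=lambda a: a["estimated_convert_ms"]):
        if asset["estimated_convert_ms"] <= remaining:
            plan["client"].append(asset["id"])
            remaining -= asset["estimated_convert_ms"]
        else:
            plan["network"].append(asset["id"])
    return plan

print(assign_conversion_work(
    [{"id": "chair", "estimated_convert_ms": 5},
     {"id": "terrain", "estimated_convert_ms": 120}],
    client_budget_ms=20))   # {'client': ['chair'], 'network': ['terrain']}
```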
Referring back to FIG. 1, the media streaming process 100 illustrates media streamed over, or distributed by, the network 104 to a client device 108 on which a game engine is employed. In fig. 1, the process of ingesting media in format A is performed by processing in the cloud or edge device 104. At 101, media is obtained from a content provider (not shown). Process step 102 performs any necessary transformations or adjustments to the ingested media to create a potential alternative representation of the media as distribution format B. Media format A and media format B may or may not be representations that follow the same syntax of a particular media format specification, but format B may be adapted to facilitate the distribution of the media over a network protocol such as TCP or UDP. Such "streamable" media is depicted as the media that is streamed to client device 108 via network connection 105. The client device 108 has access to some rendering capabilities depicted as 106. Such rendering capabilities 106 may be rudimentary or, likewise, sophisticated, depending on the type of client device 108 and the game engine operating on the client device. The rendering process 106 creates presentation media that may or may not be represented according to a third format specification (e.g., format C). In some examples, in a client device employing a game engine, the rendering process 106 is typically a function provided by the game engine.
Referring to fig. 2, a media conversion decision process 200 may be used to determine whether a network needs to convert media prior to distributing the media to a client device. In fig. 2, ingest media 201, represented in format a, is provided to the network by a content provider (not depicted). Process step 202 obtains attributes describing the processing capabilities of the target client (not depicted). Decision process step 203 is used to determine whether the network or client should perform any format conversion on any media assets contained within the ingested media 201, e.g., converting a particular media object from format a to format B, before the media is streamed to the client. If any media asset needs to be converted by the network, the network employs a process step 204 to convert the media object from format A to format B. The converted media 205 is the output from the processing step 204. The converted media is incorporated into a preparation process 206 to prepare the media to be streamed to a game engine client (not shown). For example, process step 207 streams the prepared media to the game engine client.
FIG. 3 illustrates a representation of a streamable format 300 of timed heterogeneous immersive media in one example; fig. 4 illustrates a representation of a streamable format 400 of non-timed heterogeneous immersive media in one example. Fig. 3 relates to a scene 301 of timed media, while fig. 4 relates to a scene 401 of non-timed media. In both cases, the scene may be embodied by various scene representations or scene descriptions.
For example, in some immersive media designs, a scene may be embodied by a scene graph, or as a multi-plane image (Multi-plane Image, MPI) or a multi-spherical image (Multi-spherical Image, MSI). Both the MPI and the MSI techniques are examples of technologies that aid in the creation of display-agnostic scene representations for natural content (i.e., images of the real world captured simultaneously by one or more cameras). Scene graph technologies, on the other hand, may be employed to represent both natural and computer-generated imagery in the form of synthetic representations; however, creating such representations is especially compute-intensive for cases where the content is captured as a natural scene by one or more cameras. That is, creating a scene graph representation of naturally captured content requires both significant time and computation, involving a complex analysis of the natural images using photogrammetry, deep learning, or both, in order to create a synthetic representation that can subsequently be used to interpolate a sufficient and adequate number of views to fill the viewing frustum of a target immersive client display. Consequently, such synthetic representations are, for now, impractical candidates for representing natural content, because they cannot practically be created in real time for use cases that require real-time distribution. In some examples, the best candidate representation for computer-generated imagery is the use of a scene graph with synthetic models, because computer-generated imagery is created using 3D modeling processes and tools.
Such a dichotomy in the optimal representations of natural versus computer-generated content suggests that the optimal ingest format for naturally captured content differs from the optimal ingest format for computer-generated content, or for natural content that is not essential for real-time distribution applications. Therefore, the disclosed subject matter is sufficiently robust to support multiple ingest formats for visually immersive media, whether they are created naturally through the use of physical cameras or by a computer.
The following are exemplary technologies that embody a scene graph in a format suitable for representing visual immersive media created using computer-generated techniques, or naturally captured content for which deep learning or photogrammetry techniques are employed to create the corresponding synthetic representation of a natural scene (i.e., content that is not essential for real-time distribution applications).
1. ORBX by OTOY
ORBX by OTOY is one of several scene graph technologies able to support any type of visual media, timed or non-timed, including ray-traceable, traditional (frame-based), volumetric, and other types of synthetic or vector-based visual formats. According to one aspect, ORBX differs from other scene graphs in that ORBX provides native support for free and/or open source formats for meshes, point clouds, and textures. ORBX is a scene graph that has been intentionally designed to facilitate interchange across multiple vendor technologies that operate on scene graphs. In addition, ORBX provides a rich material system, support for the Open Shader Language, a robust camera system, and support for Lua scripts. ORBX is also the basis of the Immersive Technologies Media Format of the Immersive Digital Experiences Alliance (Immersive Digital Experiences Alliance, IDEA), licensed under royalty-free terms. In the context of real-time distribution of media, the ability to create and distribute an ORBX representation of a natural scene depends on the availability of computing resources to perform a complex analysis of the camera-captured data and to synthesize that data into a synthetic representation. To date, the availability of sufficient computation for real-time distribution is not practical, though not impossible.
2. Universal Scene Description (Universal Scene Description, USD) by Pixar
Universal Scene Description (USD) by Pixar is another scene graph, popular in the visual effects (Visual Effect, VFX) and professional content production communities. USD is integrated into Nvidia's Omniverse platform, a set of tools for developers to create and render 3D models using Nvidia's GPUs. A subset of USD published by Apple and Pixar is referred to as USDZ. USDZ is supported by Apple's ARKit.
3. glTF 2.0 by Khronos
glTF 2.0 is the most recent version of the Graphics Language Transmission Format specification written by the Khronos 3D Group. This format supports a simple scene graph format generally capable of supporting static (non-timed) objects in a scene, including "png" and "jpeg" image formats. glTF 2.0 supports simple animations, including support for translation, rotation, and scaling of basic shapes described using glTF primitives (i.e., geometric objects). glTF 2.0 does not support timed media, and therefore supports neither video nor audio.
4. The ISO/IEC 23090 Part 14 Scene Description is an extension of glTF 2.0 that adds support for timed media (e.g., video and audio).
It should be noted that the above-described scene representation of immersive visual media is provided as an example only, and does not limit the ability of the disclosed subject matter to specify a process of adapting an input immersive media source to a format suitable for a particular feature of a client terminal device.
Further, any or all of the above-described example media representations currently employ, or may employ, deep learning techniques to train and create a neural network model that enables or facilitates the selection of particular views to fill a particular display's viewing frustum based on the specific dimensions of the frustum. The views selected for the particular display's viewing frustum may be interpolated from existing views that are explicitly provided in the scene representation, e.g., according to the MSI or MPI techniques, or they may be rendered directly by a rendering engine based on a description of a specific virtual camera position, filter, or virtual camera for that rendering engine.
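As a very rough illustration of the first of these options, a client or network process could pick the captured views nearest to the requested virtual camera position before interpolating between them. The sketch below is a toy stand-in for that selection step only, with invented view records; it does not represent MPI/MSI interpolation or any neural network model.

```python
import math

def pick_nearest_views(provided_views, target_position, k=2):
    """Select the k captured views closest to a target virtual camera position;
    a real pipeline would then interpolate between them (e.g., MPI/MSI) or
    synthesize the view with a trained model."""
    return sorted(provided_views,
                  key=lambda v: math.dist(v["position"], target_position))[:k]

views = [{"name": "cam0", "position": (0.0, 0.0, 0.0)},
         {"name": "cam1", "position": (1.0, 0.0, 0.0)},
         {"name": "cam2", "position": (0.0, 1.0, 0.0)}]
print(pick_nearest_views(views, target_position=(0.2, 0.1, 0.0)))  # cam0, cam1
```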
Thus, the disclosed subject matter is sufficiently robust to take into account that there is a relatively small but well-known set of immersive media ingestion formats that are sufficient to meet the requirements of real-time or "on-demand" (e.g., non-real-time) distribution of media that is naturally acquired (e.g., using one or more cameras) or created using computer-generated techniques.
With advanced networking technologies (e.g., 5G for mobile networks), and deploying fiber optic cables for fixed networks, interpolation of views from immersive media ingestion formats by using neural network models or network-based rendering engines is further facilitated. That is, these advanced network technologies increase the capacity and capabilities of commercial networks because such advanced network infrastructure can support the transmission and transfer of increasingly large amounts of visual information. Network infrastructure management techniques such as multiple access edge computing (Multi-access Edge Computing, MEC), software defined networking (Software Defined Network, SDN), and network function virtualization (Network Functions Virtualization, NFV) enable commercial network service providers to flexibly configure their network infrastructure to accommodate changes in certain network resource requirements, e.g., in response to dynamic increases or decreases in network throughput, network speed, round trip delay, and computing resource requirements. Furthermore, this inherent capability of adapting dynamic network requirements also facilitates the ability of the network to adapt the immersive media ingestion format to a suitable distribution format in order to support a variety of immersive media applications with potentially heterogeneous visual media formats for heterogeneous client endpoints.
The immersive media applications themselves may also have varying requirements for network resources, including: gaming applications, which require significantly lower network latency in order to respond to real-time updates in the game state; telepresence applications, which have symmetric throughput requirements for the uplink and downlink portions of the network; and passive viewing applications, which may have increased demand for downlink resources depending on the type of client endpoint display that is consuming the data. In general, any consumer-facing application may be supported by a variety of client endpoints with various on-board client capabilities for storage, computing, and driving, as well as various requirements for particular media representations.
Thus, the disclosed subject matter enables a sufficiently equipped network, i.e., a network that employs some or all of the characteristics of a modern network, to simultaneously support a plurality of legacy devices and immersive-media-capable devices according to specified characteristics:
1. flexibility is provided to take full advantage of media ingestion formats that are practical for use cases of real-time and on-demand media distribution.
2. Flexibility in supporting natural content and computer-generated content is provided for legacy client endpoints and client endpoints with immersive media capabilities.
3. Timed media and non-timed media are supported.
4. A process is provided for dynamically adapting a source media ingestion format to a suitable distribution format based on the characteristics and capabilities of a client endpoint and based on the requirements of an application.
5. Ensuring that the distribution format can be streamed over an IP-based network.
6. Enabling the network to serve multiple heterogeneous client endpoints simultaneously, which may include legacy devices and applications as well as devices and applications with immersive media capabilities.
7. An exemplary media representation framework is provided that facilitates organizing distribution media along scene boundaries.
An improved end-to-end embodiment enabled by the disclosed subject matter is achieved in accordance with the processes and components described in the following detailed description.
Fig. 3 and 4 each employ an exemplary inclusive distribution format (encompassing distribution format) that can be adapted from an ingestion source format to match the capabilities of a particular client endpoint. As described above, the media shown in fig. 3 is timed, while the media shown in fig. 4 is non-timed. The particular inclusion format is sufficiently robust in its structure to accommodate a large number of different media attributes, each of which may be layered based on the significant amount of information each layer contributes to the presentation of the media. It should be noted that the layering process may be applied, for example, in progressive joint image experts group (Joint Photographic Expert Group, JPEG) and scalable video architecture (e.g., specified in ISO/IEC 14496-10 scalable advanced video coding).
According to one aspect, media streamed according to the inclusive media format is not limited to traditional video and audio media, but may include any type of media information capable of producing signals that interact with a machine to stimulate human vision, hearing, taste, touch, and smell.
According to another aspect, the media streamed according to the inclusive media format may be timed media or non-timed media, or a mixture of both timed and non-timed media.
According to another aspect, the inclusive media format is also streamable by enabling a layered representation of media objects using a base layer and enhancement layer architecture. In one example, separate base layers and enhancement layers are computed by applying multi-resolution or multi-tiling analysis techniques to the media objects in each scene. This is analogous to the progressively rendered image formats specified in ISO/IEC 10918-1 (JPEG) and ISO/IEC 15444-1 (JPEG 2000), but not limited to raster-based visual formats. In one example, a progressive representation of a geometric object may be a multi-resolution representation of the object computed using wavelet analysis.
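For intuition only, a single level of such a multi-resolution split can be imitated on a 1-D signal with pairwise averages (the base layer) and pairwise differences (the enhancement layer). This Haar-style toy is an assumption-laden stand-in for the wavelet analysis mentioned above, not the layering scheme of any particular codec or standard.

```python
def split_layers(samples):
    """One-level Haar-style split: half-resolution base layer (averages)
    plus an enhancement layer (differences) that restores full resolution."""
    pairs = list(zip(samples[0::2], samples[1::2]))
    base = [(a + b) / 2 for a, b in pairs]
    enhancement = [(a - b) / 2 for a, b in pairs]
    return base, enhancement

def reconstruct(base, enhancement):
    """A client could render from 'base' alone, then refine after the
    enhancement layer arrives."""
    out = []
    for avg, diff in zip(base, enhancement):
        out.extend([avg + diff, avg - diff])
    return out

b, e = split_layers([8, 6, 3, 5])
assert reconstruct(b, e) == [8, 6, 3, 5]   # lossless once both layers are present
```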
In another example of a hierarchical representation of a media format, an enhancement layer applies different properties to a base layer, such as refining material properties of a surface of a visual object represented by the base layer. In yet another example, the attribute may refine the texture of the surface of the base layer object, e.g., change the surface from a smooth texture to a porous texture, or from a matte surface to a glossy surface.
In yet another example of a hierarchical representation, the surface of one or more visual objects in a scene may change from a Lambertian surface to a ray-traceable surface.
In yet another example of a hierarchical representation, the network will distribute the base layer representation to clients so that the clients can create a nominal rendering of the scene (nominal presentation) while the clients await the transmission of additional enhancement layers to refine the resolution or other characteristics of the base representation.
According to another aspect, the resolution of the properties or refinement information in the enhancement layer is not explicitly coupled with the resolution of the objects in the base layer as today in existing Moving Picture Experts Group (MPEG) video and JPEG image standards.
According to another aspect, the inclusive media format supports any type of information media that may be presented or actuated by a presentation device or machine, thereby enabling heterogeneous media formats to be supported for heterogeneous client endpoints. In one embodiment of a network that distributes media formats, the network will first query the client endpoint to determine the client's capabilities, and if the client cannot meaningfully ingest the media representation, the network will remove attribute layers that are not supported by the client, or adapt the media from its current format to a format that is suitable for the client endpoint. In one example of such an adaptation, the network would convert the volumetric visual media asset to a 2D representation of the same visual asset using a network-based media processing protocol. In another example of such adaptation, the network may employ a neural network process to reformat the media into an appropriate format or alternatively synthesize the view required by the client endpoint.
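The query-then-adapt behavior described in this aspect can be pictured as a small pruning pass over the scene's assets. The snippet below is a schematic sketch only; the capability profile fields and layer-type labels are invented for illustration and do not correspond to any defined protocol.

```python
def tailor_scene_for_client(scene_assets, client_caps):
    """Drop enhancement layers the client cannot use and flag base formats
    that the network must convert before streaming."""
    tailored = []
    for asset in scene_assets:
        kept_layers = [layer for layer in asset["enhancement_layers"]
                       if layer in client_caps["supported_layer_types"]]
        tailored.append({
            "id": asset["id"],
            "format": asset["format"],
            "enhancement_layers": kept_layers,
            "convert_before_streaming":
                asset["format"] not in client_caps["supported_formats"],
        })
    return tailored

caps = {"supported_formats": ["2D"], "supported_layer_types": ["texture_refine"]}
assets = [{"id": "statue", "format": "volumetric",
           "enhancement_layers": ["texture_refine", "ray_traceable_surface"]}]
print(tailor_scene_for_client(assets, caps))
# "statue" keeps only "texture_refine" and is marked for conversion to a 2D representation
```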
According to another aspect, a manifest of a complete or partially complete immersive experience (playback of a live stream event, game, or on-demand asset) is organized by a scene, which is the minimum amount of information that a rendering and game engine can currently ingest to create a presentation. The manifest includes a list of multiple individual scenes to be rendered for the entire immersive experience requested by the client. Associated with each scene is one or more representations of geometric objects within the scene corresponding to a streamable version of the scene geometry. One embodiment of a scene representation refers to a low resolution version of the geometric objects of the scene. Another embodiment of the same scene involves an enhancement layer for a low resolution representation of the scene to add additional detail or to add tiling to the geometric objects of the same scene. As described above, each scene may have more than one enhancement layer, increasing the detail of the geometric objects of the scene in a progressive manner.
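A possible shape for such a scene-organized manifest, with per-scene base and enhancement layer references, is sketched below. The class and field names are illustrative assumptions and are not drawn from any standard or from the figures of this disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LayerReference:
    uri: str            # token/URI where this representation can be fetched
    kind: str           # "base" or "enhancement"
    detail_level: int   # higher values add progressively more geometric detail

@dataclass
class SceneEntry:
    scene_id: str
    layers: List[LayerReference] = field(default_factory=list)

@dataclass
class PresentationManifest:
    """Rough stand-in for a manifest that lists every scene to be rendered
    for the requested immersive experience."""
    scenes: List[SceneEntry] = field(default_factory=list)

manifest = PresentationManifest(scenes=[
    SceneEntry("scene-1",
               [LayerReference("https://cdn.example/scene1/base", "base", 0),
                LayerReference("https://cdn.example/scene1/enh1", "enhancement", 1)]),
])
```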
According to another aspect, each layer of media objects referenced within a scene is associated with a token (e.g., a uniform resource identifier (Uniform Resource Identifier, URI)) that points to an address where a resource may be accessed within a network. Such resources are similar to CDNs from which clients can obtain content.
According to another aspect, a token for representing a geometric object may point to a location within a network or to a location within a client. That is, the client may signal to the network that its resources are available to the network for network-based media processing.
Fig. 3 illustrates a timed media representation 300 in some examples. The timed media representation 300 depicts an example of an inclusive media format for timed media. The timed scene list 300A includes a list of scene information 301. The scene information 301 refers to a list of components 302 that separately describe the processing information and the types of media assets in the scene information 301. The components 302 refer to assets 303, and the assets 303 further refer to base layers 304 and attribute enhancement layers 305. In the example of fig. 3, each of the base layers 304 is associated with a numeric frequency metric that indicates the number of times the asset is used across the set of scenes in the presentation. A list of unique assets that have not been previously used in other scenes is provided in 307. Proxy visual asset 306 includes information for reusing a visual asset, such as a unique identifier of the reused visual asset, and proxy aural asset 308 includes information for reusing an aural asset, such as a unique identifier of the reused aural asset.
Fig. 4 illustrates a non-timed media representation 400 in some examples. The non-timed media representation 400 depicts an example of an inclusive media format for non-timed media. A non-timed scene manifest (not depicted) references scene 1.0, to which no other scene can branch. The scene information 401 is not associated with a start and end duration according to a clock. The scene information 401 refers to a list of components 402 that separately describe the processing information and the types of media assets that comprise the scene. The components 402 refer to assets 403, and the assets 403 further refer to a base layer 404 and attribute enhancement layers 405 and 406. In the example of fig. 4, each of the base layers 404 is associated with a numeric frequency value that indicates the number of times the asset is used across the set of scenes in the presentation. Further, the scene information 401 may refer to other scene information 401 for non-timed media. The scene information 401 may also refer to scene information 407 for a timed media scene. The list 408 identifies unique assets associated with a particular scene that have not been previously used in a higher-order (e.g., parent) scene.
Fig. 5 shows a schematic diagram of a process 500 for synthesizing an ingestion format from natural content. The process 500 includes a first sub-process for content acquisition and a second sub-process for composition of the ingestion format of natural images.
In the example of fig. 5, in the first sub-process, camera units may be used to capture natural image content 509. For example, the camera unit 501 may use a single camera lens to capture a scene of a person. The camera unit 502 may capture a scene with five diverging fields of view by mounting five camera lenses around a ring-shaped object. The arrangement in the camera unit 502 is an exemplary arrangement for capturing omnidirectional content for VR applications. The camera unit 503 captures a scene with seven converging fields of view by mounting seven camera lenses on the inner diameter portion of a sphere. The arrangement in the camera unit 503 is an exemplary arrangement for capturing light fields for light field or holographic immersive displays.
In the example of fig. 5, in the second sub-process, the natural image content 509 is synthesized. For example, the natural image content 509 is provided as input to a synthesis module 504, which, in one example, may use a neural network training module 505 that uses a set of training images 506 to produce a capture neural network model 508. Another process commonly used in lieu of the training process is photogrammetry. If the capture neural network model 508 is created during the process 500 depicted in fig. 5, the capture neural network model 508 becomes one of the assets of the ingest format 507 for the natural content. In some examples, an annotation process 511 may optionally be performed to annotate the scene-based media with IMS metadata. Exemplary embodiments of the ingest format 507 include MPI and MSI.
Fig. 6 illustrates a schematic diagram of a process 600 for creating an ingest format for composite media 608 (e.g., computer-generated imagery). In the example of fig. 6, a LIDAR camera 601 captures a point cloud 602 of a scene. Computer-generated imagery (CGI) tools, 3D modeling tools, or other animation processes for creating synthetic content are employed on a computer 603 to create 604 CGI assets over a network. A motion capture suit with sensors 605A is worn by an actor 605 to capture a digital recording of the motion of the actor 605, producing motion capture (MoCap) data 606. The data 602, 604, and 606 are provided as inputs to a synthesis module 607, which may likewise create a neural network model (not shown in fig. 6), for example, using a neural network and training data. In some examples, the synthesis module 607 outputs the composite media 608 in the ingest format. The composite media 608 in the ingest format may then be input to an optional IMS annotation process 609, the output of which is the IMS-annotated composite media 610 in the ingest format.
The techniques for representing, streaming, and processing heterogeneous immersive media in this disclosure can be implemented as computer software using computer-readable instructions and physically stored in one or more computer-readable media. For example, FIG. 7 illustrates a computer system 700 suitable for implementing certain embodiments of the disclosed subject matter.
The computer software may be encoded using any suitable machine code or computer language that may be compiled, interpreted, linked, or the like, to create code comprising instructions that may be executed directly by one or more computer Central Processing Units (CPUs), graphics Processing Units (GPUs), or the like, or by interpretation, microcode execution, or the like.
These instructions may be executed on various types of computers or components thereof, including, for example, personal computers, tablet computers, servers, smartphones, gaming devices, internet of things devices, and the like.
The components of computer system 700 shown in fig. 7 are exemplary in nature and are not intended to suggest any limitation as to the scope of use or functionality of the computer software implementing embodiments of the present disclosure. Nor should the configuration of components be construed as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary embodiment of computer system 700.
The computer system 700 may include some human interface input devices. Such human interface input devices may be responsive to one or more human users' inputs by, for example, the following: tactile input (e.g., key strokes, data glove movements), audio input (e.g., voice, clapping hands), visual input (e.g., gestures), olfactory input (not depicted). The human interface device may also be used to capture certain media that are not necessarily directly related to the conscious input of a person, such as audio (e.g., speech, music, ambient sound), images (e.g., scanned images, photographic images acquired from still image cameras), video (e.g., two-dimensional video, three-dimensional video including stereoscopic video), and so forth.
The input human interface device may include one or more of the following (each only one shown): keyboard 701, mouse 702, touch pad 703, touch screen 710, data glove (not shown), joystick 705, microphone 706, scanner 707, camera 708.
The computer system 700 may also include certain human interface output devices. Such human interface output devices may stimulate the senses of one or more human users through, for example, tactile output, sound, light, and smell/taste. Such human interface output devices may include tactile output devices (for example, tactile feedback by the touch screen 710, data glove (not shown), or joystick 705, although there can also be tactile feedback devices that do not serve as input devices), audio output devices (such as speakers 709 and headphones (not depicted)), and visual output devices (such as screens 710, including cathode-ray tube (Cathode-Ray Tube, CRT) screens, liquid crystal display (Liquid Crystal Display, LCD) screens, plasma screens, and organic light-emitting diode (Organic Light-Emitting Diode, OLED) screens, each with or without touch-screen input capability, each with or without tactile feedback capability, some of which may be capable of two-dimensional visual output or more-than-three-dimensional output through means such as stereoscopic output; virtual reality glasses (not depicted); holographic displays and smoke tanks (not depicted)), and printers (not depicted).
The computer system 700 may also include human-accessible storage devices and their associated media, such as optical media including a CD/DVD ROM/RW drive 720 with compact disc (Compact Disc, CD)/digital video disc (Digital Video Disc, DVD) or similar media 721, a thumb drive 722, a removable hard disk drive or solid state drive 723, legacy magnetic media such as magnetic tape and floppy disks (not depicted), specialized ROM/ASIC/PLD-based devices such as security dongles (not depicted), and the like.
It should also be appreciated by those skilled in the art that the term "computer-readable medium" as used in connection with the presently disclosed subject matter does not encompass transmission media, carrier waves, or other transitory signals.
Computer system 700 may also include an interface 754 to one or more communication networks 755. The networks may be, for example, wireless, wireline, or optical. The networks may further be local, wide-area, metropolitan, vehicular and industrial, real-time, delay-tolerant, and so on. Examples of networks include local area networks (Local Area Network, LAN) such as Ethernet and wireless LANs; cellular networks including global system for mobile communications (Global System for Mobile, GSM), third generation (3rd Generation, 3G), fourth generation (4th Generation, 4G), fifth generation (5th Generation, 5G), long term evolution (Long Term Evolution, LTE), and the like; TV wireline or wireless wide-area digital networks including cable TV, satellite TV, and terrestrial broadcast TV; and vehicular and industrial networks including the controller area network bus (Controller Area Network bus, CANbus); and so forth. Certain networks commonly require external network interface adapters (for example, a universal serial bus (Universal Serial Bus, USB) port of the computer system 700) attached to certain general-purpose data ports or peripheral buses 749; as described below, other network interfaces are commonly integrated into the core of the computer system 700 by attachment to a system bus (for example, an Ethernet interface into a PC computer system or a cellular network interface into a smartphone computer system). Using any of these networks, the computer system 700 can communicate with other entities. Such communication can be receive-only unidirectional (for example, broadcast TV), send-only unidirectional (for example, CANbus to certain CANbus devices), or bi-directional (for example, to other computer systems using a local or wide-area digital network). As described above, certain protocols and protocol stacks can be used on each of those networks and network interfaces.
The human interface devices, human accessible storage devices, and network interfaces described above may be attached to the core 740 of the computer system 700.
The core 740 may include one or more central processing units (CPUs) 741, graphics processing units (GPUs) 742, specialized programmable processing units in the form of field programmable gate arrays (Field Programmable Gate Array, FPGA) 743, hardware accelerators 744 for certain tasks, graphics adapters 750, and so forth. These devices, together with read-only memory (ROM) 745, random access memory (RAM) 746, and internal mass storage 747 such as internal non-user-accessible hard drives and solid state drives (Solid State Drive, SSD), may be connected through a system bus 748. In some computer systems, the system bus 748 may be accessible in the form of one or more physical plugs to enable expansion by additional CPUs, GPUs, and the like. Peripheral devices may be attached either directly to the core's system bus 748 or through a peripheral bus 749 to the core's system bus 748. In one example, the screen 710 may be connected to the graphics adapter 750. Architectures for the peripheral bus include peripheral component interconnect (Peripheral Component Interconnection, PCI), USB, and the like.
CPU 741, GPU 742, FPGA 743, and accelerator 744 may execute certain instructions that, in combination, may constitute the computer code described above. The computer code may be stored in ROM 745 or RAM 746. Transition data may also be stored in RAM 746, while persistent data may be stored, for example, in internal mass storage 747. Fast storage and retrieval of any storage device may be achieved through the use of a cache, which may be closely associated with one or more of the following: CPU 741, GPU 742, mass storage 747, ROM 745, and RAM 746.
The computer-readable medium may have thereon computer code that performs various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind well known and available to those having skill in the computer software arts.
By way of example, and not limitation, a computer system having the architecture 700, and in particular the core 740, may provide functionality as a result of processor(s) (including CPUs, GPUs, FPGAs, accelerators, and so forth) executing software embodied in one or more tangible computer-readable media. Such computer-readable media may be media associated with the user-accessible mass storage described above, as well as certain storage of the core 740 that is of a non-transitory nature, such as the core-internal mass storage 747 or the ROM 745. The software implementing various embodiments of the present disclosure may be stored in such devices and executed by the core 740. A computer-readable medium may include one or more memory devices or chips, according to particular needs. The software may cause the core 740, and in particular the processors therein (including CPUs, GPUs, FPGAs, and so forth), to execute particular processes or particular parts of particular processes described herein, including defining data structures stored in the RAM 746 and modifying such data structures according to the processes defined by the software. Additionally or alternatively, the computer system may provide functionality as a result of logic hardwired or otherwise embodied in a circuit (for example, the accelerator 744), which may operate in place of or together with software to execute particular processes or particular parts of particular processes described herein. Reference to software may encompass logic, and vice versa, where appropriate. Reference to a computer-readable medium may encompass a circuit (such as an integrated circuit (IC)) storing software for execution, a circuit embodying logic for execution, or both, where appropriate. The present disclosure encompasses any suitable combination of hardware and software.
Fig. 8 illustrates a network media distribution system 800. In some examples, the network media distribution system 800 supports various legacy displays and displays with heterogeneous immersive media capabilities as client endpoints. In the example of fig. 8, the content acquisition module 801 collects or creates media using the example embodiment of fig. 6 or 5. An ingest format is created in content preparation module 802 and then transmitted to one or more client endpoints in the network media distribution system using transmission module 803. Gateway 804 may serve customer premises equipment to provide network access to various client endpoints of the network. The set top box 805 may also be used as a customer premises equipment to provide access to the aggregated content by a network service provider. The wireless demodulator 806 may be used as a mobile network access point for the mobile device (e.g., as with the mobile handset and display 813). In one or more embodiments, the legacy 2D television 807 may be directly connected to the gateway 804, the set top box 805, or the wireless fidelity (Wireless Fidelity, wiFi) router 808. A laptop computer with a conventional 2D display 809 may be a client endpoint connected to a WiFi router 808. A head mounted 2D (raster based) display 810 may also be connected to router 808. A lenticular light field display 811 may be connected to gateway 804. The display 811 may be comprised of a local computing GPU 811A, a storage device 811B, and a visual presentation unit 811C that creates multiple views using ray-based lenticular optics. The holographic display 812 may be connected to the set top box 805 and may include a local computing CPU 812A, GPU B, a storage device 812C, and a fresnel pattern, wave based holographic visualization unit 812D. The augmented reality earpiece 814 may be connected to the wireless demodulator 806 and may include a GPU 814A, a storage device 814B, a battery 814C, and a volumetric visual rendering component 814D. Dense light field display 815 may be connected to WiFi router 808 and may include a plurality of GPUs 815A, CPU 815B and storage devices 815C, eye-tracking devices 815D, cameras 815E, and dense light ray based light field panels 815F.
Fig. 9 shows a schematic diagram of an immersive media distribution module 900 capable of serving both the legacy display previously depicted in fig. 8 and a display having heterogeneous immersive media capabilities. Content is created or retrieved in modules 901 respectively embodied for natural content and CGI content in fig. 5 and 6. The content is then converted to an ingest format using create network ingest format module 902. Some examples of module 902 are embodied in fig. 5 and 6 for natural content and CGI content, respectively. In some examples, the media analyzer 911 may perform appropriate media analysis on the ingested media and may update the ingested media to store information from the media analyzer 911. In one example, a scene analyzer with optional IMS annotation functionality in the media analyzer 911 updates the ingest media to have IMS metadata annotations. The ingest media format is transmitted to a network and stored on storage device 903. In some other examples, the storage device may reside in the network of the immersive media content producer and be accessed remotely by the immersive media network distribution module 900, as depicted by the dashed line bisecting 903. In some examples, the client and application specific information is available on the remote storage device 904, in one example, the remote storage device 904 may optionally be remotely present in an alternative cloud network.
As depicted in fig. 9, the network orchestrator 905 serves as the primary information source and information pool to perform the primary tasks of the distribution network. In this particular embodiment, network orchestrator 905 may be implemented in a uniform format with other components of the network. However, in some examples, the tasks depicted by network orchestrator 905 in fig. 9 form elements of the disclosed subject matter. The network orchestrator 905 may be implemented using software, and the software may be executed by processing circuitry to perform a process.
According to some aspects of the disclosure, the network orchestrator 905 may also employ a two-way messaging protocol to communicate with client devices to facilitate the processing and distribution of media (e.g., immersive media) according to characteristics of the client devices. Furthermore, the bi-directional message protocol may be implemented across different transport channels (i.e., control plane channels and data plane channels).
The network orchestrator 905 receives information about characteristics and properties of client devices, such as the client 908 (also referred to as the client device 908) in fig. 9, and further gathers requirements about applications currently running on the client 908. This information may be obtained from the device 904 or, in alternative embodiments, may be obtained by directly querying the client 908. In some examples, a two-way message protocol is used to enable direct communication between network orchestrator 905 and client 908. For example, the network orchestrator 905 may send a direct query to the client 908. In some examples, smart client 908E may participate in the collection and reporting of client status and feedback on behalf of client 908. Smart client 908E may be implemented using software that is executable by processing circuitry to perform processes.
The network orchestrator 905 also initiates and communicates with a media adaptation and segmentation module 910, the media adaptation and segmentation module 910 being described in fig. 10. When the media adaptation and segmentation module 910 adapts and segments the ingest media, in some examples, the media is transferred to an intermediate media storage device depicted as media in preparation for distribution to the storage device 909. When the distributed media is prepared and stored in the device 909, the network orchestrator 905 ensures that the immersive client 908 receives the distributed media and corresponding descriptive information 906 via its network interface 908B by push requests, or the client 908 itself can initiate a pull request for the media 906 from the storage device 909. In some examples, network orchestrator 905 may employ a two-way message interface (not shown in fig. 9) to perform the "push" request or initiate the "pull" request by immersive client 908. In one example, the immersive client 908 can employ a network interface 908B, GPU (or CPU not shown) 908C and a storage device 908D. Further, the immersive client 908 can employ the game engine 908A. The game engine 908A may also employ a visualization component 908A1 and a physics engine 908A2. The game engine 908A communicates with the Smart client 908E to orchestrate the processing of media via the game engine APIs and callback functions 908F. The distribution format of the media is stored in a storage device of the client 908 or in the storage cache 908D. Finally, the client 908 visually presents media via the visualization component 908A 1.
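The coupling between the smart client 908E and the game engine via the APIs and callback functions 908F can be imagined, very loosely, as a callback registry. The snippet below is purely hypothetical: neither Unity nor Unreal Engine exposes this interface, and every name in it is an assumption used only to illustrate the orchestration pattern.

```python
class SmartClientBridge:
    """Toy stand-in for a smart-client-to-game-engine bridge."""
    def __init__(self):
        self._asset_ready_callbacks = []

    def register_asset_ready(self, callback):
        # The game engine registers a callback so it can ingest each asset
        # as soon as the smart client finishes receiving it from the network.
        self._asset_ready_callbacks.append(callback)

    def deliver(self, asset_id, payload):
        for callback in self._asset_ready_callbacks:
            callback(asset_id, payload)

bridge = SmartClientBridge()
bridge.register_asset_ready(
    lambda asset_id, payload: print(f"engine ingests {asset_id}: {len(payload)} bytes"))
bridge.deliver("chair.glb", b"\x00" * 128)
```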
Throughout streaming the immersive media to the immersive client 908, the network orchestrator 905 may check the status of the client progress via the client progress and status feedback channel 907. The checking of the status may be performed by means of a bi-directional communication message interface (not shown in fig. 9), which may be implemented in the smart client 908E.
Fig. 10 depicts a schematic diagram of a media adaptation process 1000 in some examples such that ingested source media can be appropriately adapted to match the requirements of the immersive client 908. The media adaptation and segmentation module 1001 includes a number of components that facilitate adapting ingest media to an appropriate distribution format for the immersive client 908. In fig. 10, a media adaptation and segmentation module 1001 receives an input network state 1005 to track the current traffic load on the network. The immersive client 908 information may include attribute and property descriptions, application properties and descriptions, and application current state, as well as a client neural network model (if available) to help map the geometry of the client's frustum to the interpolation capabilities of the ingested immersive media. Such information may be obtained by means of a two-way message interface (not shown in fig. 10) by means of a smart client interface shown as 908E in fig. 9. The media adaptation and segmentation module 1001 ensures that the adapted output is stored in the client adapted media storage 1006 when it is created. In fig. 10, the media analyzer 1007 is depicted as a process that may be performed as a priority process or as part of a network automation process for distributing media. In some examples, the media analyzer 1007 includes a scene analyzer with optional IMS annotation functionality.
In some examples, the network orchestrator 1003 initiates the adaptation process 1001. The media adaptation and segmentation module 1001 is controlled by a logic controller 1001F. In one example, the media adaptation and segmentation module 1001 employs a renderer 1001B or a neural network processor 1001C to adapt the specific ingest source media into a format suitable for the client. In one example, the media adaptation and segmentation module 1001 receives client information 1004 from a client interface module 1003 (such as a server device in this example). The client information 1004 may include a client description and current state, may include an application description and current state, and may include a client neural network model. The neural network processor 1001C uses the neural network model 1001A. Examples of such a neural network processor 1001C include the DeepView neural network model generator as described for MPI and MSI. In some examples, if the media is in a 2D format but the client requires a 3D format, the neural network processor 1001C may invoke a process to derive a volumetric representation of the scene depicted in the video from the highly correlated images of the 2D video signal. An example of such a process could be the neural radiance field (NeRF) process developed at the University of California, Berkeley, which operates from one or several images. An example of a suitable renderer 1001B may be a modified version of the OTOY Octane renderer (not shown), modified to interact directly with the media adaptation and segmentation module 1001. In some examples, the media adaptation and segmentation module 1001 may employ a media compressor 1001D and a media decompressor 1001E, depending on the need for these tools with respect to the format of the ingested media and the format required by the immersive client 908.
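A reader can think of the renderer-versus-neural-network choice as a simple dispatch on the ingest format and the client's needs. The following is a minimal sketch under that assumption; the string labels and the decision rule are illustrative, not the module's actual control logic.

```python
def choose_adaptation_path(ingest_format, client_needs_3d):
    """Route 2D ingest media that must be shown volumetrically to a
    neural-network-based view/volume synthesis step; otherwise use a
    conventional renderer to retarget the asset."""
    if ingest_format == "2D" and client_needs_3d:
        return "neural_network_processor"   # e.g., a NeRF-like derivation
    return "renderer"

assert choose_adaptation_path("2D", client_needs_3d=True) == "neural_network_processor"
assert choose_adaptation_path("scene_graph", client_needs_3d=True) == "renderer"
```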
Fig. 11 illustrates a distribution format creation process 1100 in some examples. The adapted media grouping module 1103 groups media from the media adaptation module 1101 (depicted in fig. 10 as process 1000) that now resides on the client adapted media storage device 1102. The grouping module 1103 formats the adapted media from the media adaptation module 1101 into a robust distribution format 1104, such as the exemplary format shown in fig. 3 or fig. 4. Manifest information 1104A provides the client 908 with a list of the scene data assets 1104B that it can expect to receive, along with optional metadata describing the frequency with which each asset is used across the set of scenes that make up the presentation. List 1104B depicts a list of visual, auditory, and tactile assets, each with its corresponding metadata. In this exemplary embodiment, each asset reference in list 1104B contains metadata with a numeric frequency value indicating the number of times a particular asset is used across all of the scenes comprising the presentation.
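To make the structure of manifest information 1104A and asset list 1104B concrete, the following Python sketch models a distribution manifest as a list of asset references, each carrying a frequency value. The class and field names (AssetRef, DistributionManifest, frequency, and the example URNs) are illustrative assumptions, not a format defined by this disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssetRef:
    """One entry in asset list 1104B: an asset reference plus its metadata."""
    urn: str          # identifier of the visual, auditory, or tactile asset
    media_type: str   # e.g. "visual", "audio", "tactile"
    frequency: int    # number of times the asset is used across all scenes

@dataclass
class DistributionManifest:
    """Sketch of manifest information 1104A for one presentation."""
    presentation_urn: str
    assets: List[AssetRef] = field(default_factory=list)

# An asset reused in many scenes carries a high frequency value, which a
# client or network element may later use as one prioritization input.
manifest = DistributionManifest(
    presentation_urn="urn:example:presentation:1",
    assets=[
        AssetRef("urn:example:asset:chair", "visual", frequency=12),
        AssetRef("urn:example:asset:ambience", "audio", frequency=3),
    ],
)
```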
Fig. 12 illustrates a packetizer processing system 1200 in some examples. In the example of fig. 12, the packetizer 1202 divides the adapted media 1201 into a plurality of individual packets 1203 suitable for streaming to the immersive client 908, which is shown as a client endpoint 1204 on a network.
Fig. 13 illustrates a sequence diagram 1300, in some examples, of a network adapting particular immersive media in an ingest format into a streamable and suitable distribution format for a particular immersive media client endpoint.
The components and communications shown in fig. 13 are explained as follows: a client 1301 (also referred to as a client endpoint or client device in some examples) initiates a media request 1308 to a network orchestrator 1302 (also referred to as a network distribution interface in some examples). The media request 1308 includes information identifying the media requested by the client 1301, by a uniform resource name (URN) or other standard nomenclature. The network orchestrator 1302 responds to the media request 1308 with a profile request 1309, which asks the client 1301 to provide information about its currently available resources (including compute, storage, battery charge percentage, and other information characterizing the client's current operating state). The profile request 1309 also asks the client to provide one or more neural network models that, if available at the client, can be used by the network to perform neural network inference to extract or interpolate the correct media views to match the characteristics of the client's presentation system. A response 1310 from the client 1301 to the network orchestrator 1302 provides a client token, an application token, and one or more neural network model tokens (if such neural network model tokens are available at the client). The network orchestrator 1302 then provides a session ID token 1311 to the client 1301. The network orchestrator 1302 then queries the ingest media server 1303 with an ingest media request 1312, which includes the URN or standard nomenclature name of the media identified in request 1308. The ingest media server 1303 replies to request 1312 with a response 1313 that includes an ingest media token. The network orchestrator 1302 then provides the media token from response 1313 to the client 1301 in call 1314. The network orchestrator 1302 then initiates the adaptation process for media request 1308 by providing the ingest media token, the client token, the application token, and the neural network model token to the adaptation interface 1304 in call 1315. The adaptation interface 1304 requests access to the ingest media assets by providing the ingest media token to the ingest media server 1303 in call 1316. The ingest media server 1303 responds to request 1316 with an ingest media access token in response 1317 to the adaptation interface 1304. The adaptation interface 1304 then requests that the media adaptation module 1305 adapt the ingest media located at the ingest media access token for the client, application, and neural network inference model corresponding to the session ID token created at 1313. Request 1318 from the adaptation interface 1304 to the media adaptation module 1305 contains the required tokens and the session ID. In update 1319, the media adaptation module 1305 provides the adapted media access token and session ID to the network orchestrator 1302. The network orchestrator 1302 provides the adapted media access token and session ID to the grouping module 1306 in interface call 1320. The grouping module 1306 provides the grouped media access token and session ID to the network orchestrator 1302 in response message 1321. In response 1322, the grouping module 1306 provides the grouped assets, URN, and grouped media access token for the session ID to the grouped media server 1307.
Client 1301 executes request 1323 to initiate streaming of the media assets corresponding to the grouped media access token received in response message 1321. Client 1301 performs other requests and provides status updates to the network orchestrator 1302 in message 1324.
Fig. 14 shows a schematic diagram of a media system 1400 with a hypothetical network and a client device 1418 (also referred to as a game engine client device 1418) for scene-based media processing in some examples. In fig. 14, a smart client, such as that shown in MPEG smart client process 1401, can be used as a central orchestrator to prepare media for processing by other entities within game engine client device 1418 as well as entities residing outside of game engine client device 1418. In some examples, the smart client is implemented as software instructions executable by processing circuitry to perform a process, such as an MPEG smart client process 1401 (also referred to as an MPEG smart client 1401 in some examples). The game engine 1405 is primarily responsible for rendering media to create a presentation that the end user(s) will experience. The haptic component 1413, visualization component 1415, and audio component 1414 can assist the game engine 1405 in rendering haptic, video, and audio media, respectively. The edge processor or network orchestrator device 1408 may transmit information and system media to the MPEG smart client 1401 via the network interface protocol 1420, and receive status updates and other information from the MPEG smart client 1401. The network interface protocol 1420 may be partitioned across and employ multiple communication channels and processes. In some examples, game engine 1405 is a game engine device that includes control logic 14051, GPU interface 14052, physics engine 14053, renderer(s) 14054, compression decoder(s) 14055, and device-specific plug-in(s) 14056. The MPEG smart client 1401 also serves as the primary interface between the network and the client device 1418. For example, MPEG smart client 1401 may employ game engine APIs and callback functions 1417 to interact with game engine 1405. In one example, the MPEG smart client 1401 may be responsible for reconstructing the streamed media transferred in 1420 before invoking the game engine APIs and callback functions 1417, which are managed by the game engine control logic 14051 to cause the game engine 1405 to process the reconstructed media. In such an example, the MPEG smart client 1401 can utilize a client media reconstruction process 1402, which in turn can utilize a compression decoder process 1406.
In some other examples, the MPEG smart client 1401 may not be responsible for reconstructing the packetized media streamed in 1420 before calling the APIs and callback functions 1417. In such examples, the game engine 1405 may decompress and reconstruct the media. Further, in such examples, the game engine 1405 may employ compression decoder(s) 14055 to decompress the media. Upon receiving the reconstructed media, game engine control logic 14051 may employ GPU interface 14052 to render the media via renderer process(es) 14054.
In some examples, the rendered media is animated, in which case the game engine control logic 14051 may use the physics engine 14053 to simulate the laws of physics in the animation of the scene.
In some examples, the neural network processor 1403 may use the neural network model 1421 to facilitate operations orchestrated by the MPEG smart client 1401 throughout the processing of media by the client device 1418. In some examples, the reconstruction process 1402 may require the use of the neural network model 1421 and the neural network processor 1403 to fully reconstruct the media. Similarly, the client device 1418 may be configured by a user via the user interface 1412 to cache media received from the network in the client adapted media cache 1404 after the media has been reconstructed, or to cache the rendered media in the rendered client media cache 1407 once the media has been rendered. Further, in some examples, the MPEG smart client 1401 can replace system-provided visual/non-visual assets with user-provided visual/non-visual assets from the user-provided media cache 1416. In such embodiments, the user interface 1412 may direct the end user to perform steps to load the user-provided visual/non-visual assets from the user-provided media cache 1419 (e.g., external to the client device 1418) into the client-accessible, user-provided media cache 1416 (e.g., internal to the client device 1418). In some embodiments, the MPEG smart client 1401 may be configured to store rendered assets (for potential reuse or sharing with other clients) in the rendered media cache 1411.
In some examples, the media analyzer 1410 may examine the client-adapted media 1409 (in the network) for potential prioritization of the rendering performed by the game engine 1405 and/or of the reconstruction process via the MPEG smart client 1401, to determine the complexity of an asset or the frequency with which an asset is reused in one or more scenes (not shown). In such an example, the media analyzer 1410 would store the complexity, priority, and asset usage frequency information with the media stored in 1409.
It should be noted that in this disclosure, although a process is shown and described, the process may be implemented as instructions in a software module and the instructions may be executed by a processing circuit to perform the process. It should also be noted that in this disclosure, while a module is shown and described, the module may be implemented as a software module having instructions, and that these instructions may be executed by processing circuitry to perform the processing.
In some related examples, various techniques may address the problem of providing a smooth scene flow to a client, including using a Content Delivery Network (CDN) and edge network elements to reduce the delay between a client device's request for a scene and the appearance of the scene at the client device, and using cloud/edge network elements to offload the rendering computational load to a more powerful computational engine. Although these techniques may shorten the delay between the client device's request for a scene and the appearance of the scene at the client device, they rely on the availability of a CDN, and, whether or not a CDN and network edge devices are used, the immersive scene provider may have to rely on the terminal device requesting a scene before the data of the scene can be streamed to the terminal device.
According to a first aspect of the disclosure, a prioritized scene streaming method may be used for immersive media streaming to light field based displays (e.g., light field displays, holographic displays, etc.). In some examples, the priority of the scenes is determined by the relationship between the scenes.
In some examples, during operation, the media server may use adaptive streaming to stream media data to a terminal device, such as a terminal device having a light field based display. In one example, when the end device requests the entire immersive environment simultaneously, the ordering of scenes that become available may not provide an acceptable user experience.
Fig. 15 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples. In the example of fig. 15, the scene-based immersive media is represented by scene list 1501, scene list 1501 comprising a list of a plurality of individual scenes to be rendered for the immersive experience. The scene manifest 1501 is provided by the media server 1502 to the terminal device 1504 via the cloud 1503 (e.g., a network cloud). In the example of fig. 15, scenes are retrieved by the terminal device 1504 and sent to the terminal device 1504 in the order of scene numbers in the scene list 1501. For example, the scenes in scene list 1501 are ordered by scene 1, scene 2, scene 3, scene 4, scene 5, etc. Scenes are retrieved and transmitted to the terminal device 1504 in order of scene 1 (shown by 1505), scene 2 (shown by 1506), scene 3 (shown by 1507), scene 4 (shown by 1508), scene 5 (shown by 1509), and the like.
In some examples, each scene in the scene manifest may include, for example, in metadata of the scene: a complexity value indicating the complexity of the scene, and a priority value indicating the priority of the scene.
FIG. 16 shows a schematic diagram for adding complexity values and priority values to scenes in a scene manifest in some examples. In the example of fig. 16, a scene complexity analyzer 1602 is used to analyze the complexity of each scene and assign a complexity value to each scene, and then a scene prioritizer (1603) is used to assign a priority value to each scene. For example, the first scene manifest 1601 includes a plurality of scenes to be rendered for the immersive experience. The first scene list 1601 is processed by a scene complexity analyzer 1602, the scene complexity analyzer 1602 determines a complexity value (shown as SC) for each of a plurality of scenes, and then processed by a scene prioritizer 1603, the scene prioritizer 1603 assigning a Priority value (shown as Priority) to each of the plurality of scenes, thereby generating a second scene list 1604. The second scene list 1604 includes a plurality of scenes with corresponding complexity values and priority values. The plurality of scenes in the second scene list 1604 are reordered according to the complexity value and/or the priority value. For example, scene 5 is ordered as a fifth scene in the first scene list 1601 (as shown at 1605), and scene 5 is reordered as a third scene in the second scene list 1604 (as shown at 1606) according to the priority value.
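As a minimal sketch of how a scene complexity analyzer 1602 and a scene prioritizer 1603 might annotate and reorder a scene manifest, consider the following Python fragment. The complexity_of and priority_of callables stand in for whatever analysis policies are actually applied and are assumptions for illustration; the sort key (higher priority first, lower complexity breaking ties) is likewise only one possible choice.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scene:
    name: str
    complexity: float = 0.0   # SC value assigned by the scene complexity analyzer
    priority: int = 0         # Priority value assigned by the scene prioritizer

def analyze_and_prioritize(
    manifest: List[Scene],
    complexity_of: Callable[[Scene], float],
    priority_of: Callable[[Scene], int],
) -> List[Scene]:
    """Annotate each scene with SC and Priority, then reorder the manifest."""
    for scene in manifest:
        scene.complexity = complexity_of(scene)
        scene.priority = priority_of(scene)
    # Higher priority value first; lower complexity breaks ties.
    return sorted(manifest, key=lambda s: (-s.priority, s.complexity))
```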
Further, in some examples, a priority-aware media server may be used to stream scenes to terminal devices in the order of their priority values.
Fig. 17 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples. In the example of fig. 17, the scene-based immersive media is represented by scene manifest 1701, the scene manifest 1701 comprising a list of a plurality of individual scenes to be rendered for the immersive experience. Each of the plurality of scenes has an analyzed complexity value (shown by SC) and an assigned Priority value (shown by Priority). The scene manifest 1701 is provided by the priority aware media server 1702 to the terminal device 1704 via the cloud 1703. In one example, scenes are retrieved by the terminal device 1704 and sent to the terminal device 1704 in the order of priority values in the scene list 1701. For example, scenes in scene list 1701 are ordered according to priority values. Scenes in the order of scene 7 (shown by 1705), scene 8 (shown by 1706), scene 5 (shown by 1707), scene 2 (shown by 1708), scene 11 (shown by 1709), etc. are retrieved and transmitted to the terminal device 1704.
According to one aspect of the disclosure, a streaming technique based on scene priority can provide benefits compared to requesting all scenes in an immersive environment at the same time, because in many cases terminal devices may need to render scenes "out of order". For example, the scenes may be initially ordered in the most useful order typically used in an immersive environment, and the immersive environment may include branches, loops, etc., which may require rendering the scenes in an order different from the most useful order. In one example, when scenes are transmitted in the most useful order and there is enough bandwidth, an acceptable user experience may potentially be provided, although there may be situations where a desired scene is queued behind one or more previously requested scenes. In another example, when bandwidth is limited, a streaming technique based on scene priority may be more conducive to improving the user experience.
It should be noted that the scene prioritizer may use any suitable technique to determine the priority value of the scene. In some examples, the scene prioritizer may use relationships between scenes as part of prioritization.
Fig. 18 shows a schematic diagram of a virtual museum 1801 to illustrate scene priorities in some examples. The virtual museum 1801 has a ground floor 1802 and an upper floor 1803, and a stairwell 1804 between the ground floor 1802 and the upper floor 1803. In one example, the immersive tour begins at the lobby 1805. The lobby 1805 has five exits: a first exit for exiting the virtual museum 1801, a second exit for entering the stairwell 1804, a third exit for entering ground-floor exhibit E, a fourth exit for entering ground-floor exhibit D, and a fifth exit for entering ground-floor exhibit H on the ground floor 1802. In one example, the scene prioritizer may reasonably prioritize as follows: the scene of the lobby 1805 first, followed by the scene of the stairwell 1804 and the scenes of ground-floor exhibits E, D, and H, because ground-floor exhibits E, D, and H are the possible "next" scenes to be rendered when a virtual guest leaves the lobby 1805.
In some examples, the scene prioritizer may adjust scene priorities based on previous experiences with the same client or with other clients. Referring to fig. 18, in one example, the virtual museum 1801 presents a particularly popular exhibit (e.g., upper-floor exhibit I) in an upper-floor exhibit room 1808, and the scene prioritizer may observe many virtual guests heading straight for the popular exhibit. The scene prioritizer may then adjust the previously provided priorities. For example, the scene prioritizer may prioritize as follows: the scene of the lobby 1805 first, then the scene of the stairwell 1804, then the scenes of the upper floor 1803 that are on the way to the upper-floor exhibit room 1808 (e.g., the scene of upper-floor exhibit room J), and then the scene of upper-floor exhibit room I.
In some examples, the content description provided by the media server for the immersive media may include two portions: a media presentation description (MPD) describing a manifest of available scenes, various alternatives, and other features; and the multiple scenes themselves with their different assets. In some examples, the terminal device may first obtain the MPD of the immersive media to be played. The terminal device may parse the MPD and learn the various scenes with different assets, scene timings, media content availability, media types, the various encoded alternatives of the media content, the supported minimum and maximum bandwidths, and other content characteristics. Using the information obtained from the MPD, the terminal device may appropriately select which scene to render at what time and under what bandwidth availability. The terminal device may continuously measure bandwidth fluctuations and, based on the measurements, decide how to adapt to the available bandwidth by acquiring alternative scenes with fewer or more assets.
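The bandwidth-driven selection described above can be sketched as follows. This is a minimal illustration that assumes each scene alternative advertises an asset count and a minimum bandwidth; the SceneAlternative fields and the selection rule are assumptions for illustration rather than fields defined by an MPD schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SceneAlternative:
    scene_id: str
    asset_count: int          # richer alternatives carry more assets
    min_bandwidth_bps: int    # minimum bandwidth the alternative is intended for

def pick_alternative(
    alternatives: List[SceneAlternative],
    measured_bandwidth_bps: float,
) -> Optional[SceneAlternative]:
    """Pick the richest alternative that fits the measured bandwidth."""
    feasible = [a for a in alternatives
                if a.min_bandwidth_bps <= measured_bandwidth_bps]
    if not feasible:
        return None  # caller may fall back to the sparsest alternative
    return max(feasible, key=lambda a: a.asset_count)
```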
In some examples, the priority value of a scene in the scene-based immersive media is defined by a server device or a sender device and may be changed by a terminal device during a session for playing the scene-based immersive media.
According to a second aspect of the present disclosure, a priority value for a scene in a scene-based immersive media for a light field based display may be dynamically adjusted. In some examples, the priority value may be updated based on feedback from a terminal device (also referred to as a client device). For example, the scene prioritizer may receive feedback from the terminal device and may dynamically change the priority value of the scene based on the feedback.
Fig. 19 illustrates a scene graph of scene-based immersive media (e.g., immersive game 1901) in some examples. The relationship between scenes is represented by the dashed lines connecting the scenes. In one example, the game character 1902 appears in the production region 1903, and the scene with the highest priority is then the scene of the production region 1903. The scene of the production region 1903 may be followed by the scenes directly connected to the production region 1903 where the game character 1902 begins, such as scene 1, scene 11, and scene 20. When the game character 1902 moves to scene 1 (shown by 1904), the scenes directly connected to scene 1, such as scene 2 (shown by 1905), now have the highest priority. When the game character 1902 moves to scene 2, the scenes directly connected to scene 2, such as scene 12 and scene 5, now have the highest priority.
In some examples, when the game character 1902 moves to a new scene that has not yet been streamed to the terminal device, the terminal device may send the current location of the game character to the scene prioritizer so that the current location can be included in the prioritization.
Fig. 20 shows a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples. In the example of fig. 20, the scene-based immersive media is represented by scene manifest 2001, scene manifest 2001 comprising a list of multiple individual scenes to be rendered for the immersive experience. Each of the plurality of scenes has an analyzed complexity value (shown by SC) and an assigned Priority value (shown by Priority). Scene manifest 2001 is provided by the priority aware media server 2002 to the terminal device 2004 via the cloud 2003. In the example of fig. 20, scenes are retrieved and transmitted to the terminal device 2004 in the order of the priority values in the scene manifest 2001.
Further, using the immersive game of fig. 19 as an example, the terminal device 2004 may send the current position 2010 of the game character to the scene prioritizer 2011. Then, the scene prioritizer 2011 may include the current position information in the calculation for updating the scene priorities in the scene list 2001.
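A minimal sketch of this feedback loop, assuming the scene prioritizer keeps the scene graph as an adjacency map and boosts the priority of scenes directly reachable from the reported position, is shown below. The function, the boost value, and the scene names are hypothetical.

```python
from typing import Dict, Set

def update_scene_priorities(
    adjacency: Dict[str, Set[str]],   # scene graph: scene -> directly connected scenes
    priorities: Dict[str, int],       # current priority value per scene
    current_scene: str,               # position fed back by the terminal device
    boost: int = 100,
) -> Dict[str, int]:
    """Raise the priority of scenes directly reachable from the reported position."""
    updated = dict(priorities)
    for neighbor in adjacency.get(current_scene, set()):
        updated[neighbor] = max(updated.get(neighbor, 0), boost)
    return updated

# Example loosely mirroring fig. 19: when the character is reported in scene 1,
# scene 2 (directly connected to scene 1) receives the highest priority.
graph = {"scene1": {"scene2"}, "scene2": {"scene12", "scene5"}}
priorities = {"scene1": 10, "scene2": 5, "scene12": 5, "scene5": 5}
priorities = update_scene_priorities(graph, priorities, current_scene="scene1")
```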
It should be noted that the techniques for adaptive streaming of light field based media as described above may be implemented as computer software using computer readable instructions and physically stored in one or more computer readable media.
In some examples, a method for adaptive streaming of a light field based display includes assigning respective priority values to a plurality of scenes in an immersive media of the light field based display, and adaptively streaming the plurality of scenes based on the priority values. In one example, multiple scenes are transmitted (e.g., streamed) based on the assigned priority value. In some examples, the priority value of the scene is assigned based on a likelihood that the scene needs to be rendered. In some examples, the priority value of the scene is dynamically adjusted based on feedback from the terminal device. For example, the priority value of a scene may be adjusted based on the position of the game character (i.e., feedback from the client device).
In some examples, in response to the available network bandwidth being limited, the scene with the highest priority is selected for streaming as the next scene. In some examples, in response to the available network bandwidth being limited, the method includes identifying a plurality of scenes that are unlikely to be needed and avoiding streaming the identified scenes until the likelihood that they are needed exceeds a threshold.
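The bandwidth-limited behavior described in this paragraph might look like the following sketch, in which the next scene is chosen from the not-yet-streamed scenes and unlikely scenes are skipped while bandwidth is constrained. The likelihood values and the threshold are assumptions for illustration.

```python
from typing import Dict, Optional, Set

def next_scene_to_stream(
    priorities: Dict[str, int],       # priority value per scene
    already_streamed: Set[str],
    likelihood: Dict[str, float],     # estimated likelihood each scene is needed next
    bandwidth_limited: bool,
    likelihood_threshold: float = 0.2,
) -> Optional[str]:
    """Select the next scene, skipping unlikely scenes when bandwidth is tight."""
    candidates = {s: p for s, p in priorities.items() if s not in already_streamed}
    if bandwidth_limited:
        candidates = {s: p for s, p in candidates.items()
                      if likelihood.get(s, 0.0) >= likelihood_threshold}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```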
Fig. 21 shows a flowchart outlining a process 2100 according to an embodiment of the present disclosure. Process 2100 may be performed in a network, for example by a server device in the network. In some embodiments, process 2100 is implemented in software instructions, so when the processing circuitry executes the software instructions, the processing circuitry executes process 2100. The process starts from S2101 and proceeds to S2110.
At S2110, scene-based immersive media is received for play on the light field-based display. Scene-based immersive media includes a plurality of scenes.
At S2120, priority values are assigned to a plurality of scenes in the scene-based immersive media, respectively.
At S2130, an ordering to stream the plurality of scenes to a terminal device having a light field based display is determined according to the priority value.
In some examples, the plurality of scenes have an initial ordering in the scene manifest and are reordered according to the priority value and the reordered scenes are transmitted (streamed) to the terminal device.
In some examples, the priority aware network device may select a highest priority scene having a highest priority value from a subset of untransmitted scenes of the plurality of scenes and transmit the highest priority scene to the terminal device.
In some examples, the priority value of the scene is determined based on a likelihood that the scene needs to be rendered.
In some examples, it is determined that the available network bandwidth is limited. The highest priority scene having the highest priority value is selected from a subset of the untransmitted scenes of the plurality of scenes. The highest priority scenario is transmitted in response to the available network bandwidth being limited.
In some examples, it is determined that the available network bandwidth is limited. A subset of the plurality of scenes that is unlikely to be needed for a next rendering is identified based on the priority value. In response to the available network bandwidth being limited, streaming the subset of the plurality of scenes is avoided.
In some examples, a first priority value is assigned to a first scene based on a second scene having a second priority value, in response to a relationship between the first scene and the second scene.
In some examples, a feedback signal is received from a terminal device. At least a priority value of a scene of the plurality of scenes is adjusted based on the feedback signal. In one example, the highest priority is assigned to the first scene in response to a feedback signal indicating that the current scene is a second scene associated with the first scene. In another example, the feedback signal indicates a priority adjustment determined by the terminal device.
Then, the process 2100 proceeds to S2199 and ends.
Process 2100 may be suitably adapted to various scenarios, and the steps in process 2100 may be adjusted accordingly. One or more steps of process 2100 may be modified, omitted, repeated, and/or combined. Process 2100 may be implemented using any suitable order. Additional step(s) may be added.
Fig. 22 shows a flowchart outlining a process 2200 in accordance with an embodiment of the present disclosure. Process 2200 can be performed by a terminal device (also referred to as a client device). In some embodiments, process 2200 is implemented in software instructions, so that when the processing circuitry executes the software instructions, the processing circuitry performs process 2200. The process starts from S2201 and proceeds to S2210.
At S2210, a terminal device having a light field based display receives a Media Presentation Description (MPD) of scene-based immersive media for play by the light field based display. The scene-based immersive media includes a plurality of scenes, and the MPD indicates that the plurality of scenes are streamed to the terminal device in a ranking.
At S2220, bandwidth availability is detected.
At S2230, a ranking change of at least one scene is determined based on the bandwidth availability.
At S2240, a feedback signal indicating a ranking change of at least one scene is transmitted.
In some examples, the feedback signal indicates a next scene to be rendered.
In some examples, the feedback signal indicates an adjustment of a priority value of the at least one scene.
In some examples, the feedback signal indicates the current scene.
Then, the process 2200 proceeds to S2299 and ends.
Process 2200 may be suitably adapted to various scenarios, and the steps in process 2200 may be adjusted accordingly. One or more steps of process 2200 may be modified, omitted, repeated, and/or combined. Process 2200 may be implemented using any suitable order. Additional step(s) may be added.
Some aspects of the present disclosure also provide prioritized asset streaming methods for immersive media streaming (e.g., immersive media streaming for light field-based displays).
It should be noted that, in response to a client request from a client device, the appearance of a scene at the client device (also referred to as a terminal device) may depend on the availability, at the client device, of the assets on which the scene depends, so that the client device can render the scene. To minimize the amount of time required to transfer the assets to the client device, the assets may be prioritized in some examples.
According to a third aspect of the disclosure, assets in a scene-based immersive media (e.g., scene-based immersive media for a light field-based display, such as a light field display, a holographic display, etc.) may be prioritized based on attributes of the assets (e.g., the asset size of the assets).
In some examples, during operation, the media server may use adaptive streaming for light field based displays. Adaptive streaming relies on the transfer of scenes to terminal devices and, in turn, on the transfer of assets that make up each scene.
FIG. 23 shows a schematic diagram of a scene manifest 2301 with a mapping of scenes to assets in one example. The scene manifest 2301 includes a plurality of scenes 2303, and each of the plurality of scenes 2303 is dependent on one or more assets in the asset set 2302. For example, scenario 2 relies on asset B, asset D, asset E, asset F, and asset G in asset set 2302.
In some examples, when a terminal device requests the assets of a scene, the assets are streamed to the terminal device in a ranked order.
FIG. 24 shows a schematic diagram of streaming assets for a scene in some examples. In fig. 24, scene 2401 relies on a plurality of assets, such as asset A (shown by 2405), asset B (shown by 2406), asset C (shown by 2407), and asset D (shown by 2408). The plurality of assets are provided by the media server 2402 to the terminal device 2404 through the cloud 2403. In the example of FIG. 24, the plurality of assets in scene 2401 are ordered as asset A, asset B, asset C, and asset D. Following the ordering of the assets in scene 2401, e.g., in the order of asset A, asset B, asset C, and asset D, the plurality of assets are retrieved and sent to the terminal device 2404. The ordering in which assets become available to the terminal device may not provide an acceptable user experience.
According to one aspect of the disclosure, an asset prioritizer may be used to analyze assets required by a scene and assign a priority value to each asset before the assets of the scene are streamed to a terminal device.
FIG. 25 shows a schematic diagram for reordering assets in a scene in some examples. In the example of fig. 25, an asset prioritizer 2502 is used to assign a priority value to each asset in a scene. For example, a scene relies on four assets, referred to as asset A (as shown at 2504), asset B (as shown at 2505), asset C (as shown at 2506), and asset D (as shown at 2507). Initially, the four assets are ordered as asset A, asset B, asset C, and asset D, as shown at 2501. Asset prioritizer 2502 assigns a first priority value (e.g., 2) to asset A, a second priority value (e.g., 4, the highest priority value) to asset B, a third priority value (e.g., 1, the lowest priority value) to asset C, and a fourth priority value (e.g., 3) to asset D. The four assets may be reordered according to the priority values, e.g., ordered as asset B, asset D, asset A, and asset C, as shown at 2503.
In one example, the reordered assets may be streamed to the terminal device by the media server.
It should be noted that while greater values are used in this disclosure to represent higher priorities, other suitable priority numbering techniques may be used to present asset priorities.
Fig. 26 shows a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples. In the example of fig. 26, the scene of the scene-based immersive media includes four assets that are reordered according to the priority values of the four assets, e.g., from high to low as asset B, asset D, asset A, and asset C, as shown at 2601. The media server 2602 provides the four assets to the terminal device 2604 in the order of the priority values through the cloud 2603. For example, the four assets are retrieved and sent to terminal device 2604 in the order of asset B (shown by 2605), asset D (shown by 2606), asset A (shown by 2607), and asset C (shown by 2608).
In accordance with one aspect of the disclosure, asset prioritizer 2502 may assign a priority value to assets based on the size of each asset in bytes. In one example, the smallest asset is assigned the highest priority value so that a series of assets can be rendered in an order that shows more assets (e.g., smaller assets) to the end user faster while the terminal device is still waiting for the larger assets to arrive. Taking fig. 26 as an example, the number of bytes of asset B is the lowest (smallest size) and the number of bytes of asset C is the highest (largest size). Comparing the first streaming ordering in fig. 24 with the second streaming ordering in fig. 26, terminal device 2604 may begin displaying assets earlier than terminal device 2404, because in fig. 24 the relatively larger assets are streamed first and terminal device 2404 may have nothing to display before the relatively larger assets arrive.
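A minimal sketch of size-based prioritization, assuming asset sizes are known in bytes and that a larger number means a higher priority (as noted above), is given below. The byte sizes in the example are made up for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Asset:
    name: str
    size_bytes: int
    priority: int = 0

def prioritize_by_size(assets: List[Asset]) -> List[Asset]:
    """Smallest asset gets the highest priority value; stream smallest first."""
    ranked = sorted(assets, key=lambda a: a.size_bytes)
    for index, asset in enumerate(ranked):
        # Highest number = highest priority, matching the convention above.
        asset.priority = len(ranked) - index
    return ranked

# Example mirroring figs. 25-26 (sizes are illustrative only):
scene_assets = [Asset("A", 40_000_000), Asset("B", 5_000_000),
                Asset("C", 120_000_000), Asset("D", 20_000_000)]
stream_order = prioritize_by_size(scene_assets)   # B, D, A, C
```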
In some examples, the prioritization assigned by the asset prioritizer may also be included in a manifest in the MPD. In some examples, asset priority values for assets in a scene in the scene-based immersive media may be defined by a server device or a sender device, and the asset priority values may be changed by a client device during a session for playing the scene-based immersive media.
Some aspects of the present disclosure provide a method for dynamic adaptive streaming of a light field based display (e.g., a light field display or a holographic display) according to an assigned asset priority value. Assets with assigned asset priority values may be transmitted (streamed) based on the asset priority values. In some examples, asset priority values are assigned to assets based on the size of the assets. In some examples, the asset priority value is adjusted based on feedback from one or more client devices.
According to a fourth aspect of the present disclosure, assets in a scene-based immersive media, such as scene-based immersive media for a light field-based display (e.g., a light field display, a holographic display, etc.), may be prioritized based on asset visibility.
According to one aspect of the disclosure, not all assets in a stream are equally meaningful at any instant in time, as some assets may not be visible to an end user at a certain instant in time. In some examples, prioritizing visible assets over hidden assets in a stream may reduce delays in availability of meaningful assets (e.g., visible assets) to terminal devices (e.g., client devices). In some examples, the priority value of the asset is determined from the visibility of the asset from the camera location (such as a default entry location of the scene, etc.).
In some examples, some assets of a scene are immediately visible when a virtual character (also referred to as a camera in some examples) associated with an end user enters the scene at a portal location (e.g., a default entry location of the scene), but other assets may not be visible when viewed from the portal location or when the camera is placed at the portal location.
Fig. 27 shows a schematic diagram of a scene 2701 in some examples. The camera represents a virtual character having a viewing position corresponding to the end user. The camera enters scene 2701 at default entry position 2702. Scene 2701 includes asset A, asset B, asset C, and asset D. Asset A is a set of bookcases from floor to ceiling, as shown at 2703. Asset B is a computer desk, as shown at 2704. Asset C is a conference table, as shown at 2705. Asset D is an entry into another scene, as shown at 2706.
In the example of fig. 27, asset B is not visible from the default entry position 2702 because asset B is fully occluded by asset A. In fig. 27, asset B is shaded to indicate that asset B is not visible to the camera at the default entry position 2702.
In some examples, the asset prioritizer is to analyze the assets of the scene and assign a priority value to each asset based on the visibility of the asset relative to a default entry location of the scene before the assets of the scene are streamed to the terminal device.
Fig. 28 shows a schematic diagram for reordering assets in a scene (e.g., scene 2701) in some examples. In the example of fig. 28, an asset prioritizer 2802 is used to assign a priority value to each asset in a scene. For example, a scene (e.g., scene 2701 in fig. 27) relies on asset A (as shown at 2804), asset B (as shown at 2805), asset C (as shown at 2806), and asset D (as shown at 2807). Initially, the four assets are ordered as asset A, asset B, asset C, and asset D, as indicated at 2801. Asset prioritizer 2802 analyzes the visibility of the assets from the default entry location of the scene. For example, asset A, asset C, and asset D are visible from the default entry location, and asset B is not visible from the default entry location. Asset prioritizer 2802 assigns a priority value to each asset based on visibility. For example, asset prioritizer 2802 assigns high priority values to asset A, asset C, and asset D, and a low priority value to asset B. The four assets may be reordered according to the priority values, e.g., ordered as asset A, asset C, asset D, and asset B, as indicated at 2803.
In one example, the reordered assets may be streamed to the terminal device by the media server.
Fig. 29 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples. In the example of fig. 29, a scene (e.g., scene 2701) of the scene-based immersive media includes four assets reordered according to the priority values of the four assets, e.g., from high to low as asset A, asset C, asset D, and asset B, as shown at 2901. The media server 2902 provides the four assets to the terminal device 2904 in the order of the priority values through the cloud 2903. For example, the four assets are retrieved and sent to terminal device 2904 in the order of asset A (shown by 2905), asset C (shown by 2906), asset D (shown by 2907), and asset B (shown by 2908).
According to one aspect of the disclosure, asset prioritizer 2802 may assign a priority value to assets in a scene based on the visibility of the assets. In one example, the highest priority value is assigned to the visible assets so that the visible assets are streamed first and can reach the end user for rendering faster. Comparing the first streaming ordering in fig. 24 with the second streaming ordering in fig. 29, terminal device 2904 may begin displaying the visible assets of the scene earlier than terminal device 2404.
In some examples, asset prioritizer 2802 may assign a priority value to an asset based on whether the asset is blocked from the camera view by another asset. The priority value may be assigned according to a policy of "first visible asset, followed by invisible asset after all visible assets" so that a series of assets may be rendered with a ranking that shows all visible assets to the end user more quickly.
In some examples, asset prioritizer 2802 may assign a priority value to an asset based on whether the asset is obscured by other assets, as the other assets are located between the asset and a light source. The priority value may be assigned according to a policy of "first visible asset, followed by invisible asset after all visible assets" so that a series of assets may be rendered with a ranking that shows all visible assets to the end user more quickly.
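A sketch of the "visible assets first, invisible assets after all visible assets" policy described in the two preceding paragraphs follows. The two predicates stand in for whatever occlusion and shadow tests the asset prioritizer actually performs against the scene geometry and light sources; they are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Asset:
    name: str

def prioritize_by_visibility(
    assets: List[Asset],
    occluded_from_camera: Callable[[Asset], bool],  # hidden behind another asset
    in_shadow: Callable[[Asset], bool],             # hidden by shadow from a light source
) -> List[Asset]:
    """Order assets so that visible ones come before occluded or shadowed ones."""
    def visible(asset: Asset) -> bool:
        return not occluded_from_camera(asset) and not in_shadow(asset)
    # Stable sort: visible assets keep their original relative order, then the rest.
    return sorted(assets, key=lambda a: 0 if visible(a) else 1)
```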
Fig. 30 shows a schematic diagram of a scene 3001 in some examples. Scene 3001 is similar to scene 2701 in fig. 27. The camera represents a virtual character having a viewing position corresponding to the end user. The camera enters scene 3001 at a default entry location 3002. Scene 3001 includes asset A, asset B, asset C, and asset D. Asset A is a set of bookcases from floor to ceiling, as shown at 3003. Asset B is a computer desk, as shown at 3004. Asset C is a conference table, as shown at 3005. Asset D is an entry into another scene, as shown at 3006. Asset B is not visible from the default entry location 3002 because asset B is fully occluded by asset A. In fig. 30, asset B is shaded to indicate that asset B is not visible to the camera at the default entry location 3002.
Fig. 30 further illustrates a light source 3007. Because of the location of the light source 3007, both the conference table 3005 (e.g., between 3008 and 3009) and the entry 3006 are obscured by the shadow of the bookcase 3003. Thus, asset C and asset D are not visible to the camera, and in one example, asset prioritizer 2802 may therefore decrease the priority of asset C and asset D.
In some examples, the media server may have a two-part content description: media Presentation Descriptions (MPDs) describing a manifest of available scenes, various alternatives, and other features; and multiple scenarios with different assets. The prioritization assigned by the asset prioritizer may also be included in the manifest of the MPD.
In some examples, the priority value of an asset in a scene of the scene-based immersive media may be defined by a server device or a sender device. The priority value of the asset may be included in the MPD and provided to the client device. In one example, the priority value may be changed by the client device during a session in which the scene-based immersive media is played.
Some aspects of the present disclosure provide a method for dynamic adaptive streaming of light field based displays according to assigned asset priority values. Assets may be transmitted (e.g., streamed) based on the assigned asset priority value. In some examples, the asset priority value assigned to the asset is reduced in response to the asset being blocked from the camera view by one or more other assets and thus not being visible. In some examples, the asset priority value assigned to an asset is reduced when the asset is obscured from one or more light sources and not visible to the camera by one or more other assets. In some examples, the priority value (asset priority value) is adjusted based on feedback (e.g., updated camera position) from the client device.
According to a fifth aspect of the disclosure, assets in a scene-based immersive media, such as scene-based immersive media for a light field-based display (e.g., a light field display, a holographic display, etc.), may be prioritized based on a plurality of policies.
According to one aspect of the disclosure, not all assets in a stream arrive at a terminal device at the same time for various reasons, including but not limited to differences in asset size. Moreover, not all assets have equal value at any time for a variety of reasons, including but not limited to, some assets being obscured by other assets. Clients may wish to prioritize assets for various reasons.
In one example, given a sufficiently detailed manifest describing the assets used in a scene, a client's terminal device may itself have sufficient local resources to perform all asset prioritization. In many foreseeable cases, however, the terminal device may not be able to do so, for various reasons (e.g., limited available processor power, or the need to offload asset prioritization to preserve available battery life). Some aspects of the present disclosure provide a method of prioritizing assets for streaming on behalf of a terminal device of a client. For example, the priority value of an asset is assigned outside the terminal device of the client and is determined by one or more prioritization schemes.
In some examples, an asset prioritizer is used to analyze assets required by a scene and assign priorities to each asset before the assets of the scene are streamed to a terminal device.
FIG. 31 shows a schematic diagram for assigning priority values to assets in a scene in some examples. In the example of fig. 31, an asset prioritizer 3102 is used to assign one or more priority values to each asset in a scene. For example, a scene (e.g., scene 2701 in fig. 27) relies on asset A (as shown at 3105), asset B (as shown at 3106), asset C (as shown at 3107), and asset D (as shown at 3108). Initially, the four assets are ordered as asset A, asset B, asset C, and asset D, as shown at 3101. Asset prioritizer 3102 assigns priority values to the assets according to a plurality of policies. The assets are reordered according to the assigned priority values. In FIG. 31, assets A-D are reordered as asset A, asset C, asset D, and asset B according to the assigned priority values, as shown at 3103. It should also be noted that each asset in 3103 now carries one or more priority values assigned by asset prioritizer 3102, e.g., two priority values shown by +P1+P2 for each asset. In one example, asset prioritizer 3102 may assign a first set of priority values (e.g., P1) to each asset in the scene according to a first priority policy and a second set of priority values (e.g., P2) to each asset in the scene according to a second priority policy. The assets may be reordered according to the first set of priority values, or the assets may be reordered according to the second set of priority values.
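The multi-policy assignment described above might be sketched as follows, with one priority value attached per policy (e.g., P1 and P2) so that the media server can later order the same assets differently for different terminal devices. The example policies (size-based P1, visibility-based P2) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Asset:
    name: str
    size_bytes: int
    visible: bool
    priorities: Dict[str, int] = field(default_factory=dict)  # e.g. {"P1": 3, "P2": 1}

def apply_policies(assets: List[Asset],
                   policies: Dict[str, Callable[[Asset], int]]) -> None:
    """Attach one priority value per policy to every asset."""
    for asset in assets:
        for name, policy in policies.items():
            asset.priorities[name] = policy(asset)

def order_for_client(assets: List[Asset], policy_name: str) -> List[Asset]:
    """The media server picks whichever policy suits a given terminal device."""
    return sorted(assets, key=lambda a: -a.priorities[policy_name])

# Illustrative policies: P1 favors small assets, P2 favors visible assets.
example_policies = {
    "P1": lambda a: -a.size_bytes,
    "P2": lambda a: 1 if a.visible else 0,
}
```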
In some examples, after the assets are prioritized, the assets are requested and streamed according to the prioritization.
Fig. 32 shows a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples. In the example of fig. 32, the scene of the scene-based immersive media includes four assets that are reordered according to priority values of the four assets, e.g., from high to low as asset a, asset C, asset D, and asset B, as shown at 3201. The media server 3202 provides the four assets to the terminal device 3204 through the cloud 3203 in order of priority values. For example, the four assets are retrieved and sent to terminal device 3204 in the order of asset a (shown by 3205), asset C (shown by 3206), asset D (shown by 3207), and asset B (shown by 3208).
According to one aspect of the disclosure, the terminal device can begin rendering the scene faster when the assets of the scene are requested in prioritized order.
It should be noted that asset prioritizer 3102 may use any ranking criteria that provide an improved user experience as a basis for assigning priority values to assets.
In some examples, asset prioritizer 3102 may use more than one prioritization scheme and assign more than one prioritization value to each asset so that media server 3202 may then use the most appropriate prioritization scheme for each terminal device 3204.
In one example, where terminal device 3204 is at the end of a relatively long and/or low bit rate network path from media server 3202, terminal device 3204 may benefit from a prioritization scheme that prioritizes assets from the smallest file size to the largest file size, so that the terminal device may quickly begin rendering more assets without waiting for larger assets to arrive. In another example, where the terminal device 3204 is at the end of a shorter and faster network path from the media server 3202, the asset prioritizer 3102 may assign individual priorities to each asset based on whether the asset is blocked from the camera view by another asset. The priority policy may be as simple as "visible assets first, followed by invisible assets", so that the media server 3202 may provide a series of assets in a ranking that allows the terminal device to display all visible assets to the end user more quickly.
In some examples, media server 3202 may include the assigned priority values for the assets in a description (e.g., the MPD) transmitted to terminal device 3204. When two or more assets have the same client-assigned priority value (e.g., under a prioritization scheme with a small number of different possible values, such as "visible"/"invisible"), the client at terminal device 3204 may use an additional prioritization scheme not supported by asset prioritizer 3102 while relying on the assigned priority values in the description as a tie breaker.
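The tie-breaking behavior described above can be sketched with a compound sort key: the client's own (possibly coarse) scheme is applied first, and the priority value assigned by the network-side asset prioritizer breaks ties. The dictionaries and their value conventions are assumptions for illustration.

```python
from typing import Dict, List, Tuple

def client_order(
    assets: List[str],
    client_priority: Dict[str, int],   # coarse client scheme, e.g. visible=1 / hidden=0
    server_priority: Dict[str, int],   # finer value assigned by the asset prioritizer
) -> List[str]:
    """Order by the client's own scheme; use the server-assigned value as a tie breaker."""
    def key(name: str) -> Tuple[int, int]:
        return (-client_priority[name], -server_priority[name])
    return sorted(assets, key=key)
```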
Some aspects of the present disclosure provide a method for dynamic adaptive streaming for light field based displays based on assigned asset priority values. In some examples, the asset ordering in asset streaming may be based on one of a plurality of assigned asset priority values assigned using different asset prioritization schemes. In some examples, the same set of assets may be transmitted to different clients in different orderings based on priority values assigned to each asset in the set using different asset prioritization schemes. In some examples, the priority values assigned by the asset prioritizer are transmitted to the client along with the asset descriptions, allowing the client to use an asset prioritization scheme that is not supported by the asset prioritizer while relying on the asset prioritizer's assigned priority values to break ties between assets having the same client-assigned asset priority value. In some examples, the priority values may be adjusted by the asset prioritizer based on feedback from the client.
According to a sixth aspect of the present disclosure, assets in a scene-based immersive media, such as scene-based immersive media for a light field-based display (e.g., a light field display, a holographic display, etc.), may be prioritized based on distances within a field of view.
According to one aspect of the disclosure, not all assets in the stream are equally significant at any time, because some assets may lie outside the user's field of view (e.g., the view of an avatar or a virtual camera representing the user) when the user enters the scene. To minimize the delay incurred until the assets that are most visible and most apparent to the end user become available to the terminal device, assets visible within the field of view at the scene's default entry location may be prioritized. In some examples, assets in a scene may be prioritized according to whether the assets are within the field of view of a virtual camera of a renderer, and the distance of the assets from the virtual camera location used to render the view presented to the user.
Fig. 33 shows a schematic diagram of a scene 3301 in some examples. Scene 3301 is similar to scene 2701 in fig. 27. The camera represents a virtual character having a viewing position corresponding to the end user. The camera enters scene 3301 at a default entry position 3302 (also referred to as an initial position). Scene 3301 includes asset A, asset B, asset C, and asset D. Asset A is a set of bookcases from floor to ceiling, as shown at 3303. Asset B is a computer desk, as shown at 3304. Asset C is a conference table, as shown at 3305. Asset D is an entry into another scene, as shown at 3306.
When the virtual camera enters scene 3301 at the default entry position 3302, some assets are immediately visible, but other assets may not be. The field of view 3307 of the virtual camera is limited and does not include the entire scene. In the example of fig. 33, asset A and asset B are visible from the initial position 3302 of the virtual camera, since asset A and asset B are within the field of view 3307 of the virtual camera. Note that asset A is farther from the initial position 3302 of the virtual camera than asset B.
In fig. 33, asset C and asset D are not visible from the initial position 3302 of the virtual camera because asset C and asset D are outside the field of view 3307 of the virtual camera.
In some examples, the asset prioritizer is to analyze assets required for the scene before the assets for the scene are streamed to the terminal device and assign a priority value to each asset according to a field of view associated with a default entry location of the scene and a distance of the asset to the default entry location of the scene.
FIG. 34 shows a schematic diagram for reordering assets in a scene in some examples. In the example of fig. 34, an asset prioritizer 3402 is used to assign a priority value to each asset in a scene. For example, a scene (e.g., scene 3301 in fig. 33) relies on asset A (as shown at 3404), asset B (as shown at 3405), asset C (as shown at 3406), and asset D (as shown at 3407). Initially, the four assets are ordered as asset A, asset B, asset C, and asset D, as shown at 3401. Asset prioritizer 3402 analyzes the assets according to the field of view at the default entry location of the scene and the distance, within the field of view, from the default entry location. In some examples, the priority value of an asset may be determined based on whether the asset is visible within the camera field of view. In some examples, the priority value of an asset may be based on the relative distance of the asset from the camera. For example, asset A and asset B are within the field of view and visible from the default entry location, while asset C and asset D are outside the field of view and not visible from the default entry location. Further, asset B is a shorter distance from the default entry location than asset A. The four assets may be reordered according to the priority values, e.g., ordered as asset B, asset A, asset C, and asset D, as shown at 3403.
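A minimal sketch of this combined field-of-view and distance policy is given below; it assumes an in_view flag (i.e., whether the asset falls inside the camera frustum) and asset positions are available, both of which are illustrative assumptions.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

@dataclass
class Asset:
    name: str
    position: Tuple[float, float, float]
    in_view: bool   # whether the asset falls inside the camera field of view

def prioritize_by_view_and_distance(
    assets: List[Asset],
    camera_position: Tuple[float, float, float],
) -> List[Asset]:
    """In-view assets first, nearest to the camera first; out-of-view assets last."""
    return sorted(
        assets,
        key=lambda a: (0 if a.in_view else 1,
                       dist(a.position, camera_position)),
    )
```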
In one example, the reordered assets may be streamed to the terminal device by the media server.
Fig. 35 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples. In the example of fig. 35, the scene of the scene-based immersive media includes four assets that are reordered according to the priority values of the four assets, e.g., from high to low as asset B, asset A, asset C, and asset D, as shown at 3501. The media server 3502 provides these four assets to the terminal device 3504 via the cloud 3503 in the order of the priority values. For example, the four assets are retrieved and sent to terminal device 3504 in the order of asset B (shown by 3505), asset A (shown by 3506), asset C (shown by 3507), and asset D (shown by 3508).
Comparing the first streaming ordering in fig. 24 with the second streaming ordering in fig. 35, the visible assets are streamed first in fig. 35, so terminal device 3504 may render the visible assets of the scene faster than terminal device 2404.
In some examples, the asset prioritizer 3402 may assign a priority value of an asset based on a field of view of the virtual camera and a distance from the virtual camera such that the asset closest to the user within the field of view is rendered first.
In some examples, assets outside of the field of view may not be scheduled for transmission to the terminal device 3504 until all assets within the field of view have been scheduled for transmission. There may be a delay before assets outside the field of view are scheduled for transmission. In one extreme case, if a user quickly views a scene and exits the scene to enter another scene, the assets that have not yet been scheduled for transmission may not be scheduled until the user reenters the scene.
Fig. 36 shows a schematic diagram of a displayed scene 3601 in some examples. Displayed scene 3601 corresponds to scene 3301. In displayed scene 3601, the bookcase 3603 and the workstation 3604 have been retrieved and rendered because the bookcase 3603 and the workstation 3604 are the assets in scene 3601 that are visible to the camera from the default entry location.
Some aspects of the present disclosure provide a method for dynamic adaptive streaming of light field based displays based on an assigned asset priority value determined by whether an asset is within the field of view of a camera (e.g., a virtual camera corresponding to a user or player of an immersive media experience). In some examples, the assigned asset priority value of an asset is determined by the distance of the asset from the camera location. The assets may be transmitted in an order determined based on the asset priority values. In some examples, the asset priority value of an asset is lowered when the asset is outside the field of view of the camera. In some examples, the asset priority value of an asset is lowered when the asset is farther from the camera than another asset within the field of view. In some examples, the priority values are adjusted based on feedback from one or more client devices.
According to a seventh aspect of the present disclosure, assets in a scene-based immersive media, such as scene-based immersive media for a light field-based display (e.g., a light field display, a holographic display, etc.), may be prioritized based on asset complexity.
According to one aspect of the disclosure, not all assets in a scene are equally complex, and they do not all require the same amount of computation from computing resources (e.g., Graphics Processing Units (GPUs)) to render the assets for display on an end user's terminal device. To minimize the delay incurred until the scene can be presented to the end user, in some examples, the assets that make up the scene may be prioritized for streaming based on the relative complexity of each asset presented in the scene description.
Fig. 37 shows a schematic diagram of a scene 3701 in a scene-based immersive media in one example. Scene 3701 includes four assets: asset A, asset B, asset C, and asset D. In fig. 37, asset A is a modern house with a smooth appearance, as shown by 3705; asset B is a realistic person, as shown by 3706; asset C is a non-realistic humanoid statue, as shown in 3707; and asset D is a highly detailed tree line, as shown in 3708.
In one example, when a terminal device requests the assets of scene 3701, for example, media server 2402 may provide the assets to the terminal device in any ordering, such as in the order of asset A, asset B, asset C, and asset D. The ordering in which assets become available to the terminal device may not provide an acceptable user experience.
It should be noted that the assets in scene 3701 are not of equal complexity, and the assets do not all impose the same computational load on the GPU in the terminal device. In this disclosure, the number of polygons that make up an asset is used as an indicator (proxy) of the computational load of the asset on the GPU. It should be noted that other features of the asset, such as surface properties, may also place computational load on the GPU.
In some examples, the number of polygons that make up the asset is used to measure the complexity of the asset. Among the assets of scene 3701, asset A is a modern house with a smooth appearance and has the lowest number of polygons, asset C is a non-realistic humanoid statue and has the second lowest number of polygons, asset B is a realistic person and has the second highest number of polygons, and asset D is a highly detailed tree line and has the highest number of polygons. Based on the number of polygons, the assets in scene 3701 can be ordered from lowest complexity to highest complexity in the order of asset A, asset C, asset B, and asset D.
In some examples, an asset prioritizer is used to analyze assets required for a scene and assign a priority value to each asset based on the relative complexity of the assets before the assets are streamed to the terminal device.
FIG. 38 shows a schematic diagram for prioritizing assets in a scene in some examples. In the example of fig. 38, an asset prioritizer 3802 is used to assign a priority value to each asset in a scene based on the relative complexity of the asset. For example, a scene (e.g., scene 3701 in fig. 37) relies on asset A (as shown at 3804), asset B (as shown at 3805), asset C (as shown at 3806), and asset D (as shown at 3807). Initially, the four assets are ordered as asset A, asset B, asset C, and asset D, as shown at 3801. Asset prioritizer 3802 prioritizes (e.g., reorders) the assets according to their relative complexity, e.g., based on the number of polygons in each asset. For example, asset D has the highest number of polygons and is assigned the highest priority value (e.g., 4); asset B has the second highest number of polygons and is assigned the second highest priority value (e.g., 3); asset C has the second lowest number of polygons and is assigned the second lowest priority value (e.g., 2); and asset A has the lowest number of polygons and is assigned the lowest priority value (e.g., 1). The four assets may be reordered according to the priority values, for example, as asset D, asset B, asset C, and asset A, as shown at 3803.
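The following Python sketch illustrates one way such a polygon-count-based prioritizer could be realized; the tuple layout, the 1-to-4 priority scale, and the example polygon counts are assumptions made for illustration only.

```python
# Non-normative sketch of complexity-based prioritization using the polygon
# count as a proxy for GPU load, as in FIG. 38.
def assign_complexity_priorities(assets):
    # assets: list of (name, polygon_count); higher polygon count -> higher priority
    ranked = sorted(assets, key=lambda a: a[1])            # lowest complexity first
    priorities = {name: rank + 1 for rank, (name, _) in enumerate(ranked)}
    streaming_order = sorted(assets, key=lambda a: priorities[a[0]], reverse=True)
    return priorities, [name for name, _ in streaming_order]

scene = [("A", 12_000), ("B", 450_000), ("C", 80_000), ("D", 2_500_000)]
priorities, order = assign_complexity_priorities(scene)
print(priorities)   # {'A': 1, 'C': 2, 'B': 3, 'D': 4}
print(order)        # ['D', 'B', 'C', 'A']
```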
It should be noted that while in the above example the number of polygons in an asset is used to evaluate the complexity of the asset and determine the priority value of the asset, other attributes of the asset that may affect the computational load on the GPU within the terminal device may be used to evaluate the complexity and determine the priority value.
In one example, the reordered assets may be streamed to the terminal device by the media server.
Fig. 39 illustrates a schematic diagram of an example for streaming scene-based immersive media to a terminal device in some examples. In the example of fig. 39, the scene of the scene-based immersive media includes four assets that are reordered according to the priority values of the four assets, e.g., from high to low as asset D, asset B, asset C, and asset A, as shown at 3901. The media server 3902 provides the four assets to the terminal device 3904 through the cloud 3903 in the order of the priority values. For example, the four assets are retrieved and sent to terminal device 3904 in the order of asset D (shown by 3905), asset B (shown by 3906), asset C (shown by 3907), and asset A (shown by 3908).
Comparing the first streaming ordering in fig. 24 with the second streaming ordering in fig. 39, terminal device 3904 may render the scene faster than terminal device 2404 because the time required to render the most complex asset sets a lower bound on the time the terminal device needs to render the entire scene. In some examples, the GPU in the terminal device may be configured to be multi-threaded so that the GPU can render the less complex assets in parallel while rendering the most complex asset, and the less complex assets will be available when the most complex asset has been rendered.
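The role of the most complex asset as a lower bound on scene render time can be illustrated with the following Python sketch, in which simulated per-asset render times stand in for GPU work and a thread pool stands in for a multi-threaded GPU; the timing values are arbitrary assumptions.

```python
# Non-normative sketch: the total render time of a scene rendered in parallel
# is dominated by the most complex asset (here, asset D).
import time
from concurrent.futures import ThreadPoolExecutor

RENDER_SECONDS = {"D": 0.8, "B": 0.4, "C": 0.2, "A": 0.1}   # streamed high to low priority

def render(asset):
    time.sleep(RENDER_SECONDS[asset])   # stand-in for GPU work
    return asset

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # The most complex asset (D) is started first; the others run in parallel
    # and finish before D does.
    list(pool.map(render, ["D", "B", "C", "A"]))
elapsed = time.perf_counter() - start
print(f"total render time ~ {elapsed:.2f}s (close to the 0.8s needed for asset D)")
```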
Some aspects of the present disclosure provide a method for dynamic adaptive streaming of light field based displays based on assigned asset priority values determined by an expected computational load on a GPU. In some examples, the ordering of assets for streaming may be based on asset priority values. In some examples, the asset priority value is increased when the computational complexity of the asset is higher than other assets. In some examples, the priority value is adjusted based on feedback from the client device. For example, the feedback may indicate GPU configuration at the client device, or different attributes for complexity assessment.
According to an eighth aspect of the present disclosure, assets in a scene-based immersive media, such as scene-based immersive media for a light field-based display (e.g., a light field display, a holographic display, etc.), may be prioritized based on a plurality of metadata attributes.
According to one aspect of the disclosure, each asset in a scene may have various metadata attributes, each of which would allow a renderer to prioritize the ordering in which assets are retrieved in order to render the scene, thereby providing a better experience to the user than simply retrieving the assets in the order in which they appear in the scene manifest.
According to another aspect of the present disclosure, a priority value based on a single metadata attribute may not reflect the best policy for prioritization. In some examples, a set of priority values reflecting multiple metadata attributes may provide a better user experience than another set of priority values based on a single metadata attribute.
According to another aspect of the present disclosure, a priority value based on a single priority preference set may not reflect the best strategy for prioritization, because terminal devices may differ significantly in many respects, such as the path characteristics between the terminal device and the cloud and the GPU computing capabilities of the terminal device, to name just two. To provide a better user experience, techniques may be used to prioritize assets for streaming based on two or more asset metadata attributes.
Using scene 3701 in fig. 37 as an example, each of the four assets may have multiple metadata attributes.
Fig. 40 shows a schematic diagram illustrating metadata attributes of an asset 4001 in one example. Asset 4001 may correspond to asset B 3706 in scene 3701, asset B 3706 being a realistic person. Asset 4001 includes various metadata attributes 4002, such as the size in bytes as shown in 4003, the number of polygons as shown in 4004, the distance from the default camera as shown in 4005, the visibility from the default camera (also referred to as the virtual camera at the default entry location) as shown in 4006, and other metadata attributes as shown in 4007 and 4008.
It should be noted that metadata attribute 4002 is for illustration, in some examples, other suitable metadata attributes may be present, and in some examples, some of the metadata attributes listed in fig. 40 may not be present.
In one example, when a terminal device requests the assets of scene 3701, for example, media server 2402 may provide the assets to the terminal device in any ordering, such as in the order of asset A, asset B, asset C, and asset D. The ordering in which assets become available to the terminal device may not provide an acceptable user experience.
Assets in scene 3701 are not of equal size, complexity, or immediate value to the end user. Each of these metadata attributes may be used alone as a basis for prioritization, but each of these metadata attributes may also be used in combination with other metadata attributes to determine relative asset priorities. For example, assets may be prioritized according to a policy that includes a plurality of metadata attributes, such as a first metadata attribute, shown by 4006, indicating visibility from the default camera, a second metadata attribute, shown by 4004, expressing the number of polygons as a numeric value, and a third metadata attribute, shown by 4005, expressing the distance from the default camera. In one example, among the assets of the scene, a first asset visible from the default camera is assigned the highest priority (e.g., 4 in one example), and the remaining assets are further processed according to the second metadata attribute and the third metadata attribute. For example, of the remaining assets, a second asset having a number of polygons greater than a polygon threshold is assigned the second highest priority (e.g., 3 in one example), and the remaining assets are further processed according to the third metadata attribute. For example, of the remaining assets, a third asset whose distance to the default camera is shorter than a distance threshold is assigned the third highest priority (e.g., 2 in one example).
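A minimal Python sketch of such a tiered, multi-attribute policy is shown below; the attribute names, thresholds, and the 1-to-4 priority scale are assumptions chosen to mirror the example above, not a defined profile.

```python
# Non-normative sketch of a three-tier policy: visible-from-default-camera
# assets first, then remaining assets above a polygon-count threshold, then
# remaining assets closer than a distance threshold, then everything else.
def tiered_priority(asset, polygon_threshold=100_000, distance_threshold=5.0):
    if asset["visible_from_default_camera"]:
        return 4
    if asset["polygon_count"] > polygon_threshold:
        return 3
    if asset["distance_from_default_camera"] < distance_threshold:
        return 2
    return 1

assets = [
    {"name": "A", "visible_from_default_camera": True,
     "polygon_count": 12_000, "distance_from_default_camera": 8.0},
    {"name": "B", "visible_from_default_camera": False,
     "polygon_count": 450_000, "distance_from_default_camera": 3.0},
    {"name": "C", "visible_from_default_camera": False,
     "polygon_count": 80_000, "distance_from_default_camera": 2.0},
    {"name": "D", "visible_from_default_camera": False,
     "polygon_count": 50_000, "distance_from_default_camera": 9.0},
]
for a in assets:
    a["priority"] = tiered_priority(a)
streaming_order = sorted(assets, key=lambda a: a["priority"], reverse=True)
print([(a["name"], a["priority"]) for a in streaming_order])
# [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
```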
In some examples, the asset prioritizer is used to analyze the assets required by the scene before the assets of the scene are streamed to the terminal device and to assign a priority to each asset according to the prioritization scheme being used, such as a policy that combines a plurality of metadata attributes. For example, asset prioritizer 3802 may be used to apply prioritization using a policy with multiple metadata attribute combinations. After the assets are prioritized, the assets are streamed in the prioritized order, as shown in FIG. 39.
The use of prioritization based on a plurality of metadata attributes allows the asset to be transmitted to the terminal device in a manner that is appropriate for the terminal device to improve the user experience. For example, the terminal device may request the asset that is immediately visible to the user (e.g., according to a first metadata attribute), then request the asset that is calculated to be most demanding in rendering (e.g., according to a second metadata attribute), and then request the asset in the rendering scene that is closest to the user (e.g., according to a third metadata attribute).
According to one aspect of the disclosure, because not all terminal devices have the same capabilities, and not all terminal devices have the same path characteristics (e.g., available bandwidth between the terminal device and the cloud), the best user experience for different users may come from the use of different prioritization schemes. In some examples, the terminal device may provide priority preference information to the asset prioritizer so that priority values are assigned accordingly.
FIG. 41 shows a schematic diagram for prioritizing assets in a scene in some examples. In the example of fig. 41, the terminal device 4108 can provide a set of priority preferences 4109 to the asset prioritizer 4102. For example, the priority preference set 4109 includes a plurality of metadata attributes. The asset prioritizer 4102 may then assign a priority value to each asset in the scene based on the priority preference set. Thus, the assigned priority value of the assets in the scene is optimized for the terminal device 4108.
In some examples, the terminal device 4108 can dynamically adjust the set of priority preferences based on measurements made by the terminal device. For example, when the terminal device 4108 has relatively low available bandwidth to the cloud, the terminal device can prioritize the assets based on asset size measured in bytes (e.g., 4003 in fig. 40), because the transfer time of the largest asset will serve as a lower limit on how quickly the scene can be rendered. When the same terminal device 4108 detects that the path from the cloud to the terminal device 4108 has sufficient bandwidth, the terminal device 4108 may provide a different set of priority preferences to the asset prioritizer 4102.
In some examples, related metadata attributes may be considered together as a composite attribute, and the terminal device may include the composite attribute in its priority preference set. For example, if the available metadata attributes for each asset include some measure of the number of polygons and of the level of surface detail, these attributes may together be treated as a measure of computational complexity, and the terminal device may include this measure in its priority preference set.
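The following Python sketch illustrates, under assumed attribute names, weights, and a bandwidth threshold, how a terminal device might derive a composite computational-complexity attribute and switch its priority preference set when its measured bandwidth changes.

```python
# Non-normative sketch of a composite attribute and a bandwidth-driven choice
# of priority preference set. The weights and threshold are illustrative.
def computational_complexity(asset, polygon_weight=1.0, surface_weight=50_000.0):
    # Composite attribute: polygon count plus a weighted surface-detail level.
    return (polygon_weight * asset["polygon_count"]
            + surface_weight * asset["surface_detail_level"])

def choose_priority_preferences(measured_bandwidth_mbps, low_bandwidth_threshold=10.0):
    if measured_bandwidth_mbps < low_bandwidth_threshold:
        # Constrained path: transfer time dominates, so order by size in bytes.
        return ["size_in_bytes"]
    # Ample bandwidth: order by the composite complexity attribute instead.
    return ["computational_complexity"]

print(choose_priority_preferences(4.0))    # ['size_in_bytes']
print(choose_priority_preferences(50.0))   # ['computational_complexity']
print(computational_complexity({"polygon_count": 450_000, "surface_detail_level": 3}))
```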
Some aspects of the present disclosure provide a method for dynamic adaptive streaming of light field based displays according to assigned asset priority values. Asset priority values may be assigned based on two or more metadata attributes for each asset to be streamed. In some examples, asset priority values for assets in a scene are determined based on a priority preference set. In some examples, the ordering of assets for streaming is based on asset priority values. In some examples, the priority preference set differs from terminal device to terminal device. In some examples, the priority preference set is provided by the terminal device. In some examples, the priority preference set is updated by the terminal device. In some examples, the set of priority preferences includes one or more composite attributes. In some examples, the set of priority preferences provided by the terminal device includes one or more composite attributes.
Fig. 42 shows a flowchart outlining a process 4200 according to an embodiment of the present disclosure. Process 4200 may be performed in a network, for example by a server device in the network. In some embodiments, process 4200 is implemented in software instructions, so when the processing circuitry executes the software instructions, the processing circuitry performs process 4200. For example, the smart client is implemented in software instructions, and the software instructions may be executed to perform a smart client process including process 4200. The process starts from S4201 and proceeds to S4210.
At S4210, the server device receives scene-based immersive media for play on the light field-based display. A scene in the scene-based immersive media includes a first ordered plurality of assets.
At S4220, the server device determines a second ranking for streaming the plurality of assets to the terminal device. The second ordering is different from the first ordering.
In some examples, the server device assigns priority values to the plurality of assets in the scene according to one or more attributes of the plurality of assets, respectively, and determines a second ranking for streaming the plurality of assets according to the priority values.
In some examples, the server device includes priority values for the plurality of assets in the description for the plurality of assets and provides the description to the terminal device. The terminal device requests a plurality of assets according to the priority values in the description.
According to some aspects of the present disclosure, a server device assigns a priority value to an asset in a scene according to the size of the asset. In one example, in response to the first asset having a byte size that is smaller than the second asset, the server device assigns a first priority value to the first asset and assigns a second priority value to the second asset, the first priority value being higher than the second priority value.
According to some aspects of the present disclosure, a server device assigns a priority value to an asset in a scene in accordance with the visibility of the asset in the scene associated with a default entry location of the scene. In some examples, the server device assigns a first priority value to a first asset visible from a default entry location of the scene and assigns a second priority value to a second asset not visible from the default entry location, the first priority value being higher than the second priority value. In one example, the server device assigns a second priority value to the asset in response to the asset being blocked by another asset in the scene. In another example, the server device assigns a second priority value to the asset in response to the asset being outside of a field of view associated with a default entry location of the scene. In another example, the server device assigns a second priority value to the asset in response to the asset being obscured by a shadow of another asset in the scene due to a light source in the scene.
According to some aspects of the disclosure, the server device assigns a first priority value to the first asset and a second priority value to the second asset in response to the first asset and the second asset being within a field of view associated with a default entry location of the scene, the first asset being closer to the default entry location of the scene than the second asset, and the first priority value being greater than the second priority value.
In some examples, the server device assigns a first set of priority values to the plurality of assets based on a first attribute of the plurality of assets, the first set of priority values being used to rank the plurality of assets in a first prioritization scheme. Further, the server device assigns a second set of priority values to the plurality of assets based on a second attribute of the plurality of assets, the second set of priority values being used to rank the plurality of assets in a second prioritization scheme. The server device selects a prioritization scheme for the terminal device from the first and second prioritization schemes according to the information of the terminal device, and ranks the plurality of assets for streaming according to the selected prioritization scheme.
In some examples, the server device includes the priority values assigned to the plurality of assets in the description for the plurality of assets. The priority values are used to rank the plurality of assets in a first prioritization scheme. The server device provides the description for the plurality of assets to the terminal device. In one example, the terminal device applies a second prioritization scheme to determine the request ordering of the plurality of assets and uses the first prioritization scheme as a tie breaker in response to a tie in the second prioritization scheme.
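One possible realization of such a tie-breaking arrangement is sketched below in Python; the field names client_priority and description_priority are hypothetical stand-ins for the client's own scheme and for the priority values carried in the asset description.

```python
# Non-normative sketch: the client sorts by its own scheme and falls back to
# the server-provided description priority when its own scheme ties.
def request_order(assets):
    return sorted(
        assets,
        key=lambda a: (-a["client_priority"], -a["description_priority"]),
    )

assets = [
    {"name": "A", "client_priority": 2, "description_priority": 1},
    {"name": "B", "client_priority": 2, "description_priority": 4},  # tie on client_priority
    {"name": "C", "client_priority": 3, "description_priority": 2},
]
print([a["name"] for a in request_order(assets)])   # ['C', 'B', 'A']
```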
In some examples, in response to the first asset having a higher computational complexity than the second asset, the server device assigns a first priority value to the first asset and assigns a second priority value to the second asset. The first priority value is higher than the second priority value. In one example, the higher computational complexity is determined based on at least one of: the number of polygons in the first asset, and the surface properties of the first asset.
In some examples, the one or more attributes are metadata attributes associated with each asset and may include at least one of: the size in bytes, the number of polygons, the distance from the default camera, and the visibility from the default camera.
In some examples, the server device receives a priority preference set for the terminal device, the priority preference set including a first set of attributes, and determines the second ranking from the first set of attributes. In one example, a server device receives an update of a priority preference of a terminal device, the update of the priority preference indicating a second set of attributes different from the first set of attributes. The server device may determine a second ordering of updates based on the second set of attributes.
In some examples, the server device receives a feedback signal from the terminal device and adjusts the priority value and the second ranking according to the feedback signal.
Then, the process 4200 proceeds to S4299 and ends.
The process 4200 may be adapted to various scenarios as appropriate, and the steps in the process 4200 may be adjusted accordingly. One or more steps of process 4200 may be modified, omitted, repeated, and/or combined. Process 4200 may be implemented using any suitable order. Additional step(s) may be added.
Fig. 43 shows a flowchart outlining a process 4300 in accordance with an embodiment of the present disclosure. The process 4300 may be performed by an electronic device, such as a terminal device (also referred to as a client device). In some embodiments, the process 4300 is implemented in software instructions, so that when the processing circuitry executes the software instructions, the processing circuitry performs the process 4300. For example, process 4300 is implemented as software instructions that the processing circuitry may execute to perform the intelligent controller process. The process starts at S4301 and proceeds to S4310.
At S4310, a request for scene-based immersive media for play on the light field-based display is sent by the electronic device to the network. A scene in the scene-based immersive media includes a first ordered plurality of assets.
At S4320, information of the electronic device is provided for adjusting a ranking of streaming the plurality of assets to the electronic device. The ranking is then determined from the information of the electronic device.
In some examples, an electronic device receives a description of a plurality of assets in a scene of a scene-based immersive media for play on a light field-based display. The description includes priority values for a plurality of assets. The electronic device requests one or more of the plurality of assets according to the priority value.
In some examples, the priority value is associated with a first prioritization scheme. The electronic device determines that the first asset and the second asset have the same priority according to a second prioritization scheme different from the first prioritization scheme. The electronic device then prioritizes one of the first asset and the second asset according to a first prioritization scheme.
In some examples, the electronic device provides a set of priority preferences to the network. Further, in one example, the electronic device detects an operating environment change, such as a network state change, and provides an update to the priority preference set in response to the operating environment change.
In some examples, the electronic device detects an operating environment change, such as a network state change. The electronic device then provides a feedback signal to the network indicating the change in the operating environment.
Then, the process 4300 proceeds to S4399 and ends.
The process 4300 may be adapted to various scenarios as appropriate, and the steps in the process 4300 may be adjusted accordingly. One or more steps of the process 4300 may be modified, omitted, repeated, and/or combined. Process 4300 may be implemented using any suitable order. Additional step(s) may be added.
According to a ninth aspect of the disclosure, a scene analyzer may be used to determine a prioritization of asset rendering based on asset visibility in a scene default viewport.
According to one aspect of the disclosure, a client device supporting scene-based media may be equipped with a renderer and/or game engine that supports the concept of a default camera and a viewport into the scene. The default camera provides camera properties that the renderer can use to create an internal image of the scene. Such internal images are therefore defined by the camera properties (e.g., horizontal and vertical resolution, x and y dimensions, depth of field, and many other possible properties). The viewport is defined by the internal image itself and by the virtual position of the camera in the scene. The virtual position of the camera in the scene gives the content creator or user the ability to specify precisely which portion of the scene is rendered into the internal image. Thus, objects that are not directly visible in the viewport of the default camera (e.g., objects behind the default camera or otherwise outside the viewport of the scene), i.e., objects that the user cannot see in any way, may not need to be rendered preferentially by the renderer. Nonetheless, some objects may be located close enough to the viewport of the default camera that their presence, and/or the shadows they cast near the viewport, may actually affect what appears in the viewport. For example, if a point light source that is not visible in the viewport is located behind other objects that are also not visible in the scene, the light reflected into the viewport of the scene may be affected by the shadows created by those invisible objects, because their presence blocks light from entering the portion of the scene captured by the viewport. Therefore, it is important that the renderer render such objects that have an impact on how light enters the portion of the scene captured by the viewport.
Some aspects of the present disclosure provide techniques for a scene analyzer to facilitate an individual decision process that orders the multiple individual assets within a scene to be converted from format A to format B by a client or on behalf of a client. The scene analyzer may calculate and store priority values in metadata in the scene, which may be determined based on a number of different factors, including whether the asset is in the scene area captured by the default camera viewport. The calculated priority value may then be stored directly into metadata describing the scene, such that subsequent processes in the network or in the client device are signaled of the potential importance of converting one asset before converting another, i.e., as indicated by the priority value stored into the scene metadata by the scene analyzer. In some examples, such a scene analyzer may utilize metadata describing the characteristics of the camera intended to create the viewport into the scene, i.e., to calculate whether a particular asset is inside or outside the camera's viewport into the scene. The scene analyzer may calculate the area of the viewport based on attributes of the camera (e.g., the width and length of the viewport, the depth of field of the focused object in the camera, etc.). In some examples, a first asset that is not in the camera viewport of a scene may be prioritized by the scene analyzer to be converted after a second asset in the default camera viewport is converted. Such prioritization may be signaled by the scene analyzer and stored into the scene metadata. Storing such metadata into the scene may facilitate subsequent processes within the distribution network, i.e., if the scene analyzer has calculated and stored the prioritization into the metadata of the scene, then the same calculations to determine which assets to convert before other assets need not be performed again. The process of the scene analyzer may be performed prior to the streaming of the media to the client, or in some examples as a specific step performed by the client device itself.
FIG. 44 shows a schematic diagram of a timed media representation 4400 that signals assets in a default camera view in some examples. Timed media representation 4400 includes a timed scene manifest 4400A, the timed scene manifest 4400A including information for a list of scenes 4401. The information of a scene 4401 refers to a list of components 4402 that describe the processing information and media asset types in the scene 4401, respectively. A component 4402 refers to an asset 4403, the asset 4403 further referring to a base layer 4404 and a property enhancement layer 4405. For each scene 4401, a list of assets 4407 is provided that are located in the default camera viewport of the scene.
Fig. 45 shows a schematic diagram of a non-timed media representation 4500 that signals assets in a default viewport in some examples. An untimed scene list (not depicted) references scene 1.0, and no other scenes can branch into scene 1.0. The information of scene 4501 is not associated with a start and end duration according to a clock. The information of scene 4501 refers to a list of components 4502 that describe the processing information and media asset types in the scene, respectively. A component 4502 refers to an asset 4503, the asset 4503 further referring to a base layer 4504 and attribute enhancement layers 4505 and 4506. In addition, the information of scene 4501 may refer to other scene information 4501 for non-timed media. The information of scene 4501 may also refer to scene information 4507 for a timed media scene. For a scene, list 4508 identifies the assets whose geometry is in the viewport of the default camera.
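As a non-normative illustration, the following Python sketch models scene metadata that carries the per-scene list of assets located in the default camera viewport (element 4407 in fig. 44 and list 4508 in fig. 45) and shows a downstream process consulting that list without recomputing visibility; the key names are assumptions, not a defined manifest syntax.

```python
# Non-normative sketch of scene metadata carrying a default-viewport asset list.
scene_metadata = {
    "scene_id": "scene_3301",
    "components": [
        {"asset_id": "asset_A", "base_layer": "A_base.bin",
         "enhancement_layers": ["A_enh1.bin"]},
        {"asset_id": "asset_B", "base_layer": "B_base.bin",
         "enhancement_layers": []},
        {"asset_id": "asset_C", "base_layer": "C_base.bin",
         "enhancement_layers": []},
    ],
    # List written by the scene analyzer: assets whose geometry is (wholly or
    # partly) inside the default camera viewport.
    "assets_in_default_viewport": ["asset_A", "asset_B"],
}

# A downstream process can consult the list without recomputing visibility.
visible_first = sorted(
    scene_metadata["components"],
    key=lambda c: c["asset_id"] not in scene_metadata["assets_in_default_viewport"],
)
print([c["asset_id"] for c in visible_first])   # ['asset_A', 'asset_B', 'asset_C']
```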
Referring back to fig. 9, in some examples, the media analyzer 911 in fig. 9 includes a scene analyzer that can perform analysis on assets of a scene. In one example, the scene analyzer may determine which of the assets of the scene are in the default viewport. For example, ingest media may be updated to store information into the scene regarding which assets are located in the default viewport of the camera.
Referring back to fig. 14, in some examples, the media analyzer 1410 in fig. 14 includes a scene analyzer that can perform analysis on assets of a scene. In one example, the scene analyzer may examine the client-adapted media 1409 to determine which assets are located in the viewport of the default camera of the scene (also referred to as the default viewport for entering the scene) for potential prioritization for rendering by the game engine 1405 and/or reconstruction process via the MPEG smart client 1401. In such an embodiment, the media analyzer may store a list of assets for each scene in the viewport of the default camera into each respective scene in the client adapted media 1409.
FIG. 46 shows a process flow 4600 for a scene analyzer to analyze assets in a scene according to a default viewport. The process begins at step 4601, where the scene analyzer obtains attributes (e.g., from the scene or from an external source not shown) describing the default camera location and the features of the default viewport (the initial view into the scene when the scene is first rendered), based on the attributes of the camera used to define the viewport. Such camera attributes include the type of camera used to create the viewport, as well as the length, width, and depth dimensions of the region in the scene captured by the viewport. In step 4602, the scene analyzer may calculate the actual default viewport based on the attributes acquired in step 4601. Next, in step 4603, the scene analyzer determines whether there are more assets in the asset list for the particular scene to be inspected. When there are no more assets to be inspected, the process proceeds to step 4606. In step 4606, the scene analyzer writes, into metadata for the scene, a list of the assets whose geometry (in whole or in part) is in the default viewport of the scene. The analyzed media is depicted as 4607. When there are more assets to process at step 4603, the process continues to step 4604. In step 4604, the scene analyzer analyzes the geometry of the next asset in the scene, and that asset becomes the current asset. The geometry and location of the current asset within the scene are compared to the area of the default viewport calculated at step 4602, to determine whether any portion of the asset geometry falls within the region of the default viewport. If any portion of the asset geometry falls within the default viewport, the process continues to step 4605. In step 4605, the scene analyzer stores the asset identifier in the list of assets located in the default viewport. After step 4605, the process returns to step 4603, where the list of assets of the scene is checked to determine whether there are more assets to process.
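A simplified Python sketch of process flow 4600 follows; it models the default viewport as an axis-aligned box derived from assumed camera attributes and tests asset bounding boxes against it, whereas a real renderer would test against the full camera frustum.

```python
# Non-normative sketch of process flow 4600: compute the default viewport
# region from camera attributes and collect the identifiers of assets whose
# bounding geometry intersects it.
def viewport_region(camera):
    # Region in scene coordinates covered by the default viewport (step 4602).
    x, y, z = camera["position"]
    w, h, d = camera["viewport_width"], camera["viewport_height"], camera["depth_of_field"]
    return ((x - w / 2, x + w / 2), (y - h / 2, y + h / 2), (z, z + d))

def intersects(region, asset_bbox):
    # True if the asset's axis-aligned bounding box overlaps the region (step 4604).
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(region, asset_bbox))

def analyze_scene(camera, assets):
    region = viewport_region(camera)
    visible = [a["id"] for a in assets if intersects(region, a["bbox"])]  # steps 4603-4605
    return {"assets_in_default_viewport": visible}                        # step 4606

camera = {"position": (0, 0, 0), "viewport_width": 4.0,
          "viewport_height": 3.0, "depth_of_field": 10.0}
assets = [
    {"id": "bookcase", "bbox": ((-1, 1), (-1, 1), (2, 4))},
    {"id": "statue",   "bbox": ((8, 9), (-1, 1), (2, 4))},
]
print(analyze_scene(camera, assets))   # {'assets_in_default_viewport': ['bookcase']}
```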
Some aspects of the present disclosure provide methods of asset prioritization. The method may calculate the position and size of a viewport created by a camera used to view the scene and compare the geometry of each asset of the scene to the viewport to determine whether a portion of the geometry of the asset intersects the viewport. The method identifies an asset whose geometry intersects, in part or in whole, the viewport created by the camera and stores an identification of the asset into a list of similar assets whose geometry intersects the area of the viewport created by the camera. The method then includes ordering and prioritizing the rendering of assets for a scene in the scene-based media presentation based on the list of similar assets whose geometry intersects the viewport region created by the camera.
Fig. 47 shows a flowchart outlining a process 4700 in accordance with an embodiment of the present disclosure. Process 4700 may be performed in a network, for example by a server device in the network. In some embodiments, process 4700 is implemented in software instructions, so when the processing circuitry executes the software instructions, the processing circuitry executes process 4700. The process starts at S4701 and proceeds to S4710.
At S4710, attributes of a viewport are determined, such as a position and a size of the viewport associated with a camera used to view a scene in the scene-based immersive media. The scene includes a plurality of assets.
At S4720, for each asset, it is determined whether the asset intersects the viewport at least in part based on the position and size of the viewport and the geometry of the asset.
At S4730, in response to the asset intersecting at least in part with the viewport, an identifier of the asset is stored in the list of visible assets.
At S4740, a list of visible assets is included in metadata associated with the scene.
In some examples, the position and size of the viewport is determined based on camera positions and features of a camera used to view the scene.
In some examples, an ordering for converting at least the first asset and the second asset from the first format to the second format is determined from a list of visible assets. In one example, the list of visible assets is converted from a first format to a second format before converting another asset of the scene that is not in the list of visible assets.
In some examples, the ordering of streaming at least the first asset and the second asset to the terminal device is determined from a list of visible assets. In one example, the list of visible assets is streamed before another asset of the scene that is not in the list of visible assets is streamed.
Then, the process 4700 proceeds to S4799 and ends.
The process 4700 may be adapted to various scenarios as appropriate, and the steps in the process 4700 may be adjusted accordingly. One or more steps of process 4700 may be modified, omitted, repeated, and/or combined. Process 4700 may be implemented using any suitable order. Additional step(s) may be added.
According to a tenth aspect of the present disclosure, techniques for prioritizing media adaptations according to the types of clients attached to the network may be used.
According to one aspect of the present disclosure, one of the problems affecting the efficiency of a media distribution network serving heterogeneous clients is determining which media to prioritize for conversion when a large number of clients (e.g., client devices, terminal devices) for which the media needs to be converted are attached to the network.
It is assumed, without loss of generality, that the conversion process may also be referred to as an "adaptation" process, as both names are used in the industry. A network performing media adaptation on behalf of its attached clients would then benefit from information about the types of clients currently attached to the network, which can direct the prioritization of such adaptations.
Some aspects of the present disclosure provide techniques for a media adaptation or media conversion process that may be performed to maximize the number of clients that may benefit from the adaptation (or conversion) process performed by the network. The disclosed subject matter addresses the need for a network that performs some or all of the media adaptation on behalf of one or more clients and client types to prioritize the adaptation process based on the number and type of clients currently attached to the network.
Fig. 48 illustrates a schematic diagram of an immersive media distribution module 4800 in some examples. The immersive media distribution module 4800 is similar to the immersive media distribution module 900 in fig. 9, but with the addition of a client tracking process shown by 4812, to illustrate immersive media network distribution with a client tracking process. The immersive distribution module 4800 can serve legacy displays as previously depicted in fig. 8 as well as displays with heterogeneous immersive media capabilities. As shown at 4801, content is created or obtained, which is further embodied in fig. 5 and 6 for natural content and CGI content, respectively. The content of 4801 is then converted to an ingest format using the create network ingest format process shown at 4802. The process shown at 4802 is also further embodied in fig. 5 and 6 for natural content and CGI content, respectively. The ingest media is optionally annotated with IMS metadata or analyzed by a scene analyzer in media analyzer 4811 to determine complexity attributes. The ingest media format is transmitted to the network and stored on the storage device 4803. In some examples, the storage device may reside in the network of the immersive media content producer and be accessed remotely by an immersive media network distribution process (not numbered), as depicted by the dashed line bisecting 4803. In some examples, the client and application specific information is available on remote storage device 4804, and in some examples, remote storage device 4804 may reside remotely in an alternative "cloud" network.
As depicted in fig. 48, the network orchestrator shown by 4805 serves as the primary information source and information pool (primary source and sink) to perform the primary tasks of the distribution network. In this particular embodiment, network orchestrator 4805 may be implemented in a unified format with other components of the network. However, the tasks depicted by network orchestrator 4805 in fig. 48 form some elements of the disclosed subject matter. Network orchestrator 4805 may also employ a two-way messaging protocol with the client to facilitate all processing and distribution of media according to the characteristics of the client. Furthermore, the bi-directional protocol may be implemented across different transport channels (i.e., control plane channels and data plane channels).
In some examples, network orchestrator 4805 receives information about the characteristics and attributes of one or more client devices 4808 via channel 4807, and further collects requirements regarding applications currently running on 4808. This information may be obtained from device 4804 or, in alternative embodiments, may be obtained by directly querying client device 4808. In the case of a direct query to client device 4808, in one example, it is assumed that a bi-directional protocol (not shown in FIG. 48) exists and is operable so that the client device can communicate directly with network orchestrator 4805.
In some examples, channel 4807 may also update client tracking process 4812 such that a record of the current number and types of client devices 4808 is maintained by the network. Network orchestrator 4805 may calculate one or more priorities for scheduling the media adaptation and segmentation process 4810 based on information stored by client tracking process 4812.
Network orchestrator 4805 also initiates and communicates with media adaptation and segmentation process 4810 described in fig. 10. When media adaptation and segmentation process 4810 adapts and segments the ingested media, in some examples, the media is transferred to an intermediate storage device (depicted as storage device 4809 ready to distribute the media). When the distribution media is prepared and stored in the device 4809, the network orchestrator 4805 ensures that the client device 4808 receives the distribution media and corresponding descriptive information 4806 via its network interface 4808B by a "push" request, or the client device 4808 itself may initiate a "pull" request for the media 4806 from the storage device 4809. Network orchestrator 4805 may employ a two-way message interface (not shown in fig. 48) to perform a "push" request or to initiate a "pull" request by client device 4808. In some examples, client 4808 can employ GPU (or CPU not shown) 4808C. The distribution format of the media is stored in a storage device or storage cache 4808D of the client device 4808. Finally, the client device 4808 visually presents media via its visualization component 4808A.
Throughout streaming the immersive media to the client device 4808, the network orchestrator 4805 can check the status of the client progress via the client progress and status feedback channel 4807. The checking of the status may be performed by means of a bi-directional communication message interface (not shown in fig. 48).
Fig. 49 shows a schematic diagram of a media adaptation process 4900 in some examples. The media adaptation process may adapt the ingested source media to match the requirements of one or more client devices, such as client device 908 (depicted in fig. 9). The media adaptation process 4901 may be performed by a plurality of components that facilitate adapting ingest media to an appropriate distribution format for a client device.
In some examples, the network orchestrator 4903 initiates the adaptation process 4901. In some examples, database 4912 may assist network orchestrator 4903 in prioritizing adaptation process 4901 by providing information related to the type and number of client devices (e.g., immersive client devices 4808) attached to the network.
In fig. 49, an adaptation process 4901 receives an input network state 4905 to track current traffic load on the network. The client device information includes attribute and property descriptions, properties and descriptions of the application, and the current state of the application, as well as a client neural network model (if available) to help map the geometry of the client frustum to the interpolation capability of the ingest immersive media. Such client device information may be obtained through a two-way message interface (not shown in fig. 49). The adaptation process 4901 ensures that the adaptation output is stored into the client adapted media storage device 4906 when it is created. In some examples, the scene analyzer 4907 may be performed prior to or as part of a network automation process for distributing media.
In some examples, the adaptation process 4901 is controlled by a logic controller 4901F. The adaptation process 4901 also employs a renderer 4901B or a neural network processor 4901C to adapt the particular ingest source media to a format suitable for the client device. Neural network processor 4901C uses the neural network models in 4901A. Examples of such a neural network processor 4901C include deep neural network model generators as described in MPI and MSI. If the media is in a 2D format but the client requires a 3D format, the neural network processor 4901C may invoke a process that uses highly correlated images from the 2D video signal to derive a volumetric representation of the scene depicted in the video. An example of a suitable renderer 4901B may be a modified version of an OTOY renderer (not shown), modified to interact directly with the adaptation process 4901. The adaptation process 4901 optionally employs a media compressor 4901D and a media decompressor 4901E, depending on whether these tools are needed with respect to the format of the ingested media and the format required by the client device.
Some aspects of the present disclosure provide a method that includes directing a prioritization process for adapting media to the media requirements of client devices in a network. The method includes collecting profile information for the client devices attached to the network. In one example, the media requirements of a client device include attributes such as one or more of the following: the format and the bit rate of the media. In some examples, the profile information for the client devices includes the types and numbers of client devices attached to the network.
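The following Python sketch illustrates, under an assumed client-record layout, how the collected profile information could be used to prioritize the adaptation whose media requirement serves the largest number of attached clients.

```python
# Non-normative sketch of prioritizing adaptations by the number of attached
# clients that share a media requirement, using a client tracking record like
# the one maintained by process 4812. The record layout is an assumption.
from collections import Counter

def prioritize_adaptations(client_records):
    # client_records: one entry per attached client, keyed by the media
    # requirement (format, bit rate) its display needs.
    demand = Counter((c["format"], c["bitrate_mbps"]) for c in client_records)
    # Adapt for the most common requirement first.
    return [req for req, _count in demand.most_common()]

attached_clients = [
    {"id": 1, "format": "lightfield_v1", "bitrate_mbps": 200},
    {"id": 2, "format": "lightfield_v1", "bitrate_mbps": 200},
    {"id": 3, "format": "mesh_glTF",     "bitrate_mbps": 50},
]
print(prioritize_adaptations(attached_clients))
# [('lightfield_v1', 200), ('mesh_glTF', 50)]
```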
Fig. 50 shows a flowchart outlining a process 5000 according to an embodiment of the present disclosure. The process 5000 may be performed in a network, for example by a server device in the network. In some embodiments, process 5000 is implemented in software instructions, so when the processing circuitry executes the software instructions, the processing circuitry performs process 5000. The process starts from S5001 and proceeds to S5010.
At S5010, profile information of a client device attached to a network is received. The profile information may be used to direct the adaptation of the media to one or more media requirements of the client device to distribute the media to the client device.
At S5020, the first adaptation is prioritized according to the profile information of the client device, the first adaptation adapting the media to the first media requirements for distribution to the first subset of client devices.
In some examples, the media requirements of the one or more media requirements include at least one of a format requirement and a bit rate requirement of the media.
In some examples, the profile information for the client device includes a type of client device and a number of client devices of each type.
In one example, in response to a first number of client devices in the first subset being greater than a second number of client devices in the second subset, adapting the media to a first adaptation of a first media requirement for distribution to the first subset of client devices is prioritized over adapting the media to a second adaptation of a second media requirement for distribution to the second subset of client devices.
Then, the process 5000 proceeds to S5099 and ends.
The process 5000 may be adapted to the various scenarios as appropriate, and the steps in the process 5000 may be adjusted accordingly. One or more steps of process 5000 may be modified, omitted, repeated, and/or combined. Process 5000 may be implemented using any suitable order. Additional step(s) may be added.
While this disclosure has described a number of exemplary embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of this disclosure. It will thus be appreciated that those skilled in the art will be able to devise numerous systems and methods which, although not explicitly shown or described herein, embody the principles of the disclosure and are thus within the spirit and scope of the disclosure.

Claims (20)

1. A method of immersive media processing, wherein the method comprises:
a network device receives scene-based immersive media for play on a light field-based display, the scene-based immersive media comprising a plurality of scenes of the immersive media;
assigning priority values to the plurality of scenes in the scene-based immersive media, respectively; and
Determining an order of streaming the plurality of scenes to the terminal device according to the priority value.
2. The method of claim 1, wherein determining the ordering further comprises:
reordering the plurality of scenes according to the priority value; and
the reordered plurality of scenes is transmitted to the terminal device.
3. The method of claim 1, wherein determining the ordering further comprises:
the priority aware network device selecting a highest priority scene having a highest priority value from a subset of untransmitted scenes of the plurality of scenes; and
transmitting the highest priority scene to the terminal equipment.
4. The method of claim 1, wherein assigning the priority value further comprises:
a priority value for a scene is determined based on a likelihood that the scene needs to be rendered.
5. The method of claim 1, further comprising:
determining that available network bandwidth is limited;
selecting a highest priority scene having a highest priority value from a subset of untransmitted scenes in the plurality of scenes; and
the highest priority scene is transmitted in response to the available network bandwidth being limited.
6. The method of claim 1, further comprising:
determining that available network bandwidth is limited;
identifying a subset of the plurality of scenes that is unlikely to be needed for a next rendering based on the priority value; and
in response to the available network bandwidth being limited, streaming the subset of the plurality of scenes is avoided.
7. The method of claim 1, wherein assigning the priority value further comprises:
a first priority value is assigned to a first scene based on a second priority value of a second scene in response to a relationship between the first scene and the second scene.
8. The method of claim 1, further comprising:
receiving a feedback signal from the terminal device; and
adjusting at least a priority value of a scene of the plurality of scenes based on the feedback signal.
9. The method of claim 8, further comprising:
the highest priority is assigned to a first scene in response to a feedback signal indicating that the current scene is a second scene associated with the first scene.
10. The method of claim 8, wherein the feedback signal indicates a priority adjustment determined by the terminal device.
11. A method of immersive media processing, comprising:
a terminal device having a light field based display receives a Media Presentation Description (MPD) of scene-based immersive media for playback by the light field based display, the scene-based immersive media including a plurality of scenes of immersive media, and the MPD indicating to stream the plurality of scenes to the terminal device in a ranking;
detecting bandwidth availability;
determining a ranking change for at least one scenario based on the bandwidth availability; and
a feedback signal is sent indicating a change in the ordering of the at least one scene.
12. The method of claim 11, wherein the feedback signal indicates a next scene to be rendered.
13. The method of claim 11, wherein the feedback signal indicates an adjustment of a priority value of the at least one scene.
14. The method of claim 11, wherein the feedback signal indicates a current scene.
15. A method of immersive media processing, comprising:
the network device receives scene-based immersive media for play on a light field-based display, a scene in the scene-based immersive media comprising a first ordered plurality of assets; and
A second ordering for streaming the plurality of assets to a terminal device is determined, the second ordering being different from the first ordering.
16. The method of claim 15, comprising:
assigning priority values to the plurality of assets in the scene according to one or more attributes of the plurality of assets, respectively; and
the second ranking for streaming the plurality of assets is determined according to the priority value.
17. The method of claim 16, comprising:
a priority value is assigned to an asset in the scene based on the size of the asset.
18. The method of claim 16, comprising:
a priority value is assigned to an asset in the scene associated with a default entry location of the scene based on the visibility of the asset.
19. The method of claim 16, comprising:
assigning a first set of priority values to the plurality of assets based on a first attribute of the plurality of assets, the first set of priority values being used to rank the plurality of assets in a first prioritization scheme;
assigning a second set of priority values to the plurality of assets based on a second attribute of the plurality of assets, the second set of priority values being used to rank the plurality of assets in a second prioritization scheme;
Selecting a prioritization scheme for the terminal device from the first prioritization scheme and the second prioritization scheme according to the information of the terminal device; and
the plurality of assets are ordered for streaming according to the selected prioritization scheme.
20. The method of claim 16, comprising:
in response to a first asset having a higher computational complexity than a second asset, a first priority value is assigned to the first asset and a second priority value is assigned to the second asset, the first priority value being higher than the second priority value.