CN118103802A - Interactive anchor in augmented reality scene graph - Google Patents

Interactive anchor in augmented reality scene graph

Info

Publication number
CN118103802A
CN118103802A (application CN202280069971.4A)
Authority
CN
China
Prior art keywords
trigger
anchor
augmented reality
description
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280069971.4A
Other languages
Chinese (zh)
Inventor
V·阿莱姆
卡洛琳·贝拉德
P·希特兹林
P·乔伊特
马蒂厄·弗拉代
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
InterDigital CE Patent Holdings SAS
Original Assignee
InterDigital CE Patent Holdings SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by InterDigital CE Patent Holdings SAS filed Critical InterDigital CE Patent Holdings SAS
Priority claimed from PCT/EP2022/077469 (published as WO2023057388A1)
Publication of CN118103802A
Legal status: Pending

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

An augmented reality scene description is provided that includes the relationships between real objects and virtual objects and the interactive triggering and processing of the augmented reality scene; an augmented reality system may read the scene description and run a corresponding augmented reality application. The scene description includes a scene graph that structures descriptions of the real and virtual objects, which may be linked to media content items. The scene description also includes anchors that describe triggers for performing actions on the media content items, on the scene description itself, and/or on remote devices and services.

Description

Interactive anchor in augmented reality scene graph
1. Technical field
The present principles relate generally to the field of augmented reality scene description and augmented reality rendering. The present document is also to be understood in the context of the formatting and playback of an augmented reality application rendered on an end-user device such as a mobile device or a head-mounted display (HMD).
2. Background art
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present principles, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present principles. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Augmented Reality (AR) is a technology that enables an interactive experience in which a real world environment is enhanced by virtual content that can be defined across multiple sensory modalities including visual, auditory, haptic, etc. During the runtime of an application, virtual content (e.g., 3D content or audio files) is rendered in real-time in a manner consistent with the user context (environment, viewpoint, device, etc.). Scene graphs (such as, for example, the scene graph of glTF by Khronos and its extensions defined in the MPEG scene description format, or the scene graph of USDZ by Apple) are possible ways of representing content to be rendered. They incorporate, on the one hand, a declarative description of the scene structure linking the real environment objects and the virtual objects, and, on the other hand, a binary representation of the virtual content. The dynamics of an AR system using such a scene graph is embedded in an AR application specific to a given AR scene. In such applications, the virtual content item may be changed or adjusted, but the timing and triggering of AR content rendering is controlled by the application itself and cannot be exported to another application.
What is lacking is an AR system that can consume a description of an AR scene that includes both the links between real and virtual elements and a description of the dynamics of the AR experience to be rendered.
3. Summary of the invention
The following presents a simplified summary of the inventive principles in order to provide a basic understanding of some aspects of the inventive principles. This summary is not an extensive overview of the principles of the invention. It is not intended to identify key or critical elements of the principles of the invention. The following summary merely presents some aspects of the principles of the invention in a simplified form as a prelude to the more detailed description provided below.
The present principles relate to a method for rendering an augmented reality scene for a user in a real environment. The method comprises the following steps:
-obtaining a description of the augmented reality scene, the description comprising:
● A scene graph of linked nodes; and
● An anchor associated with at least one node of the scene graph and comprising:
-at least one trigger, the trigger being a description of at least one condition;
when at least one condition of a trigger is detected in the real environment, the trigger is activated;
-at least one action, the action being a description of a process to be performed by the augmented reality engine;
-loading at least a portion of the media content items linked to the nodes of the scene graph;
-observing the augmented reality scene; and
-applying the at least one action of the anchor to the at least one node associated with the anchor, on condition that at least one trigger of the anchor is activated.
The present principles also relate to an augmented reality rendering device including a memory associated with a processor configured to implement the above described method.
The present principles also relate to a data stream representing an augmented reality scene and comprising:
-a description of the augmented reality scene, the description comprising:
● A scene graph of linked nodes; and
● An anchor associated with at least one node of the scene graph and comprising:
-at least one trigger, the trigger being a description of at least one condition;
When at least one condition of a trigger is detected in the real environment, the trigger is activated;
-at least one action, the action being a description of a process to be performed by the augmented reality engine; and
-media content items linked to nodes of the scene graph.
4. Description of the drawings
The disclosure will be better understood and other specific features and advantages will appear upon reading the following description, with reference to the drawings in which:
FIG. 1 shows an example augmented reality scene graph;
FIG. 2 shows a non-limiting example of an AR scene graph description according to a non-limiting implementation of the principles of the present invention;
FIG. 3 shows an example architecture of a device that may be configured to implement the methods described with respect to FIGS. 5 and 6, according to a non-limiting implementation of the principles of the present invention;
FIG. 4 shows an example of implementation of the syntax of a data stream encoding an augmented reality scene description according to the principles of the invention;
FIG. 5 illustrates a method for rendering an augmented reality scene according to a first implementation of the principles of the present invention;
FIG. 6 illustrates a method 60 for rendering an augmented reality scene according to a second implementation of the principles of the present invention;
FIG. 7 shows an exemplary scene description in a first format according to the principles of the present invention;
FIG. 8 shows another exemplary scene description in the first format according to the principles of the present invention;
FIG. 9 shows an exemplary scene description in a second format according to the principles of the present invention.
5. Detailed description of the preferred embodiments
The present principles will be described more fully hereinafter with reference to the accompanying drawings, in which examples of the principles of the invention are shown. The principles of the present invention may, however, be embodied in many alternate forms and should not be construed as limited to the examples set forth herein. Thus, while the principles of the invention are susceptible to various modifications and alternative forms, specific examples thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intention to limit the principles of the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the principles of the invention as defined by the claims.
The terminology used herein is for the purpose of describing particular examples only and is not intended to be limiting of the principles of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. In addition, when an element is referred to as being "responsive" or "connected" to another element, it can be directly responsive or connected to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly responsive" or "directly connected" to another element, there are no intervening elements present. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items, and may be abbreviated as "/".
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the teachings of the present principles.
Although some of the illustrations include arrows on communication paths to show the primary communication direction, it should be understood that communication may occur in a direction opposite to the depicted arrows.
Some examples are described with respect to block diagrams and operational flow diagrams, in which each block represents a circuit element, module, or portion of code that comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in other implementations, the functions noted in the blocks may occur out of the order noted. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Reference herein to "in accordance with one example" or "in one example" means that a particular feature, structure, or characteristic described in connection with the example may be included in at least one implementation of the principles of the invention. The appearances of the phrase "in accordance with one example" or "in one example" in various places in the specification are not necessarily all referring to the same example, nor are separate or alternative examples mutually exclusive of other examples.
Reference numerals appearing in the claims are by way of illustration only and do not limit the scope of the claims. Although not explicitly described, the present examples and variations may be employed in any combination or sub-combination.
Fig. 1 illustrates an exemplary augmented reality scene graph 10. In this example, the scene graph includes a description 12 of a real object (e.g., a 'planar horizontal surface', which may be a table, a floor or a board) and a description 13 of a virtual object (e.g., an animation of a walking character). The scene graph node 13 is associated with a media content item 14, which is an encoding of the data (e.g., a textured animated 3D mesh) required to render and display the walking character. Scene graph 10 also includes node 11, which is a description of the spatial relationship between the real object described in node 12 and the virtual object described in node 13. In this example, node 11 describes a spatial relationship that makes the character walk on the planar surface. When the AR application is launched, the media content item 14 is loaded, rendered, and buffered for display upon triggering. Once a sensor (a camera in the example of Fig. 1) detects a planar surface in the real environment, the application displays the buffered media content item as described in node 11. Such a scene graph does not describe the timing and triggering of the AR application; timing and triggering are inherent steps of the application. More complex triggers can be programmed to, for example, wait for an action from the user. Since the AR experience timing and triggering may differ, different applications to which the scene graph 10 is provided will behave differently.
The nodes of the scene graph may also not include descriptions, but merely act as parent nodes of the child nodes.
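To make the structure above concrete, the following minimal Python sketch models the nodes of Fig. 1, including a parent node that carries only a relationship and children without media; the class and field names (Node, media_uri, children) and the media file name are illustrative assumptions, not part of any scene description format defined here.

```python
# Illustrative sketch of the Fig. 1 scene graph; names are assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    description: Optional[str] = None   # real- or virtual-object description; None for a pure parent node
    media_uri: Optional[str] = None     # media content item linked to a virtual-object node
    children: list["Node"] = field(default_factory=list)

# Node 12: real object; node 13: virtual object linked to media content item 14.
real_surface = Node("node12", description="planar horizontal surface")
walking_character = Node("node13", description="animated walking character",
                         media_uri="character_walk.glb")   # hypothetical media file

# Node 11 describes the spatial relationship: the character walks on the surface.
root = Node("node11", description="virtual character placed on real surface",
            children=[real_surface, walking_character])
```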
AR applications are diverse and can be applied to different contexts and real environments. For example, in an industrial AR application, when a camera installed on a head mounted display device detects a reference object (part B of an engine) in a real environment, a virtual 3D content item (e.g., part a of the engine) is displayed. The 3D content item is positioned in the real world at a defined position and scale relative to the detected reference object.
In an AR application for interior design, when a given image from a catalog is detected in an input camera view, a 3D model of furniture is displayed. The 3D content is located in the real world at a defined position and scale relative to the detected reference image. In another application, playing of a certain audio file may begin when a user enters an area near a church (either real or virtually rendered in an enhanced real environment). In another example, an advertising song (ad jingle) file may be played when the user sees a given can of soda in the real environment. In outdoor gaming applications, various virtual characters may appear depending on the semantics of the scene observed by the user. For example, the bird character is adapted to the tree, so if the sensor of the AR device detects a real object described by the semantic tag 'tree', birds flying around the tree can be added. In a companion application implemented by smart glasses, when an automobile is detected within the field of view of the user's camera, automobile noise may be initiated in the user's headset to alert him of a potential hazard; furthermore, the sound may be spatially rendered so as to arrive from the direction in which the car is detected.
Fig. 2 shows a non-limiting example of an AR scene description in accordance with the principles of the present invention. The scene graph as described with respect to fig. 1 is enhanced with a set of anchors associated with nodes of the scene graph. Anchors are elements of a scene description that specify the relationship between an AR scene (multi-modal media) and the real world (environment and user). These anchors provide information of when and/or under what conditions an action must be performed. The action may involve loading of the media content item, rendering and playing of the media content item, or modification of the scene description itself.
The anchor is associated with a node of the scene graph and includes two information elements:
-at least one trigger. A trigger describes a set of joint conditions such as the detection of a given image, the detection of an object with a given semantic or geometric characteristic, a specific user interaction, the user entering a given region of real space, a timer reaching a specific value, etc. A trigger is activated when a sensor of the AR system detects the described condition in the real and/or virtual environment; and
-A set of actions describing what happens to the portion of the AR scene corresponding to the associated node when at least one trigger is activated: upload, start, stop, update, connect to a server, notify, update a scene graph, etc.
Since these anchors specify author-defined behaviors (triggers and actions), it is possible to precisely define when and how the rendering engine should handle the nodes of the AR scene graph at runtime.
In the example of fig. 2, a set of anchors is defined within the scene description. Any node or group of nodes of the scene graph may be associated with one or several anchors. When a node is associated with several anchors, information in the node may indicate how to handle the anchor triggers (only the first satisfied, only the last, all together). When the trigger of an anchor is activated, the action performed on that node also affects its child nodes. If a child node has its own anchor, the action of that anchor may prevent the child node from being affected by the action of the anchor of its parent node (e.g., in the case of repositioning the media content item). For example, node E is associated with anchors 21 and 22 (with an 'OR' combination). As child nodes of E, nodes H and I may be involved in the actions of anchors 21 and 22. Node C is associated with anchor 23, and so is node F as a child of node C. Node G is also a child of node C but is associated with anchor 24. In the example of fig. 2, nodes A, B and D are not associated with any anchor.
In the example of fig. 2, anchor 21 includes one trigger and three associated actions. Anchor 22 includes two triggers combined with an 'AND' (for example, the detection of a given object and the time range [1:05 pm to 1:15 pm]) and two actions; these two actions are performed only when the given object is detected and the current time is within the time range. Anchor 23 includes one trigger identical to the second trigger of anchor 22 (the time range [1:05 pm to 1:15 pm]) and one action, such as playing a ringing tone in the headset described in node C (with the left speaker of node F and the right speaker of node G). In this example, the ringing tone plays from 1:05 pm to 1:15 pm in both speakers of the headset. Anchor 24 has one trigger, for example the detection by a front-door camera that a person is at the front door of a house, and one action identical to that of anchor 23, i.e. playing a ringing tone, but only in the right speaker described by node G associated with anchor 24.
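As an illustration, the anchors 21 to 24 of Fig. 2 could be expressed as data roughly as follows; the dictionary keys ('triggers', 'actions', 'combine', ...) and placeholder values are assumptions chosen for readability, not the normative syntax of the scene description.

```python
# Sketch of the Fig. 2 anchors; field names and values are illustrative only.
anchor_21 = {"triggers": [{"type": "unspecified_trigger"}],
             "actions": ["action_1", "action_2", "action_3"]}

anchor_22 = {"triggers": [{"type": "object_detection", "object": "given_object"},
                          {"type": "time_range", "start": "13:05", "end": "13:15"}],
             "combine": "AND",        # both conditions must hold
             "actions": ["action_4", "action_5"]}

anchor_23 = {"triggers": [{"type": "time_range", "start": "13:05", "end": "13:15"}],
             "actions": [{"type": "play_audio", "media": "ring_tone",
                          "target": "node_C"}]}   # headset: both speakers (nodes F and G)

anchor_24 = {"triggers": [{"type": "object_detection", "sensor": "front_door_camera",
                           "object": "person"}],
             "actions": [{"type": "play_audio", "media": "ring_tone",
                          "target": "node_G"}]}   # right speaker only

# Node-to-anchor association, including the 'OR' combination on node E.
node_anchors = {"E": {"anchors": [anchor_21, anchor_22], "combine": "OR"},
                "C": {"anchors": [anchor_23]},
                "G": {"anchors": [anchor_24]}}
```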
Triggering
A trigger specifies a condition for the occurrence of an action. It may be based on environmental characteristics, user input, or timers. It may also be adjusted according to constraints such as the type of rendering device or the user profile. There are several types of triggers.
The context-based trigger is related to the user context and depends on data captured by the sensor during runtime. The context-based trigger may be, for example:
* From a visual sensor:
-2D marker: detection of a given 2D image (described or referenced in the trigger);
-3D marker: detection of a given 3D object (described or referenced in the trigger);
-visual signature: detection of a particular feature point arrangement (e.g., generated or provided by another user);
-geometric properties: verification of geometric properties (e.g., vertical planes);
-semantic properties: detection of objects with semantic properties (e.g., faces, trees, etc.);
* From the audio sensor:
-audio marker: detection of a specific noise (a signal described or referenced in the trigger, or a signal that can be identified by audio feature extraction);
-audio characteristics: detection of a type of noise, which may be characterized by various means, such as semantics (e.g., noise of an automobile) or periodicity (periodic beeps) or other means.
The environment-based trigger may be activated by any sensor providing information about the real environment, such as a temperature sensor, an inertial measurement unit, GPS, a moisture sensor, an IR or UV light sensor, a wind sensor, etc.
When a trigger depends on the detection of a particular item in the real world, a model or semantic description of that item is described in or attached to the trigger. For example, in the case of a 2D marker, a 3D marker or an audio marker, a reference 2D image, a reference 3D mesh or a reference audio file, respectively, is attached to the anchor.
In the case where an item to be detected is described by a semantic description (which may be, for example, a series of word tokens), the processing is performed as follows:
if the application processes only virtual scenes based on the scene graph (possibly updated from time to time by the application or by other means), as in VR gaming applications, for example, the semantic descriptions are accurate enough to accurately identify virtual items that appear in the scene graph during the application runtime.
If the application processes the mixed reality application based on the scene graph (possibly updated from time to time by the application or by other means), the semantic descriptions are sufficiently accurate to accurately identify items that appear in the scene graph or in the real scene linked to the scene graph. For example, an application may rely on a scene graph to manage virtual objects placed in a real scene that the application intends to view or analyze through an associated sensor (e.g., a 2D camera or a 3D sensor such as LiDAR).
When a matching item is detected (in the virtual or real world corresponding to the scene graph), the application transforms the corresponding graph position according to the position and orientation of the detected item so as to be able to ultimately anchor any linked nodes to the identified pose. Depending on the sensor or method used to detect a matching real item, the transformation may be performed with different techniques. For example, if the detection is based on a 2D camera sensor and on image-based detection of objects matching the semantic description, the transformation processes the 2D object position in the 2D camera image together with the relative device (camera) position to estimate, in the 3D space of the scene graph, the pose of the node corresponding to the detected item. In some cases, additional transformations are applied to provide a final pose related to the match: to semantically match the real object in the scene graph, to provide a pose centered on the object, or to apply a pose correction for some types of objects. In practice, providing the pose of the center of the matching object may be useful, for example, if the matching object is large; and providing a corrected pose for a matching object may be useful to obtain an anchoring pose that differs from the detected object in orientation. For example, the semantic description of the item may be (without limitation) "open hands", "smiling face", "red carpet", "game board", and so forth.
In the event that multiple items match the semantic description at the same time, the application may decide, but is not limited to, to follow a built-in policy or to process the items according to a policy based on metadata defined at the scene level or defined separately at particular nodes of the scene graph. For example, possible policies may be: ignoring the multiple detections, because the semantic description is not accurate or discriminative enough; or processing each match by copying the anchored node to each matching item location (in some examples, triggering some copying of the nodes and possibly of their children); or selecting one of these matches according to some proximity criterion, treating it as the only one, and ignoring all other matches. Different policies may also be combined. The proximity criterion may be the distance between a matching location in the scene and another node of the scene (or a user located in the corresponding area of the scene).
The trigger event may be stable in time or, conversely, may be instantaneous. For example, when an anchor is defined by a semantic description, the corresponding item may only be observed for a limited period of time. Upon detecting a semantic match and estimating its pose, the application may first trigger the anchoring of the relevant node; when the match is no longer observed, the anchor may later revert to an unsatisfied state, resulting in a de-anchoring operation (the previously anchored node returns to the default pose provided in the scene graph, as if it had never been anchored).
A marker (2D, 3D, or semantic description) may correspond to a moving item, and the application may manage this in various ways. For example, when a marker is detected, the marker pose is estimated once, and the anchor is applied to the node at this once-estimated position (even if the marker pose or position later changes). Alternatively, when a marker is detected and then updated by the application or by other means, the anchor is applied to the node at this periodically estimated position, potentially updating the node pose in the scene graph as the marker moves. Combinations of markers can be defined and used to estimate the anchor position. A series of candidate markers may be provided, and the pose of the anchor may be estimated when one of these markers is detected (and its pose estimated), or when all (or a given number of) markers are detected and their poses estimated (making the final anchor pose estimation robust or accurate). In case a combination (aggregation) of markers is used to estimate the position of the anchor, a relative layout between these markers is provided. One of these markers may be defined as the reference marker of the combination (e.g., the first provided marker), and the relative pose of all other markers may be used when estimating the pose of the anchor. The final anchor position is given relative to the reference marker.
In an embodiment, a minimum space around the marker may be required to anchor the item and estimate its pose. This information may be defined as, for example, a bounding cube (e.g., in meters) or a bounding sphere. This can be used to manage multiple detections of items, for example when the items are defined by a semantic description.
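A minimal sketch of anchor pose estimation from such a combination of markers is given below. It assumes 4x4 homogeneous transforms, a relative layout expressed with respect to the reference marker, and a naive aggregation of the estimates; the function name and signature are assumptions, not part of the described format.

```python
# Sketch of anchor pose estimation from a combination of markers (reference-marker layout).
import numpy as np

def estimate_anchor_pose(detected_marker_poses: dict, relative_layout: dict,
                         reference_id: str, anchor_offset: np.ndarray):
    """Return the anchor pose in world space, or None if no usable marker was detected."""
    estimates = []
    for marker_id, world_pose in detected_marker_poses.items():
        if marker_id == reference_id:
            ref_pose = world_pose
        else:
            # The layout gives each marker relative to the reference:
            # world_pose = ref_pose @ relative_layout[marker_id]
            ref_pose = world_pose @ np.linalg.inv(relative_layout[marker_id])
        estimates.append(ref_pose)
    if not estimates:
        return None
    # Naive aggregation: average the translations, keep the first rotation.
    # A robust implementation would average the rotations properly.
    ref = estimates[0].copy()
    ref[:3, 3] = np.mean([e[:3, 3] for e in estimates], axis=0)
    return ref @ anchor_offset   # final anchor pose given relative to the reference marker
```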
The user-based trigger is related to user behavior and may be, for example:
input user interactions (e.g. user flick gesture, or user voice command, or user gaze direction remains unchanged for a certain duration, user picks up or moves objects of real or virtual environment, etc.);
-the location of the user within a given area (e.g. close to a given object, located in a building or room);
-a view of the user corresponding to a given direction or containing a given spatial point;
External triggers are based on information provided by an external application or service, such as, for example:
Alert/information from the connected object (e.g. temperature above threshold);
Notifications from the current application (e.g. collision detected);
notification from another application/service (e.g. receipt of text message or weather);
Time of day;
-time elapsed since starting a given media (part of a scene description);
the time elapsed since the scene description was processed (i.e. the duration of use of the scene description);
The trigger is an information element that includes descriptors describing the nature of the trigger (related to the type of sensor required for activation of the trigger) and each parameter required to detect a particular occurrence of the trigger. The additional descriptor may indicate whether the action should continue once the trigger is no longer activated.
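As a sketch, such a trigger information element could be modeled as below; the field names are assumptions, and the optional constraint field anticipates the embodiment described in the next paragraphs.

```python
# Sketch of a trigger information element; field names are assumptions.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Trigger:
    nature: str                                    # e.g. "2d_marker", "time_range", "user_proximity"
    parameters: dict[str, Any] = field(default_factory=dict)
    continue_after_deactivation: bool = False      # whether the action continues once the trigger stops
    constraint: Optional[dict[str, Any]] = None    # relaxed condition used for preloading (see below)

time_trigger = Trigger(nature="time_range",
                       parameters={"start": "09:00", "end": "10:00"},
                       constraint={"start": "08:55", "end": "10:00"})
```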
In an embodiment, some types of triggers may include descriptors describing a constraint under which activation of the trigger becomes possible. For example, a time range of [9:00 am, 10:00 am] may have a constraint set to 8:55 am, indicating that the trigger will soon be activated. The constraint may be used to load the media content items linked to the nodes of the scene graph describing virtual objects only when the trigger is about to be activated. In this embodiment, the media content items of the virtual objects are loaded, decoded, made ready to render, and buffered only when they are likely to be used. This saves memory and processing resources by processing and storing only the media content items that the action of the anchor is likely to involve.
In another example, consider a node N2 linked to an anchor defining a spatial trigger TR with a threshold distance criterion TDc between two objects. For example, TDc represents a minimum distance Dm between the two objects described by the two nodes N3 and N4. When the distance between the two objects becomes smaller than Dm, for any reason, the trigger is activated and the action is applied to node N2. The trigger TR may include a descriptor describing a constraint, which may be a distance Dd greater than Dm or a percentage p > 1.0 (Dd = p × Dm). When the observed distance between the two objects described by N3 and N4 becomes less than Dd, the media content item linked to N2 is loaded, decoded, made ready to be rendered and buffered. Conversely, when the constraint is no longer met, the media content item is unloaded from memory. In this example, when the observed distance between the two objects described by N3 and N4 is greater than Dd, the media content item linked to N2 is unloaded.
Accordingly, when the probability of trigger activation becomes low (in the spatial trigger example, the user gradually moves away from a node that is only activated when the user is very close to it), the associated media and active connections may be unloaded and/or released to avoid wasting storage or connection resources.
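A minimal sketch of this constraint-gated loading, using the Dm and p notation of the distance example above (Dd = p × Dm), could look as follows; the function names and the cache structure are assumptions.

```python
# Sketch of constraint-gated preloading for node N2 (spatial trigger TR of the example above).
def update_preloading(distance_n3_n4: float, d_m: float, p: float, n2_media: str, cache: dict) -> None:
    d_d = p * d_m                                   # relaxed constraint distance Dd = p * Dm, with p > 1.0
    if distance_n3_n4 < d_d and n2_media not in cache:
        cache[n2_media] = "loaded, decoded and buffered"   # placeholder for the real loading pipeline
    elif distance_n3_n4 >= d_d and n2_media in cache:
        del cache[n2_media]                         # constraint no longer met: unload to free memory

def is_trigger_active(distance_n3_n4: float, d_m: float) -> bool:
    return distance_n3_n4 < d_m                     # the actual trigger condition TDc
```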
Action
The actions may be:
-initiating media processing: rendering video, playing sound, rendering haptic feedback, or displaying virtual content items;
-stopping or suspending media processing;
Modifying the scene description (the scene description may be dynamically updated to reflect desired changes, like moving objects, new/disappeared objects, etc.), thereby modifying the scene graph and/or anchors; the anchor may change the position (pose, orientation) of the individual nodes or nodes plus their child nodes (in a hierarchical representation of the media). Portions of the scene (branches of nodes in the hierarchical representation) may be replicated. The rendering of some nodes (or branches of nodes) may be hidden until some triggers and related conditions are met.
-Establishing a network connection making the media content accessible;
Remote notification
A request for a scene description update (which may be a local request to the user's application, or a request to connect to a remote server using a network, or other request).
An action is an information element that includes descriptors that describe the nature of the action (related to the type of rendering device (i.e., display, speaker, vibrator, heat sensitive device (heat), etc.) required for execution of the action) and each parameter required for a particular occurrence of the action. The additional descriptors may indicate the manner in which the media must be rendered. Some descriptors may depend on the media type:
-3D content: gesture and scale (which may be defined with respect to triggers);
-audio: volume, length, circulation;
-tactile sensation: position, type, intensity;
-a delay before initiating an action;
Delay before the action is considered outdated (and ignored);
for these descriptors, default values may be indicated or defined in the memory of the AR system.
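For illustration, an action information element carrying the descriptors listed above could be sketched as follows; the field names and default values are assumptions, not the normative syntax.

```python
# Sketch of an action information element; field names and values are assumptions.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Action:
    nature: str                                  # "play", "stop", "update_scene", "notify", ...
    renderer: str = "display"                    # display, speaker, vibrator, heat device, ...
    media_descriptors: dict[str, Any] = field(default_factory=dict)   # media-type-dependent descriptors
    start_delay_s: float = 0.0                   # delay before initiating the action
    expiry_delay_s: Optional[float] = None       # delay after which the action is considered outdated

play_jingle = Action(nature="play", renderer="speaker",
                     media_descriptors={"volume": 0.8, "length_s": 10, "loop": False})
place_model = Action(nature="play", renderer="display",
                     media_descriptors={"pose": "relative_to_trigger", "scale": 1.0})
```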
In accordance with the principles of the present invention, an AR scene description as described above may be loaded from memory or from a network by an AR system equipped with sensors, including at least an AR processing engine.
In an embodiment, an anchor may not include an action. The action of the anchor may then be determined by default from the nature of the trigger. For example, if the nature of the trigger is marker detection, the action defaults to a placement action. In this embodiment, a default action per type of trigger is stored in the memory of the AR processing engine.
Fig. 3 illustrates an exemplary architecture of an AR processing engine 30 that may be configured to implement the methods described with respect to fig. 5 and 6. Devices according to the architecture of fig. 3 are linked with other devices via their bus 31 and/or via I/O interface 36.
The device 30 comprises the following elements linked together by a data and address bus 31:
A microprocessor 32 (or CPU), for example a DSP (or digital signal processor);
-ROM (or read only memory) 33;
-RAM (or random access memory) 34;
-a storage interface 35;
an I/O interface 36 for receiving data to be transferred from an application; and
A power source (not shown in fig. 3), such as a battery.
According to an example, the power supply is external to the device. In each of the memories mentioned, the word "register" used in the specification may correspond to a small capacity region (some bits) or a very large region (e.g., the entire program or a large amount of received or decoded data). The ROM 33 includes at least programs and parameters. ROM 33 may store algorithms and instructions for performing the techniques according to the principles of the present invention. When turned on, the CPU 32 uploads the program in the RAM and executes the corresponding instruction.
RAM 34 includes programs in registers that are executed by CPU 32 and uploaded after the device 30 is turned on, input data in registers, intermediate data in different states of the method in registers, and other variables used to execute the method in registers.
Implementations described herein may be implemented in, for example, a method or process, an apparatus, a computer program product, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (e.g., discussed only as a method or apparatus), the implementation of the features discussed may also be implemented in other forms (e.g., a program). The apparatus may be implemented in, for example, suitable hardware, software and firmware. The method may be implemented, for example, in an apparatus (such as, for example, a processor) generally referred to as a processing device, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end users.
The device 30 is linked to a set of sensors 37 and a set of rendering devices 38, for example via a bus 31. The sensor 37 may be, for example, a camera, a microphone, a temperature sensor, an inertial measurement unit, a GPS, a moisture sensor, an IR or UV light sensor, or a wind sensor. Rendering device 38 may be, for example, a display, a speaker, a vibrator, a thermal device, a fan, etc.
According to an example, the device 30 is configured to implement the method described with respect to fig. 5 and 6, and belongs to a set comprising:
-a mobile device;
-a communication device;
-a gaming device;
-a tablet (or tablet computer);
-a laptop;
-a still picture camera;
-a video camera.
Fig. 4 illustrates an example of an implementation of the syntax of a data stream encoding an augmented reality scene description according to the present principles. Fig. 4 shows an exemplary structure 4 of an AR scene description. The structure is contained in a container that organizes the stream in independent elements of syntax. The structure may include a header portion 41, which is a set of data common to every syntax element of the stream. For example, the header portion includes metadata about the syntax elements, describing the nature and the role of each of them. The structure also includes a payload comprising syntax element 42 and syntax element 43. Syntax element 42 includes data representing the media content items described in the nodes of the scene graph that relate to virtual elements. Images, meshes and other raw data may have been compressed according to a compression method. Syntax element 43 is part of the payload of the data stream and includes data encoding a scene description as described with respect to Fig. 2.
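The sketch below illustrates one way such a container could be serialized, with a header element, a media payload element and a scene description element; the length-prefixed byte layout and JSON encoding are assumptions for illustration only, not the normative syntax of Fig. 4.

```python
# Sketch of the Fig. 4 container: header 41, media payload 42, scene description payload 43.
import json
import struct

def pack_ar_stream(header: dict, media_blobs: list, scene_description: dict) -> bytes:
    header_bytes = json.dumps(header).encode()
    media_bytes = b"".join(struct.pack("<I", len(b)) + b for b in media_blobs)   # compressed media items
    scene_bytes = json.dumps(scene_description).encode()
    out = b""
    for element in (header_bytes, media_bytes, scene_bytes):
        out += struct.pack("<I", len(element)) + element   # length-prefixed independent syntax elements
    return out
```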
Fig. 5 illustrates a method 50 for rendering an augmented reality scene according to a first implementation of the present principles. At step 51, a scene description is obtained. In accordance with the present principles, a scene description includes a scene graph linking nodes, each node being a description of a real object or a description of a virtual object. Nodes describing virtual objects may be linked to media content items. A node of the scene graph may also describe relationships between other nodes. The scene description also includes anchors. An anchor includes at least one trigger describing under what conditions the anchor is activated, and at least one action. A node of the scene graph may be associated with one or several anchors. In a first implementation of the present principles, at step 52, each media content item linked to a node of the scene graph is loaded, made ready to be rendered, and buffered. At step 53, the augmented reality system observes the real environment through a set of sensors of various types. At step 54, the augmented reality system analyzes the input from the sensors to verify whether the conditions described by the triggers of the anchors are met. If so, the satisfied trigger is activated and step 55 is performed; otherwise, steps 53 and 54 are performed iteratively. The processing engine checks the fulfillment of the triggers listed in the scene description based on the user's input and on any application information (time, etc.) involved in the scene rendering, and refreshes these checked conditions as the scene description is updated. If an anchor includes several triggers, the anchor may include descriptors that order the triggers so that they are checked in order of increasing cost (e.g., in terms of processing resources). For example, if the anchor includes both a time range and an object detection, the time range is checked first (because this is an easy and quick check), and the object detection is checked only if the time range condition is satisfied. At step 55, the action of the anchor whose trigger is activated is applied to the node of the scene graph associated with that anchor. As described with respect to Fig. 2, the action may be, for example, starting, pausing, or stopping the playing of the buffered media content item, modifying the scene description, or communicating with a remote device. The anchor may include a descriptor indicating when and under what conditions the action must be stopped or continued.
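A compact sketch of this rendering loop is given below. It assumes trigger objects with a 'nature' field as in the earlier sketch, an 'AND' combination of an anchor's triggers, and engine/scene interfaces (scene.nodes, scene.node_anchors, engine.load, engine.apply, condition_met) that are assumptions rather than a defined API.

```python
# Sketch of method 50 (steps 52-55); helper interfaces are assumptions.
def estimated_check_cost(trigger) -> int:
    # Cheap conditions (e.g. a time range) are checked before costly ones (e.g. object detection).
    return {"time_range": 0, "user_input": 1, "2d_marker": 2, "3d_marker": 3}.get(trigger.nature, 2)

def run_method_50(scene, sensors, engine, condition_met):
    """condition_met(trigger, observations) -> bool is supplied by the AR engine."""
    buffers = {node.name: engine.load(node.media_uri)              # step 52: load and buffer everything
               for node in scene.nodes if node.media_uri}
    while engine.running:
        observations = {s.name: s.read() for s in sensors}         # step 53: observe the environment
        for node, anchors in scene.node_anchors.items():
            for anchor in anchors:
                triggers = sorted(anchor.triggers, key=estimated_check_cost)
                if all(condition_met(t, observations) for t in triggers):   # step 54: cheap checks short-circuit
                    for action in anchor.actions:
                        engine.apply(action, node, buffers)        # step 55: apply the anchor's actions
```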
Fig. 6 illustrates a method 60 for rendering an augmented reality scene according to a second implementation of the present principles. At step 61, a scene description is obtained. In this second implementation, at least one trigger of an anchor of the scene description includes a descriptor indicating a constraint as described with respect to Fig. 2. At step 62, the media content items linked to nodes associated with anchors whose triggers have no constraint are loaded; the media content items linked to nodes associated with anchors having a constraint are not loaded. Then, at step 53, the system begins to observe the real environment by analyzing the input from its sensors. If the trigger constraint of an anchor is met at step 63, the media content item linked to the node associated with that anchor is loaded at step 64. If the trigger constraint of the anchor is no longer met, at step 65, the data relating to the media content item linked to the node associated with that anchor is unloaded from memory. In a variant, the trigger of the anchor includes a descriptor indicating that the data relating to the loaded media content item has to be kept in memory, and step 65 is skipped. If the constraint is not met or is no longer met, step 54 is performed. At step 54, the augmented reality system analyzes the input from the sensors to verify whether the conditions described by the triggers of the anchors are met. If so, the satisfied trigger is activated and step 55 is performed; otherwise, step 53 is performed iteratively.
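The corresponding sketch for method 60 gates the loading and unloading of media on the trigger constraints, reusing the assumed interfaces of the previous sketch plus a constraint_met predicate; for brevity it ignores the 'keep in memory' variant and nodes without anchors.

```python
# Sketch of method 60 (steps 62-65 plus 53-55); helper interfaces are assumptions.
def run_method_60(scene, sensors, engine, condition_met, constraint_met):
    """condition_met/constraint_met(trigger, observations) -> bool are supplied by the AR engine."""
    buffers = {}
    for node, anchors in scene.node_anchors.items():               # step 62: load unconstrained media only
        if node.media_uri and not any(t.constraint for a in anchors for t in a.triggers):
            buffers[node.name] = engine.load(node.media_uri)
    while engine.running:
        observations = {s.name: s.read() for s in sensors}         # step 53: observe the environment
        for node, anchors in scene.node_anchors.items():
            gating = [t for a in anchors for t in a.triggers if t.constraint]
            if gating and node.media_uri:
                if all(constraint_met(t, observations) for t in gating):
                    if node.name not in buffers:
                        buffers[node.name] = engine.load(node.media_uri)   # steps 63-64: preload
                elif node.name in buffers:
                    del buffers[node.name]                         # step 65: unload when no longer met
                    continue
            for anchor in anchors:
                if all(condition_met(t, observations) for t in anchor.triggers):   # step 54
                    for action in anchor.actions:
                        engine.apply(action, node, buffers)        # step 55
```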
Fig. 7 illustrates an exemplary scene description in a first format in accordance with the present principles. In this exemplary format, anchors are described in an indexed list. Each anchor includes the description of its triggers and actions. A node associated with anchors includes a descriptor with the indices of the associated anchors.
Fig. 8 illustrates another exemplary scene description of a first format in accordance with the principles of the invention.
Fig. 9 illustrates an exemplary scene description in a second format in accordance with the present principles. In this format, triggers and actions are described in two indexed lists. Anchors are described in a third indexed list and include the indices of their triggers (in the first list) and the indices of their actions (in the second list). A node associated with anchors includes a descriptor with the indices (in the third list) of the associated anchors.
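An indicative example of such an indexed layout is sketched below as a Python dictionary; the keys and values are assumptions chosen to illustrate the indexing scheme, not the schemas actually shown in Figs. 7 to 9.

```python
# Sketch of the second format: triggers, actions and anchors in separate indexed lists.
scene_description = {
    "triggers": [
        {"type": "time_range", "start": "13:05", "end": "13:15"},      # index 0
        {"type": "2d_marker", "image": "catalog_page.png"},            # index 1 (hypothetical image)
    ],
    "actions": [
        {"type": "play", "renderer": "speaker"},                       # index 0
        {"type": "place", "pose": "relative_to_trigger"},              # index 1
    ],
    "anchors": [
        {"triggers": [0], "actions": [0]},                             # anchor index 0
        {"triggers": [1], "actions": [1]},                             # anchor index 1
    ],
    "nodes": [
        {"name": "headset", "anchors": [0]},
        {"name": "furniture_3d_model", "media": "sofa.glb", "anchors": [1]},
    ],
}
```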
Implementations described herein may be implemented in, for example, a method or process, an apparatus, a computer program product, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (e.g., discussed only as a method or apparatus), the implementation of the features discussed may also be implemented in other forms (e.g., a program). The apparatus may be implemented in, for example, suitable hardware, software and firmware. The method may be implemented, for example, in an apparatus (such as, for example, a processor) generally referred to as a processing device, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices such as, for example, smart phones, tablets, computers, mobile phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end users.
Implementations of the various processes and features described herein may be embodied in a variety of different equipment or applications, particularly equipment or applications associated with, for example, data encoding, data decoding, view generation, texture processing, and other processing of images and related texture information and/or depth information. Examples of such equipment include encoders, decoders, post-processors that process output from the decoders, pre-processors that provide input to the encoders, video decoders, video codecs, web servers, set-top boxes, laptops, personal computers, cell phones, PDAs, and other communication devices. It should be clear that the equipment may be mobile, even mounted in a mobile vehicle.
Additionally, the methods may be implemented by instructions executed by a processor, and such instructions (and/or data values resulting from the implementation) may be stored on a processor-readable medium, such as, for example, an integrated circuit, a software carrier, or other storage device, such as, for example, a hard disk, a compact disk ("CD"), an optical disk (such as, for example, a DVD, commonly referred to as a digital versatile disk or digital video disk), a random access memory ("RAM"), or a read only memory ("ROM"). The instructions may form an application program tangibly embodied on a processor-readable medium. The instructions may be, for example, hardware, firmware, software, or a combination. The instructions may be found, for example, in an operating system, an application program alone, or a combination of both. Thus, a processor may be characterized as, for example, a device configured to perform a process and a device comprising a processor-readable medium (such as a storage device) having instructions for performing a process. Further, a processor-readable medium may store data values resulting from an implementation in addition to or in lieu of instructions.
It will be apparent to those skilled in the art that implementations may produce various signals formatted to carry, for example, storable or transmittable information. The information may include, for example, instructions for performing a method or data resulting from one of the implementations. For example, the signals may be formatted as data carrying rules for writing or reading the syntax of the described embodiments, or as data carrying actual syntax values written by the described embodiments. Such signals may be formatted, for example, as electromagnetic waves (e.g., using the radio frequency portion of the spectrum) or as baseband signals. Formatting may include, for example, encoding the data stream and modulating the carrier with the encoded data stream. The information carried by the signal may be, for example, analog or digital information. It is well known that signals may be transmitted over a variety of different wired or wireless links. The signal may be stored on a processor readable medium.
A number of implementations have been described. It will be appreciated that many modifications may be made. For example, elements of different implementations may be combined, supplemented, modified, or removed to produce other implementations. In addition, one of ordinary skill will understand that other structures and processes may be substituted for those disclosed and that the resulting implementations will perform at least substantially the same functions in at least substantially the same ways to achieve at least substantially the same results as the implementations disclosed. Accordingly, the present application contemplates these and other implementations.
Use example
1. The content provider may know in advance what the anchor looks like (a 2D anchor image or a 3D anchor model). In this case, the image or model would be placed into the scene description and delivered to the client (along with the scene description and the renderable volumetric content or other renderable content). The client will find the anchor and use it as a trigger to render the associated content and place the associated content into the scene.
A. Examples: colas trigger the playing of visual or audio advertisements or 'advertising songs' songs.
B. Examples: an art museum. The content provider knows the environment in advance and the anchors can be different images of artwork in the museum. When a drawing or sculpture comes into view of the user, detection of the anchor will trigger an interactive visual/volumetric experience.
C. Examples: a living room of the user. The same environment may be known from just past observations of the environment from the client, e.g., the content creator may know what the user's living room looks like, because the user always wears a mixed reality headset in the living room, so various useful anchors from the known environment will be available for the scene description. (the available anchors may be determined by the user's client and reported to the server as anchors; or the client may only periodically send a scan of the environment to the server, and the appropriate anchors may be determined at the server side).
2. The content provider may only know a general (or "fuzzy") description of what will trigger the placement of the content, e.g., if the client detects object X or environmental characteristic Y, this detection will trigger the placement of the content. No accurate 2D image or 3D model is available, but some semantic description of the triggering object or environmental characteristic will be put into the 'scene description anchor' description, and the client uses the semantic description with some object or environmental-characteristic detection algorithm.
A. examples: the enhanced Bao dream Go (Pok mon Go) application. The server provides a volumetric representation of each virtual treasured dream and an associated scene description associates each treasured dream to some semantic description of where it is best to place the treasured dream. For example:
i. For Pidgeot, anchor = tree.
ii. For Gyarados, anchor = river or lake (or a body of water).
iii. For Snorlax, anchor = bed or couch.
B. Examples: virtual refrigerator magnet application. The contents simulate a photograph or a magnet attached to the front door of the refrigerator. The anchor is to detect the refrigerator in the environment and then anchor the content to the front door of the refrigerator. This is an interesting example because the description is semantic (we have no specific image of the refrigerator we are looking for), but the content will still need to be anchored to a specific area of the detected refrigerator (e.g. the top half of the refrigerator door, near the level of the eyes. You will not want to place the content on the side or back of the refrigerator, and the content should not be placed down near the floor).
3. Multiple users in the same environment. This is an interesting example because the anchors in the scene description enable virtual content to be placed consistently for multiple users (i.e., aligned with the same environmental anchor) so that the users can interact consistently with the content. This would be straightforward if the environment were known in advance. However, if the environment is not known in advance, the situation becomes more interesting.
A. Examples: virtual chess game
I. User 1 receives a scene description with a chessboard and pawn objects, and in the initial description the anchor is semantic: it gives a requirement for a flat surface above the floor, such as a table or a countertop.
User 1's client detects an appropriate surface (a coffee table in the real local environment) and places the content so that user 1 can start the experience (the board is placed on the coffee table with the white pawns facing user 1).
The client of user 1 detects an anchor feature suitable for anchoring the board to the coffee table.
This may use the texture or corners/edges of the table, depending on what provides enough detail to act as an anchor. The client uses these persistent features to maintain alignment of the board with the table.
The client of user 1 reports the determined anchor back to the server or content provider, along with details of how the content is aligned with respect to the determined anchor. (The anchor may be forwarded as a 2D image or a 3D model, depending on the client capabilities.)
User 2 joins the game and retrieves the content and scene description from the server. The scene description is now updated to include a more specific anchor (e.g., a 2D image or 3D model of the coffee-table anchor, together with relative alignment information describing how the content, the chessboard and pawns, is aligned with respect to the anchor), based on what user 1's client provided to the server.
The client of user 2 renders the chessboard using the anchor and alignment information provided by the server. Thus, user 2's client is able to place the content in the same real-world position/orientation as user 1's client, and both can play the game with a consistent view of the virtual content anchored to the real world.

Claims (21)

1. A method for rendering an augmented reality scene for a user in a real environment, the method comprising:
-obtaining a description of the augmented reality scene, the description comprising:
● A scene graph of linked nodes; and
● An anchor associated to at least one node of the scene graph and comprising:
-at least one trigger, the trigger being a description of at least one condition;
when at least one condition of a trigger is detected in the real environment, the trigger is activated;
-at least one action, the action being a description of a process to be performed by the augmented reality engine;
-loading at least a portion of a media content item linked to the node of the scene graph;
-observing the augmented reality scene; and
-Applying said at least one action of the anchor to said at least one node associated with the anchor on condition that at least one trigger of the anchor is activated.
2. The method of claim 1, wherein the at least one trigger of an anchor associated with at least one node comprises a constraint, and wherein a media content item linked to the at least one node is loaded only when the constraint is observed in the augmented reality scene.
3. The method of claim 2, wherein the media content item linked to the at least one node associated with the anchor is offloaded when a triggering constraint of the anchor is no longer observed in the augmented reality scene.
4. A method according to one of claims 1 to 3, wherein the trigger of an anchor comprises a descriptor indicating whether to continue the at least one action of the anchor once the trigger is no longer activated.
5. The method of one of claims 1 to 4, wherein a trigger comprises at least two conditions, and wherein the trigger comprises a descriptor indicating how to combine the at least two conditions.
6. The method according to one of claims 1 to 5, wherein the at least one condition of the trigger belongs to a group of conditions comprising:
-context-based triggering;
-user based triggering; and
-An external trigger.
7. The method according to one of claims 1 to 5, wherein a trigger depends on the detection of an object in the real environment, and wherein the trigger is associated with a model of the object or with a semantic description of the object.
8. The method of one of claims 1 to 7, wherein the at least one action of the trigger belongs to a group of actions comprising:
-playing, pausing or stopping a media content item;
-modifying the description of the augmented reality scene; and
-Connecting a remote device or service.
9. The method of one of claims 1 to 8, wherein an anchor does not include an action, the at least one action of the anchor being determined by default according to a type of the at least one trigger of the anchor.
10. An apparatus for rendering an augmented reality scene for a user in a real environment, the apparatus comprising a memory associated with a processor configured to:
-obtaining a description of the augmented reality scene, the description comprising:
● A scene graph of linked nodes; and
● An anchor associated to at least one node of the scene graph and comprising:
-at least one trigger, the trigger being a description of at least one condition;
when at least one condition of a trigger is detected in the real environment, the trigger is activated;
-at least one action, the action being a description of a process to be performed by the augmented reality engine;
-loading at least a portion of a media content item linked to the node of the scene graph;
-observing the augmented reality scene; and
-Applying said at least one action of the anchor to said at least one node associated with the anchor, on condition that said at least one trigger of the anchor is activated.
11. The device of claim 10, wherein the at least one trigger of an anchor associated with at least one node comprises a constraint, and wherein a media content item linked to the at least one node is loaded only when the constraint is observed in the augmented reality scene.
12. The apparatus of claim 11, wherein the media content item linked to the at least one node associated with an anchor is offloaded when a triggered constraint of the anchor is no longer observed in the augmented reality scene.
13. The device of one of claims 10 to 12, wherein the trigger of an anchor comprises a descriptor indicating whether to continue the at least one action of the anchor once the trigger is no longer activated.
14. The device of one of claims 10 to 13, wherein a trigger comprises at least two conditions, and wherein the trigger comprises a descriptor indicating how to combine the at least two conditions.
15. The device of one of claims 10 to 14, wherein the at least one condition of the trigger belongs to a group of conditions comprising:
-context-based triggering;
-user based triggering; and
-An external trigger.
16. The device of one of claims 10 to 14, wherein a trigger depends on the detection of an object in the real environment, and wherein the trigger is associated with a model of the object or with a semantic description of the object.
17. The device of one of claims 10 to 16, wherein the at least one action of the trigger belongs to a group of actions comprising:
-playing, pausing or stopping a media content item;
-modifying the description of the augmented reality scene; and
-Connecting a remote device or service.
18. The device of one of claims 10 to 17, wherein an anchor does not include an action, the at least one action of the anchor being determined by default according to a type of the at least one trigger of the anchor.
19. A data stream representing an augmented reality scene and comprising:
-a description of the augmented reality scene, the description comprising:
● A scene graph of linked nodes; and
● An anchor associated to at least one node of the scene graph and comprising:
-at least one trigger, the trigger being a description of at least one condition;
when at least one condition of a trigger is detected in a real environment, the trigger is activated;
-at least one action, the action being a description of a process to be performed by the augmented reality engine; and
-A media content item, the media content item being linked to a node of the scene graph.
20. The data stream of claim 19, wherein the anchor does not include an action.
21. The data stream according to claim 19 or 20, wherein a trigger depends on the detection of an object in the real environment, and wherein the trigger is associated with a model of the object or with a semantic description of the object.
CN202280069971.4A 2021-10-06 2022-10-03 Interactive anchor in augmented reality scene graph Pending CN118103802A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP21306409.0 2021-10-06
EP21306731 2021-12-09
EP21306731.7 2021-12-09
PCT/EP2022/077469 WO2023057388A1 (en) 2021-10-06 2022-10-03 Interactive anchors in augmented reality scene graphs

Publications (1)

Publication Number Publication Date
CN118103802A true CN118103802A (en) 2024-05-28

Family

ID=79231133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280069971.4A Pending CN118103802A (en) 2021-10-06 2022-10-03 Interactive anchor in augmented reality scene graph

Country Status (1)

Country Link
CN (1) CN118103802A (en)

Legal Events

Date Code Title Description
PB01 Publication