WO2021263018A1 - Contextual augmentation using scene graphs - Google Patents

Contextual augmentation using scene graphs

Publication number
WO2021263018A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
objects
room
relationships
target scene
Application number
PCT/US2021/038948
Other languages
French (fr)
Inventor
Mohammad Keshavarzi
Aakash PARIKH
M. Luisa G. CALDAS
Allen Y. Yang
Original Assignee
The Regents Of The University Of California
Application filed by The Regents Of The University Of California
Publication of WO2021263018A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05 Geographic models
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts


Abstract

A method of augmenting scenes with virtual objects includes accessing a target scene, extracting features from the target scene into a spatial scene graph representation of the target scene, generating one or more datasets from the spatial scene graph representation, using a machine learning system to iteratively operate on the one or more datasets by sampling positions and orientations in the target scene to create a probability map for placement of a virtual object in the scene, and predicting a viable placement for the virtual object in the target scene, producing a final scene.

Description

CONTEXTUAL AUGMENTATION USING SCENE GRAPHS
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to and the benefit of US Provisional Application No. 63/043,904, filed June 25, 2020, which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
[0002] This disclosure relates to spatial computing and computer-aided design techniques, more particularly to augmenting virtual objects to existing scenes.
BACKGROUND
[0003] Spatial computing experiences such as augmented reality (AR) and virtual reality (VR) have formed an exciting new market in today's technological space. New applications and experiences launch daily across the categories of gaming and game engines, healthcare, design, education, architecture planning, computer graphics, 3D modeling and more. However, all of these countless applications are subject to physical constraints caused by the geometry and semantics of the 3D user environment, where existing furniture and building elements are present. Contrary to traditional 2D graphical user interfaces, where a flat rectangular region hosts digital content, 3D spatial computing environments are often occupied by physical obstacles that are diverse and oftentimes non-convex. Therefore, how one can assess content placement in spatial computing experiences is highly dependent on the user's target scene.
[0004] However, since different users may reside in different spatial environments, which differ in dimensions, functions (rooms, workplace, garden, etc.), open usable spaces, and existing furniture, developers often do not know the arrangements in advance, making it very challenging to design a virtual experience that would adapt to all users' environments. Therefore, current approaches address contextual placement by asking users themselves to identify the usable spaces in their surrounding environments or to manually position the augmented object(s) within the scene.
[0005] Currently, virtual object placement in most AR experiences is limited to specific surfaces and locations, such as placing virtual objects naively in front of the user with no scene understanding, or only using basic horizontal or vertical surface detection. These simplistic strategies can work to some extent for small virtual objects, but the methods break down for larger objects or complex scenes with multiple object augmentation requirements. Remote, multiuser interaction scenarios further exacerbate this limitation, where finding a common virtual ground physically accessible to all participants to augment their content becomes challenging. Hence, such experiences automatically become less immersive once the users encounter implausible virtual object augmentation in their environments.
[0006] The task of adding objects to existing constructed scenes falls under the problem of constrained scene synthesis. Several examples exist of this approach. However, there are currently two major challenges in these current approaches that also create bottlenecks for virtual content augmentation in spatial computing experiences. First, current, publicly available, scanned 3D datasets are limited in size and diversity, and may not offer all the data required to capture topological properties of the rooms. For instance, pose, the direction in which the object is facing, is a critical feature for understanding the orientational property of an object. Many large-scale real-world datasets such as SUN RGB-D and Matterport3D do not clearly annotate pose for all objects. The SUN RGB-D (Scene Understanding Benchmark Red Green Blue-Depth) dataset contains thousands of real RGB-D images of room scenes, where the D refers to depth. Matterport3D, by Matterport Inc., consists of a large-scale RGB-D dataset with annotations for surfaces, camera poses, and semantic segmentations, but not object poses. Therefore, more recent research has adopted synthetic datasets, which allow extraction of higher-level information such as pose, as they do not necessarily need to be manually annotated.
[0007] However, synthetic datasets have a critical drawback because they cannot capture the natural transformation and topological properties of objects in real-world settings. Furniture in real-world settings results from gradual adoption of a space, contributing to the functionality of the room and surrounding items. Topological relationships between objects in real-world scenes typically exceed theoretical design assumptions of an architect, and instead capture contextual relationships from a living environment. Additionally, the limitations of the modeling software for synthetic datasets can also introduce unwanted biases to the generated scenes. The SUNCG (Scene Understanding Benchmark Computer Generated) dataset, for instance, was built with the Planner 5D platform, an online tool which any user around the world could use by agreement, but is currently unavailable. However, it comes with modeling limitations for generating rooms and furniture. Orientations also snap to right angles by default, which makes most scenes in the dataset Manhattan-like. More importantly, the system does not indicate if the design is complete or not, namely, a user may just start playing with the software and then leave at a random time, while the resulting arrangement is still captured as a legitimate human-modeled arrangement in the dataset.
[0008] Second, recent models take advantage of implicit deep learning models and have shown promising results in synthesizing indoor scenes. Their approach falls short for content developers to parameterize customized placement in relation to standard objects in the scene, and to generate custom spatial functionalities. One major limitation of these studies lies in their lack of direct control over objects in the generated scene. For example, some researchers have reported that they cannot specify object counts or constrain the scene to contain a subset of objects. Such limitations come from the implicit nature of such networks. Implicit models produce a black-box tool, making it difficult to comprehend should an end-user wish to tweak its functions. In cases where new objects are set to be placed, implicit structures may not provide the ability to manually define new object types. Additionally, deep convolutional networks require large datasets to train, a bottleneck discussed above.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] Fig. 1 shows an embodiment of a SceneGen framework used to augment scenes with virtual objects.
[0010] Fig. 2 shows an embodiment of a workflow for SceneGen.
[0011] Fig. 3 shows an embodiment of a Scene Graph.
[0012] Fig. 4 shows examples of placement choices for objects with different topological relationships.
[0013] Fig. 5 shows an embodiment of a camera orbit in an annotation tool.
[0014] Fig. 6 shows embodiments of an annotation tool.
[0015] Fig. 7 shows embodiments of SceneGen placing objects into scenes.
[0016] Fig. 8 shows embodiments of SceneGen iteratively adding multiple virtual objects to a scene.
[0017] Fig. 9 shows visualizations of a Knowledge Model built from Scene Graphs.
[0018] Fig. 10 shows a graph of distance between ground truth object’s position and SceneGen predictions of position.
[0019] Fig. 11 shows a distance between the ground truth object’s position and the nearest of the 5 high probability positions predicted by SceneGen.
[0020] Fig. 12 shows a plot of angular distance between the ground truth orientation and the top orientation prediction of SceneGen.
[0021] Fig. 13 shows scenes of different levels.
[0022] Fig. 14 shows a graph of the result of users rating plausibility of object placements.
[0023] Fig. 15 shows the plausibility scores for each object category.
[0024] Fig. 16 shows radial histograms displaying distributions of how much a user rotated on object during a user study.
[0025] Fig. 17 shows a cumulative density plot of the distance an object was moved from its placement for different levels.
[0026] Fig. 18 shows a top 5 highest probability for SceneGen and users’ position for placing objects.
[0027] Fig. 19 shows an example of an augmented reality application using SceneGen.
[0028] Fig. 20 shows an embodiment of a system.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0029] The embodiments here introduce a generative contextual augmentation framework, referred to here as SceneGen, which provides probability maps for virtual object placements. Given a non-empty room already occupied by furniture, SceneGen provides a model-based solution to add new objects in functional placements and orientations. Figure 1 shows two different process flows, one derived from an actual target scene and another derived from a previously scanned scene.
[0030] In the top process flow, an augmented reality camera such as RGB-D (red, green, blue and depth) sensors in a room, or an augmented reality headset, scans the target scene 10. The depth sensors could comprise one or more of motion tracking sensors, optical sensors, accelerometers, Global Positioning System sensors, gyroscopes, solid state compasses, radio-frequency identification (RFID), etc. Examples of these sensors are shown in Fig. 6. One should note that the term "scene" may be a room or an open area, both of which may be referred to as a room. No limitation to an enclosed room is intended, nor should any be implied. Semantic segmentation produces an existing scene 12, which in the embodiments here means generating a Scene Graph discussed in more detail below. Virtual objects such as 14 are to be inserted into the existing scene, more than likely for rendering on a display in the augmented reality, or virtual reality, scene. Using the embodiments here, the system produces a series of 'heat maps,' essentially probability maps of different placements of the object, such as 16. The virtual object is then placed into the Scene Graph at 18. This then results in the placement in the AR scene at 20.
[0031] One should note that while the virtual objects shown in Figure 1 are furniture, other virtual objects may be used. Windows, paintings, plants, etc., may all be placed in the augmented reality scene, as examples. One such example could include an interactive or streaming object, such as a newspaper having changeable content, or a TV show being streamed on a TV in the room. Other objects may include advertisements, interactive or not, for products or services.
[0032] The bottom process flow shows the same process but using a modified previously scanned target scene 22 to visualize the performance of SceneGen. Target scene 22 is generated by removing objects 26 from the previously scanned ground truth scene 32, which is available in the Matterport3D dataset. Modified target scene 22 is then segmented to produce a semantic segmentation in accordance with Scene Graph 24, into which virtual objects 26 are to be inserted. The SceneGen heat map 28 shows the placement probabilities, and the placement of the virtual objects into the SceneGen scene at 30. The resulting scene 30 can be compared with the previously scanned ground-truth scene 32, showing that while placements may differ, SceneGen is able to place virtual objects in a contextual fashion. [0033] The embodiments also propose an interactive generative system to model the surrounding room. Contrary to the unintuitive implicit models, SceneGen is based on clear, logical object attributes and relationships. In light of the existing body of literature on semantic scene graphs, the embodiments leverage this approach to encapsulate the relevant object relationships for scene augmentation. Scene Graphs can inform the intelligent placement of virtual objects in physical scenes and will typically be the initial part of the SceneGen process.
[0034] The embodiments use kernel density estimation (KDE) to build a multivariate conditional model to encapsulate explicit positioning and clustering information for object and room types. This information will allow the process to determine likely locations to place a new object in a scene while satisfying the physical constraints. Object orientations are predicted using a probability distribution. From the calculated probabilities, the process generates a score for each potential placement of the new object, visualized as a heat map over the room. The system is user-centric and ensures that the user understands the influence of data points and object attributes on the results.
[0035] Recent work has produced extensive scans of real-world environments. The embodiments discussed below use one such dataset, Matterport3D, in place of synthetic datasets such as SUNCG. However, any real dataset or synthetic dataset will work. As a trade-off, the real-world environment data are prone to messy object scans and non-Manhattan alignments. A Manhattan-world assumption states that all surfaces in the world align with three dominant directions, such as the X, Y, and Z axes. The dataset will be referred to as the environmental dataset.
[0036] The embodiments introduce a spatial Scene Graph representation that encapsulates positional and orientational relationships of a scene, different from previous scene graphs. The Scene Graph of the embodiments captures pairwise topology between objects, object groups, and the room.
[0037] The embodiments develop a prediction model for object contextual augmentation in existing scenes. They construct an explicit Knowledge Model that is trained from Scene Graph representations captured from real-world 3D scanned data.
[0038] To learn orientational relationships from real-world 3D scanned data, the embodiments have the environmental dataset labeled with pose directions by humans. To do so, the embodiments include a labeling tool for fast pose labeling.
[0039] The embodiments include an Augmented Reality (AR) application that scans a user's room and generates a Scene Graph based on the existing objects. Using the model, the process samples poses across the room to determine a probabilistic heat map of where the object can be placed. By placing objects in poses where the spatial relationships are likely, the embodiments are able to augment scenes that are realistic.
[0040] The system of the embodiments can facilitate a wide variety of AR/VR applications. For purposes of this discussion, the term augmented reality (AR) will encompass both augmentations in Augmented Reality and virtual augmentations in Virtual Reality environments. For example, collaborative environments require placing one user's objects into another user's surroundings. More recently, adding virtual objects to scenes has been explored in online-shopping settings. This work can also apply to design industries, for example in generating 3D representations of example furniture placements. In addition, content creation for augmented and virtual reality experiences requires long hours of cross-platform development on current applications, so the system will allow faster scene generation and content generation in AR/VR experiences. [0041] Semantic Scene Graphs form one part of the overall task of scene understanding.
Given visual input, as AR experiences generally would receive, one can tackle the tasks of 3D scene reconstruction and visual relationship detection. On the latter topic, a progression of approaches has attempted to encapsulate human "common-sense" knowledge in various ways, including physical constraints and statistical priors, physical constraints and stability reasoning, physics-based stability modeling, language priors, and statistical modeling with deep learning. Another approach takes advantage of the regularity and repetition of furniture arrangements in certain indoor spaces, such as office buildings. Another proposed technique potentially well suited for AR applications builds a 3D reconstruction of the scene through consecutive depth acquisitions, which could be taken incrementally as a user moves within their environment. Recent work has addressed problems like learning 3D scene layouts from 2D panoramic input, building scenes from 3D point clouds, and 3D plane reconstruction from a single image. The embodiments leverage these approaches on scene understanding, because the model operates on the assumption that one already has locations and bounding boxes of the existing objects in the scene. However, the assumptions do not necessarily rely upon these approaches; other sources of the assumptions could be used.
[0042] Semantic Scene Graphs have been applied to various tasks in the past, including image retrieval, visual question answering, image caption generation, and more. The past research can be divided into two approaches: (1) separate stages of object detection and graph inference, and (2) joint inference of object classes and graph relationships. The first approach often leverages existing object detection networks. Similarly to other scene understanding tasks, many methods also involved learning prior knowledge of common scene structures in order to apply them to new scenes, such as physical constraints from stability reasoning or frequency priors represented as recurring scene motifs. Most methods were benchmarked based on the Visual Genome dataset, a dataset that not only identifies objects in an image, but also relationships between the objects.
[0043] However, recent studies found this dataset to have an uneven distribution of examples across its data space. In response, some have proposed new networks to draw from an external knowledge base and to utilize statistical correlations between objects and relationships, respectively. The work focuses on the task of construction and utilization of the semantic Scene Graph. As in some approaches, the embodiments also use statistical relationships and dataset priors, but the embodiments use a mathematical model rather than deep learning.
[0044] The general goal of indoor scene synthesis is to produce a feasible furniture layout of various object classes that addresses both functional and aesthetic criteria. Early work on synthetic generation focused on hard-coded rules, guidelines and grammars, resembling a procedural approach to this problem. Some approaches used hard-coded design guidelines as priors for the scene generation process, extracted by consulting manuals on furniture layout and interviewing professional designers who specialize in arranging furniture. A similar approach attempted synthesizing open world layouts with hard-coded factor graphs. [0045] One approach synthesized scenes by training to build a probabilistic model based on Bayesian networks and Gaussian mixtures, but had issues with generating an entire scene, and utilized a more limited set of input example scenes. Another synthesized a full 3D scene iteratively by adding a single object at a time. This system learned some priors similar to the embodiments, including pairwise and higher-order object relations. Compared to this work, the embodiments incorporate additional priors, including objects' relative position within the room bounds. [0046] Other approaches also took room functions into account, positing that extracting topological priors should extend to room functions and their activities, which would impact the pair-wise relationships between objects. While object topologies differ across various room functions, a major challenge in this approach is that not all spaces can be classified with a certain room function. For instance, in a small studio apartment, the living room might serve additional functions such as dining room and study space. Another work proposed a similar approach, involving a Gaussian mixture model and kernel density estimation. However, that work targeted the inverse of the problem addressed by the embodiments: it received a selected object location as input and was asked to predict an object type. The inventors find the present problem to be more relevant to the needs of a content creator who knows what object they wish to place in a scene, but does not have prior knowledge about the user's surroundings.
[0047] Another data-driven approach to scene generation involves modeling human activities and interactions with the scene, generally seeking to model and adjust the entire scene according to human actions or presence. There have also been a number of interesting studies that take advantage of logical structures modeled for natural language processing (NLP) scenarios.
[0048] More specifically, one approach bears a minor resemblance to the present approach, in 1) training on object relations, and 2) the ability to augment an initial input scene [Rui Ma, Akshay Gadi Patil, Matthew Fisher, Manyi Li, Soren Pirk, Binh-Son Hua, Sai-Kit Yeung, Xin Tong, Leonidas Guibas, and Hao Zhang, Language-driven synthesis of 3D scenes from scene databases. In SIGGRAPH Asia 2018 Technical Papers. ACM, 212]. In contrast to the embodiments, that approach augments scenes by merging in subscenes retrieved from a database. [0049] More recent work endeavors to improve learning-based methods, using deep convolutional priors, scene autoencoding, and new representations of object semantics, to name just a few. One approach addressed a related but distinct problem of synthesizing a scene by arranging and grouping an input set of objects. Another example uses deep generative models for scene synthesis. This method sampled each object attribute with a single inference step to allow constrained scene synthesis. One extension of this work proposed a combination of two separate convolutional networks to address constrained scene synthesis problems, arguing that object-level relationships facilitate high-level planning of how a room should be laid out, while room-level relationships perform well at placing objects in precise spatial configurations.
[0050] In contrast, the process of the embodiments seeks to add in individual objects, which is more aligned with the needs of creators of augmented reality experiences. One approach proposed generating a 3D scene representation by recreating the scene from RGB-D image input, using retrieved and aligned 3D models, but does not handle adding new objects. The embodiments here differ from the discussed studies in 1) utilizing an explicit model rather than an implicit structure, 2) taking advantage of higher level relationships with the room itself in the proposed Scene Graph, and 3) generating a probability map which would guide the end user on potential locations for object augmentation.
[0051] The embodiments here, SceneGen, provide a framework to augment scenes with virtual objects using a generative model to maximize the likelihood of the relationships captured in a spatial Scene Graph. Specifically, given a partially filled room, SceneGen will augment it with one or multiple new virtual objects in a realistic manner using an explicit model trained on relationships between objects in the real world. The SceneGen workflow is shown in Figure 2. [0052] In Figure 2, the portion of the process within box 42 shows the training process, that is, the creation of the knowledge model. The white boxes such as 46 show the test-time procedure of sampling, and the gray boxes such as 44 show the application of the model.
[0053] The embodiments introduce a novel spatial Scene Graph that converts a room and the objects included in it to a graphical representation using extracted spatial features. A Scene Graph is defined by nodes representing objects, object groups, and the room, and by its edges representing the spatial relationships between the nodes. While various objects hold different individual functions, such as a chair to sit, a table to dine, etc., their combinations and topological relationships tend to generate the main functional purpose of the space.
[0054] In other words, spatial functions are created by the pair-wise topologies of objects and their relationship with the room. In the proposed Scene Graph representation, the process will explicitly extract a wide variety of positional and orientational relationships that can be present between objects. The process models descriptive topologies that are commonly utilized by architects and interior designers to generate spatial functionalities in a given space. Therefore, the Scene Graph representation of the embodiments can also be described as a function map, where the objects, or nodes, and their relationships, or edges, correspond to a single or multiple spatial functionalities present in a scene. Figure 3 illustrates two examples of the Scene Graph representation, where a subset of topological features is visualized in the graph.
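As an illustration of the kind of data structure such a spatial Scene Graph implies, the following minimal Python sketch stores objects as nodes and pairwise relationships as labeled edges. The class and field names (SceneObject, SceneGraph, edges) are illustrative assumptions, not terminology from the disclosure.

```python
# Minimal sketch of a spatial Scene Graph container; the class and field names are
# illustrative assumptions, not terminology from the disclosure.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SceneObject:
    label: str                      # e.g. "chair", "table"
    center: Tuple[float, float]     # position on the (x, y) floor plane
    half_size: Tuple[float, float]  # half-extents of the 2D bounding box
    theta: float = 0.0              # orientation of the primary axis a_k, in radians

@dataclass
class SceneGraph:
    # wall segments of the l-sided room, each given as a pair of endpoints
    room_walls: List[Tuple[Tuple[float, float], Tuple[float, float]]]
    objects: List[SceneObject] = field(default_factory=list)
    # labeled edges, e.g. ("chair_0", "table", "SurroundedBy", 2)
    edges: List[Tuple[str, str, str, float]] = field(default_factory=list)

    def groups(self) -> Dict[str, List[SceneObject]]:
        """Group nodes by object category (the sets g_i in the text)."""
        out: Dict[str, List[SceneObject]] = {}
        for obj in self.objects:
            out.setdefault(obj.label, []).append(obj)
        return out
```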
[0055] The discussion here considers a room or a scene in 3D space where its floor is on the flat (x, y)-plane and the z-axis is orthogonal to the (x, y)-plane. In this orientation, the process denotes the room space in a floorplan representation as R, such as in 50 and 52, namely, an orthographic projection of its 3D geometry plus a possible adjacency relationship in which objects in R may overlap on the (x, y)-plane but sit on top of one another along the z-axis. Specifically, the "support" relationship is defined below. This can also be viewed as a 2.5-D representation of the space.
[0056] Further denote the k-th object, such as a bed or a table in R, as Ok. The collection of all n objects in R is denoted as O = {O1, O2, ...}. B(Ok) represents the bounding box of the object Ok, and Ōk represents the center of the object Ok. Every object Ok has a label to classify its type. Related to the same R, there is also a set of groups G = {g1, ..., gm}, where each group gi contains all objects of the same type within R.
[0057] Furthermore, each Ok has a primary axis ak and a secondary axis bk. For Asymmetric objects, ak represents the orientation of the object. Vectors ak and bk are both unit vectors such that bk is a π/2 radian counterclockwise rotation of ak. The process defines θak and θbk to be the angles in radians represented by ak and bk, respectively.
[0058] For each room R, the process defines W = {W1, W2, ...}, where each Wl is a wall of the l-sided room. In the floor plan representation, each wall Wl is represented by a 1D line segment. The process also introduces a distance function d(a, b) as the shortest distance between entities a and b. For example, d(B(Ok), R) is the shortest distance between the bounding box of Ok and the center of the room R.
[0059] The process first introduces features for objects based on their spatial positions in a scene. The process includes pairwise relationships between objects, such as between a chair and a desk, relationships with object groups, such as between a dining table and dining chairs, and relationships between an object and the room.
[0060] The room position feature, RoomPosition, of an object denotes whether an object is at the middle, edge, or corner of a room. This is based on how many walls an object is less than r distance from:
$$\mathrm{RoomPosition}(O_k, R) \;=\; \sum_{W_l \in W} \mathbf{1}\big[\, d(B(O_k), W_l) < r \,\big] \qquad (1)$$
In other words, if RoomPosition(Ok, R) ≥ 2, the object is near at least 2 walls of a room, and hence is near a corner of the room; if RoomPosition(Ok, R) = 1, the object is near only one wall of the room and is at the edge of the room; otherwise, the object is not near any wall and is in the middle of the room.
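A minimal Python sketch of the RoomPosition feature as reconstructed in (1) follows; it assumes walls are given as 2D line segments and approximates the box-to-wall distance by the nearest bounding-box corner, which is a simplification of the distance function d used above.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]

def point_segment_distance(p: Point, seg: Segment) -> float:
    """Shortest distance from point p to a wall segment."""
    (x1, y1), (x2, y2) = seg
    px, py = p
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(px - x1, py - y1)
    t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))

def box_corners(center: Point, half_size: Point) -> List[Point]:
    cx, cy = center
    hx, hy = half_size
    return [(cx - hx, cy - hy), (cx + hx, cy - hy), (cx + hx, cy + hy), (cx - hx, cy + hy)]

def room_position(center: Point, half_size: Point, walls: List[Segment], r: float) -> int:
    """Count the walls within distance r of the object's bounding box, as in (1).
    The box-to-wall distance is approximated here by the nearest box corner."""
    count = 0
    for wall in walls:
        if min(point_segment_distance(c, wall) for c in box_corners(center, half_size)) < r:
            count += 1
    return count

# count >= 2 -> near a corner, count == 1 -> at an edge, count == 0 -> middle of the room
```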
[0061] For each object and each group of objects, the process calculates the average distance, AverageDistance, between that object and all objects within that group. For cases where the object is a member of the group, the process does not count the distance between the object in question and itself in the average.
$$\mathrm{AverageDistance}(O_k, g_i) \;=\; \frac{1}{|g_i \setminus \{O_k\}|} \sum_{O_j \in g_i \setminus \{O_k\}} d(O_k, O_j) \qquad (2)$$
For each object and each group of objects, the process computes how many objects in the group are within a distance ε of the object, called here SurroundedBy. For cases where the object is a member of the group, the process does not count the object in question.
$$\mathrm{SurroundedBy}(O_k, g_i) \;=\; \sum_{O_j \in g_i \setminus \{O_k\}} \mathbf{1}\big[\, d(B(O_k), B(O_j)) < \epsilon \,\big] \qquad (3)$$
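The two group features above can be computed directly from the scene. The sketch below is a simplified illustration of (2) and (3): it uses center-to-center distances rather than bounding-box distances, and the 1000 constant for empty groups follows the convention described later in the disclosure.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def average_distance(obj: Point, group: List[Point]) -> float:
    """Mean distance from obj to the members of a group, as in (2); obj itself is skipped.
    Center-to-center distances are used here as a simplification."""
    others = [g for g in group if g is not obj]
    if not others:
        return 1000.0  # the large constant used when no objects of the group are in the room
    return sum(math.dist(obj, g) for g in others) / len(others)

def surrounded_by(obj: Point, group: List[Point], eps: float) -> int:
    """Number of group members within distance eps of obj, as in (3); obj itself is skipped."""
    return sum(1 for g in group if g is not obj and math.dist(obj, g) < eps)
```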
[0062] An object is considered to be supported by a group if it is directly on top of an object from the group, or supports a group if it is directly underneath an object from the group.
$$\mathrm{Support}(O_k, g_i) \;=\; \mathbf{1}\big[\, \exists\, O_j \in g_i : B(O_k) \text{ is directly on top of } B(O_j) \ \text{or} \ B(O_j) \text{ is directly on top of } B(O_k) \,\big] \qquad (4)$$
[0063] The process categorizes the objects in the scenes into three main groups. The first group is Gsym, which includes Symmetric objects such as coffee tables and house plants that have no clear front-facing direction. The second group is Gasym, which includes Asymmetric objects such as beds and chairs that can be oriented to face in a specific direction. The third group is Gin, which includes Inside Facing objects such as paintings and storage that always face opposite to the wall of the room where they are situated.
[0064] The discussion below covers features applicable to objects with a defined facing direction, and not to Symmetric objects. The process first defines an indicator function that is 1 if a ray extending from the center of an object in the direction dk intersects a wall Wl.
$$\mathbf{1}_{\mathrm{hit}}(O_k, d_k, W_l) \;=\; \begin{cases} 1 & \text{if the ray from } \bar{O}_k \text{ in direction } d_k \text{ intersects } W_l \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
An object is considered to be facing towards the center of the room, TowardsCenter, if a ray extending from the center of the object intersects one of the ⌈l/2⌉ walls furthest from the object:

$$\mathrm{TowardsCenter}(O_k, R) \;=\; \mathbf{1}\Big[\, \sum_{W_l \in W_{\mathrm{far}}(O_k)} \mathbf{1}_{\mathrm{hit}}(O_k, a_k, W_l) \;\ge\; 1 \,\Big] \qquad (6)$$

where W_far(Ok) denotes the ⌈l/2⌉ walls of W furthest from the object.
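The orientation features rely on simple 2D ray casts against the room walls. The following sketch illustrates the indicator in (5) and the TowardsCenter test, under the assumption that walls are line segments; the distance to wall endpoints is used as a rough proxy for wall distance when selecting the furthest walls.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]

def ray_hits_segment(origin: Point, direction: Point, seg: Segment) -> bool:
    """Indicator in the spirit of (5): does the ray from origin along direction hit the wall?"""
    (x1, y1), (x2, y2) = seg
    ox, oy = origin
    dx, dy = direction
    ex, ey = x2 - x1, y2 - y1
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:                              # ray parallel to the wall
        return False
    t = ((x1 - ox) * ey - (y1 - oy) * ex) / denom       # distance along the ray
    u = ((x1 - ox) * dy - (y1 - oy) * dx) / denom       # position along the segment
    return t >= 0.0 and 0.0 <= u <= 1.0

def towards_center(center: Point, theta: float, walls: List[Segment]) -> bool:
    """Does the facing ray hit one of the ceil(l/2) furthest walls, as in (6)?
    Endpoint distance is used as a rough proxy for the wall distance."""
    direction = (math.cos(theta), math.sin(theta))
    by_distance = sorted(
        walls,
        key=lambda seg: min(math.dist(center, seg[0]), math.dist(center, seg[1])),
        reverse=True)
    far_walls = by_distance[: math.ceil(len(walls) / 2)]
    return any(ray_hits_segment(center, direction, seg) for seg in far_walls)
```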
[0065] An object is considered facing away from a wall if it is oriented away from and is normal to the closest wall to the object.
(The corresponding equations, rendered as images in the original, formalize this condition using the wall-intersection indicator of (5) applied to the wall closest to the object.)
[0066] An object has a similar direction as one or more objects within a constant ε distance from the object if the other objects are facing in the same direction or in the opposite direction (π radians apart) from the first object, subject to some small angular error φ.
$$\mathrm{DirectionSimilarity}(O_k, g_i) \;=\; \sum_{\substack{O_j \in g_i \\ d(B(O_k), B(O_j)) < \epsilon}} \mathbf{1}\big[\, \min\big(\Delta_{kj},\; |\Delta_{kj} - \pi|,\; |\Delta_{kj} - 2\pi|\big) < \phi \,\big], \qquad \Delta_{kj} = |\theta_{a_k} - \theta_{a_j}| \qquad (9)$$
[0067] The process first defines an indicator function that is 1 if a ray extending from the center of the object in direction dk intersects the bounding box of a second object.
$$\mathbf{1}_{\mathrm{hit}}(O_k, d_k, O_j) \;=\; \begin{cases} 1 & \text{if the ray from } \bar{O}_k \text{ in direction } d_k \text{ intersects } B(O_j) \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$
Between an object and a group of objects, the process counts how many objects of the group are within a distance ε of the object and are in the direction of the primary axis of the first object:

$$\mathrm{Facing}(O_k, g_i) \;=\; \sum_{\substack{O_j \in g_i \\ d(B(O_k), B(O_j)) < \epsilon}} \mathbf{1}_{\mathrm{hit}}(O_k, a_k, O_j) \qquad (11)$$
[0068] Between an object and a group of objects, the process counts how many objects of the group are within a distance ε of the object and are in the direction of the positive or negative secondary axis of the first object:

$$\mathrm{NextTo}(O_k, g_i) \;=\; \sum_{\substack{O_j \in g_i \\ d(B(O_k), B(O_j)) < \epsilon}} \max\big(\mathbf{1}_{\mathrm{hit}}(O_k, b_k, O_j),\; \mathbf{1}_{\mathrm{hit}}(O_k, -b_k, O_j)\big) \qquad (12)$$
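Facing and NextTo reduce to the same kind of ray test, but against object bounding boxes rather than walls. The sketch below uses a standard slab test against axis-aligned 2D boxes, which is a simplification (the boxes in the disclosure are oriented); NextTo in (12) would call the same routine with the positive and negative secondary axis instead of the primary axis.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]

def ray_hits_box(origin: Point, direction: Point, box_center: Point, half_size: Point) -> bool:
    """Ray vs. axis-aligned 2D box (slab test), in the spirit of the indicator in (10)."""
    tmin, tmax = 0.0, math.inf
    for o, d, c, h in zip(origin, direction, box_center, half_size):
        lo, hi = c - h, c + h
        if abs(d) < 1e-12:
            if o < lo or o > hi:           # parallel to this slab and outside it
                return False
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        tmin, tmax = max(tmin, min(t1, t2)), min(tmax, max(t1, t2))
        if tmin > tmax:
            return False
    return True

def facing_count(origin: Point, theta: float,
                 group: Iterable[Tuple[Point, Point]], eps: float) -> int:
    """Count group members (given as (center, half_size) pairs) within eps of the object
    that lie along its primary axis, as in (11)."""
    d = (math.cos(theta), math.sin(theta))
    return sum(1 for c, h in group
               if math.dist(origin, c) < eps and ray_hits_box(origin, d, c, h))
```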
[0069] To evaluate the plausibility of a new arrangement, the process compares its corresponding Scene Graphs (Scene Graph 54 corresponding to scene 50 and Scene Graph 56 corresponding to scene 52) with a population of viable Scene Graph priors. By extracting Scene Graphs from a corpus of rooms, the process constructs a Knowledge Model that serves as the spatial priors for the position and orientation relationships of each object group. For each object instance, the process assembles a data vector for positional features from G. For Asymmetric objects, the process similarly creates a data vector for orientational features. First the process defines the following, which represent an object's relationships with all groups:
$$\mathrm{AD}(O_k) = \big[\mathrm{AverageDistance}(O_k, g_i)\big]_{g_i \in G}, \quad \mathrm{S}(O_k) = \big[\mathrm{SurroundedBy}(O_k, g_i)\big]_{g_i \in G}, \quad \mathrm{SP}(O_k) = \big[\mathrm{Support}(O_k, g_i)\big]_{g_i \in G},$$
$$\mathrm{F}(O_k) = \big[\mathrm{Facing}(O_k, g_i)\big]_{g_i \in G}, \quad \mathrm{NT}(O_k) = \big[\mathrm{NextTo}(O_k, g_i)\big]_{g_i \in G}, \quad \mathrm{DS}(O_k) = \big[\mathrm{DirectionSimilarity}(O_k, g_i)\big]_{g_i \in G}$$

[0070] This allows construction of data arrays, dP(Ok) and dO(Ok), containing features that relate to the position and orientation of an object, respectively. RoomPosition is also included in the data array for orientational features, dO, since the other features of dO are strongly correlated with an object's position in the room. This is abbreviated as RP. The process also abbreviates TowardsCenter to TC and DirectionSimilarity to DS. For succinctness, when using these abbreviations for the features, the parameter Ok is dropped from the notation.

$$d_P(O_k) \;=\; \big[\, \mathrm{AD},\; \mathrm{S},\; \mathrm{SP},\; \mathrm{RP} \,\big], \qquad d_O(O_k) \;=\; \big[\, \mathrm{F},\; \mathrm{TC},\; \mathrm{NT},\; \mathrm{DS},\; \mathrm{RP} \,\big] \qquad (14)$$
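One plausible way to assemble the per-object feature vectors in (14) is sketched below; the function name, argument layout, and fixed ordering of groups are assumptions made for illustration, and the individual feature functions are passed in as callables.

```python
from typing import Callable, Dict, List, Tuple

def build_feature_vectors(obj, room,
                          groups: Dict[str, list],
                          avg_dist: Callable, surrounded: Callable, supported: Callable,
                          facing: Callable, next_to: Callable, dir_sim: Callable,
                          towards_center: Callable, room_pos: Callable
                          ) -> Tuple[List[float], List[float]]:
    """Assemble d_P and d_O for one object in the layout of (14); the feature functions
    are passed in as callables so the sketch stays independent of their implementations."""
    names = sorted(groups)                        # fixed group ordering keeps vectors comparable
    d_p: List[float] = []
    d_o: List[float] = []
    for name in names:                            # per-group positional features: AD, S, SP
        g = groups[name]
        d_p += [avg_dist(obj, g), surrounded(obj, g), supported(obj, g)]
    d_p.append(room_pos(obj, room))               # RP closes the positional vector
    for name in names:                            # per-group orientational features: F, NT, DS
        g = groups[name]
        d_o += [facing(obj, g), next_to(obj, g), dir_sim(obj, g)]
    d_o += [towards_center(obj, room), room_pos(obj, room)]   # TC and RP
    return d_p, d_o
```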
[0071] Finally, given one feature vector per object for position and orientation, respectively, one can collect more samples from a database, discussed below, to form the Knowledge Model. The model collects feature vectors separately with respect to different object types in multiple room spaces. To do so, the process introduces gi,j to collect all of the i-th type objects in room Rj. Without loss of generality, one can assume that the i-th object type is the same across all rooms. Therefore, one can collect all the objects of the same i-th type from a database as

$$g_{i,*} \;=\; \bigcup_{j} g_{i,j}$$

[0072] Then DP(gi,*) and DO(gi,*) represent the collections of all feature vectors in (14) from objects in gi,*:

$$D_P(g_{i,*}) \;=\; \big\{\, d_P(O_k) : O_k \in g_{i,*} \,\big\}, \qquad D_O(g_{i,*}) \;=\; \big\{\, d_O(O_k) : O_k \in g_{i,*} \,\big\} \qquad (15)$$
[0073] Given the feature samples for the same type of object in (15), now the process can estimate their likelihood distribution. In particular, given an object placement O of the i-th type, the process seeks to estimate the likelihood function for its position features:
$$P\big(\, d_P(O) \,\big|\, D_P(g_{i,*}) \,\big) \qquad (16)$$
[0074] If O is asymmetric, the process also seeks to estimate the likelihood function for its orientation features:
$$P\big(\, d_O(O) \,\big|\, D_O(g_{i,*}) \,\big) \qquad (17)$$
However, if O is an Inside Facing object, then with certainty its orientation will be determined by that of its adjacent wall. Additionally, if O is a Symmetric object, it has no clear orientation. Therefore, for these categories of objects, estimation of their orientation likelihood is not needed.
[0075] One can approximate the shape of these distributions using multivariate kernel density estimation (KDE). Kernel density estimation is a non-parametric way to create a smooth function approximating the true distribution by summing kernel functions, K, placed at each observation X1, ..., Xn.
$$\hat{f}(x) \;=\; \frac{1}{n} \sum_{i=1}^{n} K_h\big(x - X_i\big) \qquad (18)$$

where the subscript h denotes the kernel bandwidth.
This allows the process to estimate the probability density function (PDF) of the position and orientation relationships from the spatial priors in the Knowledge Model, DP(gi,*) and DO(gi,*), for each group gi.
[0076] One embodiment of the SceneGen process is shown below. Given a room model R and a set of existing objects O = {O1, O2, ...}, the process evaluates the position and orientation likelihood of augmenting a new object O' and recommends its most likely poses.
(Algorithm listing, rendered as an image in the original: given R and O, SceneGen samples candidate poses for the new object O', scores each sample with the likelihoods in (16) and (17), and returns the highest-scoring poses together with a probability map over the sampled positions.)
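A compact sketch of how such a placement loop could be organized is shown below. The function name augment_object, the grid-based position sampling, and the fixed set of candidate angles are illustrative assumptions; the likelihood callbacks stand in for the Scene Graph feature extraction and the KDE evaluations of (16) and (17).

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]

def augment_object(room_bounds: Tuple[float, float, float, float],
                   position_likelihood: Callable[[Point], float],
                   orientation_likelihood: Callable[[Point, float], float],
                   n_samples: int = 250,
                   n_angles: int = 8) -> Tuple[Point, float, List[Tuple[Point, float]]]:
    """Sample candidate positions on a grid, score each with the learned position likelihood,
    then sample orientations at the best position; returns the pose and the heat map."""
    xmin, ymin, xmax, ymax = room_bounds
    side = max(1, int(math.sqrt(n_samples)))
    heat_map: List[Tuple[Point, float]] = []
    for i in range(side):
        for j in range(side):
            p = (xmin + (i + 0.5) * (xmax - xmin) / side,
                 ymin + (j + 0.5) * (ymax - ymin) / side)
            heat_map.append((p, position_likelihood(p)))
    best_pos = max(heat_map, key=lambda entry: entry[1])[0]
    best_theta = max((2.0 * math.pi * k / n_angles for k in range(n_angles)),
                     key=lambda th: orientation_likelihood(best_pos, th))
    return best_pos, best_theta, heat_map
```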
[0077] Figure 4 shows how potential scene graphs are created for sampled placements. For scenes where multiple objects need to be added, the process repeats the above process for each additional object.
[0078] The implementation details of the SceneGen framework may be based on the relationship data learned from the environmental dataset. The Matterport3D dataset, used as the environmental dataset here by way of example, is a large-scale RGB-D dataset containing 90 building-scale scenes. The dataset consists of various building types with diverse architecture styles, including numerous spatial functionalities and furniture layouts. Annotations of building elements and furniture have been provided with surface reconstruction as well as 2D and 3D semantic segmentation.
[0079] In order to use the Matterport3D dataset as priors for SceneGen of the embodiments, one may need to make a few modifications to standardize object orientations using an annotation tool developed by the inventors. In particular, the annotation tool of the embodiments interacts with the dataset in a fully 3D environment. After the annotation, the relationship data then are consolidated back to the 2.5-D representation conforming to the computation of the SceneGen models.
[0080] For each object Ok, the environmental dataset supplies labeled oriented 3D bounding boxes B(O) aligned to the (x, y)-plane. Each box is defined by a center position Ō, a primary axis a, a secondary axis b, an implicit tertiary axis c, and a size vector r that denotes the bounding box size of O divided in half. However, the environmental dataset may not provide information about which labeled direction the object is facing or which axis aligns with the z-axis. Hence, the process may rely on the labeling tool to resolve the ambiguities.
[0081] To provide a consistent definition, the embodiments describe a scheme to label these axes such that the primary axis a points in the direction the object is facing, a*. Since the process knows that only one of these three axes has a z component, it stores this component in the third axis c and defines b to be orthogonal to a on the (x, y) plane. The box size r will also be updated to correspond to the correct axes. By constraining these axes to be right handed, for a given a* one has:

$$a = a^*, \qquad c = (0, 0, 1)^{\top}, \qquad b = c \times a \qquad (19)$$
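A short sketch of the axis standardization implied by (19), assuming the annotated facing direction a* lies in the (x, y) plane and the z-carrying axis points straight up; the reordering of the half-extent vector is left as a comment because the exact permutation depends on the annotation.

```python
import numpy as np

def standardize_axes(a_star: np.ndarray, box_half_size: np.ndarray):
    """Relabel the bounding-box axes given an annotated facing direction a* (assumed to lie
    in the (x, y) plane): a takes a*, c carries the z component, and b = c x a closes a
    right-handed frame, i.e. b is a pi/2 counterclockwise rotation of a in the plane."""
    a = np.asarray(a_star, dtype=float)
    a = a / np.linalg.norm(a)
    c = np.array([0.0, 0.0, 1.0])        # the single axis with a z component
    b = np.cross(c, a)
    b = b / np.linalg.norm(b)
    # box_half_size is returned unchanged here; in practice its entries would be permuted
    # so they stay matched with the relabeled (a, b, c) axes
    return a, b, c, box_half_size
```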
[0082] In order to correctly relabel each object, the inventors have developed an application to facilitate the identification of the correct primary axis for all Asymmetric objects and supplemented this to the updated data set.
[0083] For each object, the process views the house model mesh at different camera positions such as 60 around the bounding box 62, or the camera position 64 and bounding box 66, in order to determine the primary axis of the object as displayed in Figure 5. Figure 6 shows an embodiment of an annotation tool that allows a labeling user to select from two possible directions at each camera position, or to move the camera clockwise or counterclockwise to get a better view. Once a selection is made, the orienting axis a* can be determined by knowing which camera is being looked through and the direction selected. The process then uses equation (19) to standardize the axes. Using the annotation tool, the orientations of all objects in a typical house scan can be labeled in about 5 minutes. One example of labeling is shown in Figure 6.
[0084] For this study, the process then reduced the categories of object types considered for building the model and placing new objects. Though the environmental dataset includes many different types of furniture, organized with room labels to describe furniture function such as "dining chair" vs. "office chair", the dataset may have a limited number of instances for many object categories. Because the process builds statistical models for each object category, the process requires an adequate representation of each category. The process reduces the categories to a better-represented subset for the purposes of this study.
[0085] The process groups the objects into 9 broader categories: G = {Bed, Chair, Decor, Picture, Sofa, Storage, Table, TV, Other}. Each of these categories has a specific type of orientation, as described above. Of these categories, Asymmetric objects are Gasym = {Bed, Chair, Sofa, TV}, Symmetric objects are Gsym = {Decor, Table}, and Inside Facing objects are Gin = {Picture, Storage}.
[0086] For room types, the process considers the set {library, living room, meeting room, TV room, bedroom, rec room, office, dining room, family room, kitchen, lounge} to avoid overly specialized rooms such as balconies, garages and stairs. The process also manually eliminates unusually small or large rooms with outlier areas and rooms where scans and bounding boxes are incorrect.
[0087] In one embodiment, after the data reduction to eliminate redundant and irrelevant data, the process considers a total of 1,326 rooms and 7,017 objects in the training and validation sets. The object and room categories used can be expanded if sufficient data is available. One should note that fewer or more categories for objects and rooms may be employed depending upon any data set requirements or restrictions.
[0088] The process uses the processed dataset as a prior to train the SceneGen Knowledge Model. The procedure first estimates each object Ok according to (14), and subsequently constructs DP(gi,*) and DO(gi,*) in (15) for categories in G and Gasym, respectively. The process may not construct models for the 'Other' category, as objects contained in this category may be sparse and unrelated to each other. Given the priors, the process estimates the likelihood functions P(dP(O) | DP(gi,*)) and P(dO(O) | DO(gi,*)) from (16) and (17) using Kernel Density Estimation (KDE).
[0089] The process utilizes a KDE library developed by Seabold and Perktold [Skipper Seabold and Josef Perktold, 2010, statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference.] with a normal reference rule-of-thumb bandwidth and ordered, discrete variable types. The process makes an exception for AverageDistance, which is continuous. When there are no objects of a certain group gi in a room, the value of AverageDistance(Ok, gi) is set to a large constant (1000), and a manually tuned bandwidth (0.1) is used to reduce the impact of this on the rest of the distribution. [0090] The inventors found that for this particular dataset, a subset of features, Facing, TowardsCenter and RoomPosition, have the most impact in predicting orientation, as detailed below. Therefore, while the embodiments model all of the orientational features, only the Facing, TowardsCenter and RoomPosition features are used in this implementation of SceneGen and in the User Studies. Finally, due to overlapping bounding boxes in the dataset, calculating object support relationships (SP) precisely is not possible. The embodiments allow certain natural overlaps, defined heuristically, instead of using these features. A visualization of the priors from the environmental dataset can be seen in Figure 9.
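Since the disclosure cites the statsmodels KDE implementation, a hedged sketch of how the mixed discrete/continuous model could be fit with that library follows. The feature layout (AverageDistance in the first column) and the sample values are assumptions for illustration; the bandwidth handling follows the normal-reference rule with the manual 0.1 override described above.

```python
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariate

# Toy prior D_P for one object category: the first column stands in for the continuous
# AverageDistance feature, the remaining columns for ordered discrete counts.
D_p = np.array([
    [1.8, 2, 1, 0],
    [2.1, 3, 1, 1],
    [1000.0, 0, 0, 2],   # 1000 encodes "no objects of that group in the room"
    [1.5, 2, 0, 0],
])
var_type = "c" + "o" * (D_p.shape[1] - 1)   # continuous + ordered discrete

# Fit once with the normal-reference rule, then override the AverageDistance bandwidth
# with the manually tuned value (0.1) and refit with the user-specified bandwidths.
kde = KDEMultivariate(data=D_p, var_type=var_type, bw="normal_reference")
bw = kde.bw.copy()
bw[0] = 0.1
kde = KDEMultivariate(data=D_p, var_type=var_type, bw=bw)

# Likelihood of a candidate placement's feature vector, in the spirit of (16)
candidate = np.array([[1.7, 2, 1, 0]])
print(kde.pdf(candidate))
```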
[0091] The embodiments use the SceneGen process to augment a room R with an object of type i and generate a probability heat map. This can be repeated in order to add multiple objects. To speed up computation in this implementation, the process first samples positions, and then samples orientations at the most probable position, instead of sampling orientations at every possible position.
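When several objects are added, each placement can simply be appended to the scene before the next object is scored, so later probability maps account for earlier augmentations. A small sketch of this loop, with assumed function names carried over from the earlier placement sketch:

```python
def add_objects(scene, room_bounds, labels, augment_object, scene_likelihoods):
    """Place several objects one at a time; each placement is appended to the scene so that
    the next object's likelihoods (and heat map) are conditioned on it."""
    for label in labels:                               # e.g. ["sofa", "table", "decor"]
        pos_fn, orient_fn = scene_likelihoods(scene, label)
        position, theta, _heat_map = augment_object(room_bounds, pos_fn, orient_fn)
        scene.append((label, position, theta))
    return scene
```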
[0092] Figure 7 shows how an implementation of SceneGen adds a new object to a scene. SceneGen places objects into scenes by extracting a Scene Graph from each room, as shown in Figure 3, sampling positions and orientations to create probability maps, and then placing the object in the most probable pose. Each column represents a room and the placement of the object. In (a) a sofa is placed in a living room, in (b) a sofa is placed in a living room, in (c) a chair is placed in an office, in (d) a table is placed in a dining room, and in (e) a storage bin is placed in a bedroom. Figure 8 shows examples of scenes augmented with multiple objects iteratively.
[0093] If machine learning is used, the process trains and evaluates the model using a machine with a 4-core Intel i7-4770HQ CPU and 16GB of RAM. In training, creating the Knowledge Model and estimating distributions for 8 categories of objects takes approximately 12 seconds. In testing, it takes ~2 seconds to extract a scene graph and generate a heat map indicating the probabilities of 250 sampled poses.
[0094] To evaluate the prediction system, the process runs ablation studies, examining how the presence or absence of particular features affects the object position and orientation prediction results. The process uses a K=4-fold cross validation method on the ablation studies, with 100 rooms in each validation set and the remaining rooms in the training set. [0095] The full position prediction model, SceneGen, trains on three features: AverageDistance (AD), SurroundedBy (S), and RoomPosition (RP), or AD+S+RP in short. The process creates three reduced versions of the system: AD+RP, using only AverageDistance and RoomPosition features; S+RP, using only SurroundedBy and RoomPosition features; and RP, solely using the RoomPosition feature.
[0096] The process evaluates each system using the K-fold method described above. In this study, it removes each object in the validation set, one at a time, and uses the model to predict where the removed object should be positioned. The orientation of the replaced object will be the same as the original. The process computes the distance between the original object location and the system’s prediction.
[0097] However, as inhabitants of actual rooms, the inventors are aware that there is often more than one plausible placement of an object, though some may be more optimal than others. Thus, the inventors raise the question of whether there is more than one ground truth or correct answer for the object placement problem. Hence, in addition to validating the model’s features, the first ablation study validates them in relation to the simple approach of taking the single highest-scored location from the system. Meanwhile, the second ablation study uses the top 5 highest-scored locations, opening up examination to multiple potential "right answers".
[0098] The inventors run a similar experiment to evaluate the orientation prediction models for Asymmetric objects. The Scene Graphs capture 5 relationships based on the orientation of the objects: Facing (F), TowardsCenter (C), NextTo (NT), DirectionSimilarity (DS), and RoomPosition (RP). The process assesses models based on several combinations of these relationships. [0099] The process evaluates each of these models using the same K-fold approach, removing the orientation information of each object in the validation set, and then using the embodiments of the system to predict the best orientation, keeping the object’s position constant. The process measures the angular distance between the system’s predictions and the original object’s orientation.
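The angular error metric used here is the wrapped absolute difference between two headings. A small self-contained sketch:

```python
import math

def angular_distance(theta_pred: float, theta_true: float) -> float:
    """Smallest absolute angle (radians) between a predicted and a ground-truth orientation."""
    diff = (theta_pred - theta_true) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)

# a prediction 350 degrees away from ground truth is only 10 degrees off
assert math.isclose(angular_distance(math.radians(350), 0.0), math.radians(10))
```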
[0100] The inventors conducted user studies with a designed 3D application based on the prediction system to evaluate the plausibility of the predicted positions and the usefulness of the heat map system of the embodiments. The inventors recruited 40 participants, of which 8 were trained architects. To ensure unbiased results, the participants were randomly divided into 4 groups. Each group of users were shown 5 scenes from each of the 5 levels for a total of 25 scenes. The order these scenes were presented in was randomized for each user and they were not told which level a scene was at.
[0101] The inventors reconstructed 34 3D scenes from the dataset test split, where each scene had one object randomly removed. In this reconstruction process, the inventors performed some simplification and regularized the furniture designs using prefabricated libraries, so that users would evaluate the layout of the room, rather than the design of the object itself, while matching the placement and size of each object. An example of this scene reconstruction and simplification can be seen in Figure 13(a-b).
[0102] The five defined levels test different object placement methods as shown in Figure 13(c-g) to replace the removed object. Levels I and II are both random placements, generated at run time for each user. The Level I system initially places the object in a random position and orientation in the scene. The Level II system places the object in an open random position and orientation, where the placement does not overlap with the room walls or other objects. Levels III and IV use SceneGen predictions. The Level III system places the object in the position and orientation predicted by SceneGen. Level IV also places the object in the predicted position and orientation, but also overlays a probability map. The Level V system places the object at the position it appears in the Matterport3D dataset, i.e., the ground truth. [0103] The inventors recorded the users’ Likert rating of the plausibility of the initial object placement in a scale of 1 to 5 (1 = implausible/random, 3 = somewhat plausible, 5 = very plausible). They also recorded whether the user chose to adjust the initial placement, the Euclidean distance between the initial placement and the final user-chosen placement, the orientation change between the initial orientation and the final user-chosen orientation. The expectation was for higher initial Likert ratings and smaller adjustments to position and orientation for levels initialized by the system than for levels initialized to random positions. [0104] Each participant used an executable application on a desktop computer. The goal of the study was explained to the user and they were shown a demonstration of how to use the interface. For each scene, the user was shown a 3D room and an object that was removed. After inspecting the initial scene and clicking "place object", the object was placed in the scene using the method corresponding to the level of the scene. In Level IV Scenes, the probability heat map was also visualized. The user was shown multiple camera angles and was able to pan, zoom and orbit around the 3D room to evaluate the placement.
[0105] The user was first asked to rate the plausibility of the placement on a Likert scale from 1 to 5. Following this, the user was asked if they wanted to move the object to a new location. If they answered "no", the user progressed to the next scene. If they answered "yes", the UI displayed transformation control handles (position axis arrows, rotation axis circles) to adjust the object's position and orientation. After moving the object to the desired location, the user could save the placement and progress to the next scene. IRB approval was obtained ahead of the experiment.

[0106] In this experiment, the inventors remove objects from test scenes taken from the environmental dataset and replace them using various versions of the model in an ablation study. In Figure 10, the process plots the cumulative distance between the ground truth position and the top position prediction, and in Figure 11, the process plots the cumulative distance between the ground truth position and the nearest of the top 5 position predictions, using the full system and three ablated versions.
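A cumulative curve of the kind plotted in Figures 10 and 11 can be obtained from per-scene prediction errors roughly as in the sketch below; the error values and thresholds are placeholders, not the reported results.

```python
import numpy as np

def cumulative_curve(errors, thresholds):
    """Fraction of test scenes whose prediction error falls at or below each threshold."""
    errors = np.asarray(errors, dtype=float)
    return np.array([(errors <= t).mean() for t in thresholds])

# Hypothetical per-scene distances (meters) between ground truth and top prediction.
errors = [0.2, 0.5, 1.1, 0.3, 2.4, 0.9]
thresholds = np.linspace(0.0, 3.0, 7)
print(cumulative_curve(errors, thresholds))
```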
[0107] The inventors found that the full SceneGen system predicts placements closer to the ground truth than any of the ablated versions, followed by the models using the AverageDist and RoomPosition features (AD+RP), and SurroundedBy and RoomPosition (S+RP). The predictions furthest from the ground truth are generated by using only the RoomPosition (RP) feature. These curves are consistent between the best and the closest of the top 5 predicted positions and indicate that each of the features used for position prediction contributes to the accuracy of the final result.
[0108] In addition, when the top 5 predictions are considered, one can see that each system assessed is able to identify high-probability zones closer to the ground truth. This is supported by the slope of the curves in Figure 11, evaluating the closest of the top 5 predictions, which rise much more sharply than in Figure 10, which uses only the best prediction. This difference supports the importance of predicting multiple locations instead of simply returning the highest-scored sampling location. A room can contain multiple plausible locations for a new object, so the system's most highly scored location will not necessarily be the same as the ground truth position. For this reason, the system returns probabilities across sampled positions using a heat map to show multiple viable predictions for any placement query.

[0109] Table 1, below, shows the mean distance of the position prediction to the ground truth position, separated by object category. It was found that the object categories where the full SceneGen system outperforms its ablations are chairs, storage, and decor. For beds and TVs, SceneGen produces the closest placements of the system versions only when considering the top five predictions. For pictures and tables, SceneGen's top prediction is closest to ground truth, and is only slightly further when comparing the nearest of the top 5 predictions.
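Under the assumption that the model assigns a score to each sampled candidate position, selecting the top predictions and measuring the nearest-of-top-5 distance used above might look like the following sketch; the function names and array shapes are illustrative.

```python
import numpy as np

def top_k_positions(scores: np.ndarray, positions: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the k highest-scoring sampled positions (scores is length N, positions is N x 2)."""
    idx = np.argsort(scores)[::-1][:k]
    return positions[idx]

def nearest_of_top_k(scores, positions, ground_truth, k: int = 5) -> float:
    """Distance from the ground-truth position to the closest of the top-k predictions."""
    top = top_k_positions(np.asarray(scores), np.asarray(positions), k)
    return float(np.linalg.norm(top - np.asarray(ground_truth), axis=1).min())
```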
Table 1. Mean distance from the predicted position to the ground truth position, by object category, for the full SceneGen model and its ablated versions (table reproduced as an image in the original filing).
[0110] As with the position ablation studies, the system assesses the ability of various versions of the model to reorient asymmetric objects from test scenes. In Figure 12, the angular distance between the ground truth orientation and the top orientation prediction is plotted for various versions of the system. The base model includes only Facing (F) and is the lowest performing. The inventors found that the system that also includes the TowardsCenter and RoomPosition features performs best overall. The process uses this system (F+C+RP) in the implementation of SceneGen. The other four versions of the system perform similarly to one another overall.
[0111] Table 2 shows the results of the orientation ablation study separated by object category. In this case, the system with the Facing, TowardsCenter, and RoomPosition features (F+C+RP) outperforms all other versions across all categories except for TVs, where the system that includes Facing, TowardsCenter, and NextTo (F+C+NT) produces the least deviation. In fact, all three of the systems that include either DirectionSimilarity or NextTo predict the orientation of TVs more closely than the overall best-performing system, but perform more poorly on other objects, such as beds, when compared with systems without those features. This suggests that for other datasets, these features could be more effective in predicting orientations.
Table 2. Angular distance between the predicted orientation and the ground truth orientation, by object category, for the versions of the system compared in the orientation ablation study (table reproduced as an image in the original filing).
[0112] Figure 14 shows the distributions of Likert ratings by level. The inventors also ran a one-way ANOVA test on the Likert ratings of initial placements, finding significant differences between all pairs of levels except for Levels IV and V. In other words, the ratings for Level IV's presentation of the prediction system are not significantly different from the ratings for ground truth placements. Across multiple tests, one can see that the Level IV means are significantly different from those of the levels based on randomization, while Level III is only sometimes significantly different from them. As the Level IV presentation of the system can show multiple suggested initial placements, this difference between Levels III and IV could support the conjecture that accounting for multiple "right answer" placements improves the predictions.
[0113] One can also analyze how participants' choices to adjust placements, and the amount moved, varied across different scene levels. Results of this can be seen in Figure 17. A one-way ANOVA test of the distance users moved objects from their initial placements found a significant difference (p = 1.8622e-38) between two groupings of levels: 1) Levels I and II (with higher means), and 2) Levels III, IV, and V (with lower means). [0114] The first group contains the levels with randomized initial placements, while the second group contains the levels that use the prediction system of the embodiments or the ground truth placement. The differentiation in groupings provides support for the plausibility of the position predictions of the embodiments over random placements.
[0115] A one-way ANOVA test was also performed on the overall change in object orientation from the participants' manual adjustments, and found a significant difference (p = 1.8112e-16) between a different pair of level groupings: 1) Levels I, II, and III, and 2) Levels IV and V. Figure 16 shows the distributions of angular distance between the initial object orientation and the final user-chosen orientation for each level. The Level IV and V distributions are most concentrated at no rotation by the user. In Levels I and II, users rotated objects more than half of the time, with an average rotation greater than π/6 radians. The vast majority of objects placed by the Level III, IV, and V systems are not rotated by the user, lending support to the validity of the prediction system of the embodiments.
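A one-way ANOVA of this kind can be reproduced with standard statistical tooling, as in the sketch below; the rating arrays are placeholders rather than the study data.

```python
from scipy.stats import f_oneway

# Placeholder Likert ratings (1-5) for three of the levels; the real study data are not reproduced here.
level_1 = [1, 2, 1, 2, 3]
level_3 = [4, 3, 5, 4, 4]
level_5 = [5, 4, 5, 5, 4]

f_stat, p_value = f_oneway(level_1, level_3, level_5)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```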
[0116] To demonstrate the prediction system embodiments in action, the inventors have implemented an augmented reality application that augments a scene using SceneGen. Users can overlay bounding boxes on the existing furniture to see the object bounds used in the predictions. On inserting a new object into the scene, the user can visualize a probability map to observe potential positions. The augmented reality application of the embodiment consists of five main modules: (i) local semantic segmentation of the room; (ii) local Scene Graph generation; (iii) heat map generation, which runs on an external server; (iv) local data parsing and visualization; and (v) the user interface.
[0117] Semantic segmentation of the room can be done either manually or automatically, using integrated tools available on augmented reality devices. However, not all current AR devices are equipped with depth-sensing capture hardware; for manual semantic segmentation, the augmented reality application therefore allows the user to manually generate and annotate semantic bounding boxes for objects of the target scene. The data acquired are then converted to the proposed spatial Scene Graph, resulting in an abstract representation of the scene. Both the semantic segmentation and graph generation modules are performed locally on the AR device, ensuring the privacy of the user's raw spatial data.

[0118] Once the Scene Graph is generated, it is sent to a local or remote server where the SceneGen engine calculates positional and orientational augmentation probability maps for the target scene. The prediction probability maps for all objects are generated in this step. This approach allows faster computation, since current AR devices have limited computational and memory resources. The results are sent back to the local device, where they can be parsed and visualized using the augmented reality GUI.
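A minimal sketch of this round trip is shown below: the locally generated Scene Graph is serialized, posted to a SceneGen server, and the returned probability maps are parsed on the device. The endpoint URL, payload schema, and response format are illustrative assumptions only, not an API defined by the disclosure.

```python
import requests  # assumes the AR client can issue HTTP requests

def request_heat_maps(scene_graph: dict,
                      server_url: str = "http://scenegen.example/api/augment") -> dict:
    """Send the abstract Scene Graph (no raw scans) and receive probability maps.

    Only the abstract graph leaves the device, preserving the privacy of the raw
    spatial data; the URL and JSON fields here are hypothetical placeholders.
    """
    response = requests.post(server_url, json=scene_graph, timeout=30)
    response.raise_for_status()
    return response.json()  # e.g. {"object": "chair", "position_heatmap": [...], ...}
```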
[0119] The instantiation system can toggle between two modes: Manual and SceneGen. In Manual mode, the object is placed in front of the user, at the intersection of the camera's front-facing vector with the floor. This normally results in augmenting the object in the middle of the screen. While this conventional approach lets the user control the initial placement by determining the pose of the AR camera, in many cases additional movements are necessary to place the object in a plausible final location. In such cases, the user can further move and rotate the object to its desired location. In SceneGen mode, the virtual object is augmented using the prediction of the embodiments of the system, resulting in faster, contextual placements.
[0120] The Scene Graph introduced in the embodiments is designed to capture spatial relationships between objects, object categories, and the room. Overall, it has been found that each of the relationships presented improves the model's ability to place virtual objects in realistic positions in a scene. These relationships are important for understanding the functional purpose of the space in addition to the individual objects.
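As an illustration, the spatial Scene Graph described above might be represented in code along the lines of the following sketch. The class names, fields, and relationship labels follow the relationships named in this disclosure, but the exact structure is an assumption rather than the disclosed implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    label: str                     # e.g. "bed", "chair", "tv"
    position: Tuple[float, float]  # floor-plane position
    orientation: float             # heading in radians (meaningful for Asymmetric objects)
    room_position: str             # "corner", "edge", or "middle"

@dataclass
class Relationship:
    kind: str    # e.g. "SupportedBy", "NextTo", "Facing", "TowardsCenter"
    source: str  # label or id of the related object
    target: str

@dataclass
class SceneGraph:
    room_id: str
    objects: List[SceneObject] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)
```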
[0121] In SceneGen, RoomPosition is used as a feature in predicting both the orientation and the position of objects. While this feature is based solely on the position of the object, where an object is in a room also has a strong impact on its function and how it should be oriented. For example, a chair in a corner of the room is very likely to face towards the center of the room, while a chair in the middle of the room is more likely to face towards a table or a sofa. When analyzing the placement prediction probability maps and the user study results, the inventors observed that the best orientation is not the same at each position. This is affected not only by nearby objects, but also by the sampled position within the room.

[0122] In the evaluation of SceneGen, the inventors found a number of benefits in using an explicit model to predict object placements. One benefit is that if one wants to define a non-standard object to be placed in relation to standard objects by specifying one's own relationship distributions, this is feasible with the system but would not be possible with implicit models. For example, in a collaborative virtual environment where special markers are to be placed near each user, one could specify distributions for relationships such as NextTo chair and Facing table without needing to train them from a dataset.
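A hand-specified prior of the kind described in paragraph [0122] might be expressed as in the sketch below; the object name, keys, and numerical values are purely illustrative assumptions.

```python
# Hypothetical hand-specified priors for a non-standard "user marker" object,
# expressed over the same relationships SceneGen otherwise learns from data.
user_marker_priors = {
    "NextTo": {"chair": 0.9, "table": 0.4},                  # adjacency preferences
    "Facing": {"table": 0.8},                                 # orientation preference
    "RoomPosition": {"middle": 0.6, "edge": 0.3, "corner": 0.1},
}
```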
[0123] Another benefit is that explicit models can be examined directly to understand why objects are being placed where they are. For example, the bed orientation feature distribution based on the environmental dataset priors in Figure 9, marginalized with respect to all variables except TowardsCenter, shows that beds are nearly 5 times as likely to face the center of the room. Marginalizing all features except the position of storage objects shows that a storage object is found in a corner of a room 63% of the time, along an edge 33% of the time, and in the middle of the room in only 4% of occurrences.

[0124] One important consideration in the choice of dataset is that the system aims to learn spatial relationships for real-world scenes. One can imagine idiosyncrasies of lived-in rooms, such as an office chair that is not always tucked into a desk but often left rotated away from it, or a dining table pushed against a wall to create more space in a family room. Using personal living spaces from the environmental dataset as the priors, one can capture these relationships that exist only in real-world, lived-in scenes.
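The marginalization described in paragraph [0123] amounts to collapsing a joint count table over all variables except the one of interest. The sketch below uses placeholder counts; the real priors come from the aggregated Knowledge Model, not from these numbers.

```python
import numpy as np

# Placeholder joint counts over (TowardsCenter, RoomPosition) for beds.
# rows: TowardsCenter in {False, True}; columns: RoomPosition in {corner, edge, middle}
joint_counts = np.array([[ 3, 10,  4],
                         [20, 45, 18]])

towards_center = joint_counts.sum(axis=1)          # marginalize out RoomPosition
p_towards_center = towards_center / towards_center.sum()
print(p_towards_center)  # with these placeholders, beds face the room center ~5x more often than not
```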
[0125] One drawback of using the Matterport3D dataset as the environmental dataset is that it is not as large as some synthetic datasets. In the implementations here, the system groups objects into broader categories to ensure that all object categories are represented well enough to approximate the distribution of a large feature space.
[0126] Another downside of using a real-world dataset is labeling accuracy, as many human errors occur in this labor-intensive process. Such mismatches are unlikely to happen in synthetic datasets, where the geometry is already assigned in a digital format. To mitigate some of these concerns, the inventors developed a labeling application that allows determination of the correct orientation of each object and also filters out rooms with corrupted scans and inaccurate labels.
[0127] Where and how an object is placed in a scene is often very subjective, and preferences can differ between users. This is demonstrated by the Likert plausibility ratings of the Level V reference scenes in the user studies. Figures 14 and 15 show that some users would only give scores of somewhat plausible to scenes modeled from real-world ground truth rooms of the environmental dataset. This supports providing a heat map of probabilities for each sampled placement, as alternate high-probability positions may be preferable to different users.

[0128] The results also indicate that most users prefer Level IV scenes, with the heat map, over Level III scenes, even though the placements use the same SceneGen models. This suggests that the inclusion of the heat map guides users towards the system's placement and may help convince them of the viability of, and reasoning behind, such a choice.

[0129] It can also be seen that some users move objects to other high-probability alternatives, as seen in Figure 18. This is similar to the position prediction experiment, which compares the ground truth position to the closest of SceneGen's top 5 predictions and shows that while the reference position may not always be the top prediction, it is often one of the top predictions. Moreover, the results in Figure 15 show that the subjectivity of an object placement is highly dependent on the size and type of the object itself. In any room, there are very few natural places to put a bed; hence the results for placing beds cluster in one or two high-probability locations. Other objects, such as decor, are more likely to be subject to user preferences.
[0130] Figure 19 shows a demonstration of SceneGen adding virtual objects to a scene. The top left shows the target scene. A TV is added in the top right, a table in the middle left, and a chair in the middle right. The bottom frame shows the augmented scene on the computer compared to the actual scene.
[0131] Figure 20 shows an overall system diagram for one embodiment of a system that operates to insert virtual objects into scenes. The scene in this embodiment is a room 80, but as mentioned previously, the scene could be any defined area, including open areas. Depth sensors such as 82 could scan the scene, or it could be scanned by the AR headset 84 worn by a user 86. The user may or may not be in the scene. The user may have a computing device 88 that includes a user interface 90, a processor 94, and a memory 92. This device may allow the user to use the annotation tool previously mentioned, or to view the virtual object on the screen as shown in Figure 19.
[0132] The room sensors, and/or the headset, and the user's device connect to the computing system that performs the methods described above. This computing system may have one or more processors such as 102 and a memory 108; the methods may be embodied as computer code stored in the memory 108 to be executed by the processor. The computing system may include a database such as 110 that contains the datasets and also stores new datasets generated by the depth sensors in the room. One should note that these components may all be included in a single computing device or distributed among the user's device, multiple servers, multiple databases, etc. Advances in computing devices may reach a point where one device contains all of the processing power and memory needed to implement these processes.

[0133] The embodiments introduce a framework to augment scenes with one or more virtual objects using an explicit generative model trained on spatial relationship priors. Scene Graphs from a dataset of scenes are aggregated into a Knowledge Model and used to train a probabilistic model. This explicit model allows for direct analysis of the learned priors and allows users to input custom relationships to place non-standard objects alongside traditional objects. SceneGen places the object in the highest-probability pose and also offers alternate, highly likely placements.
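A minimal sketch of the training step summarized in paragraph [0133] is given below: aggregated per-object feature vectors are used to fit an explicit density that can later be evaluated over candidate poses. The use of a Gaussian kernel density estimate is an assumption for illustration; the disclosure requires only an explicit probabilistic model, not this particular estimator.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_placement_model(feature_vectors: np.ndarray) -> gaussian_kde:
    """Fit an explicit density over per-object feature vectors (one row per object).

    Assumes there are more samples than feature dimensions; gaussian_kde expects
    data with shape (n_features, n_samples), hence the transpose.
    """
    return gaussian_kde(feature_vectors.T)

def score_candidates(model: gaussian_kde, candidate_features: np.ndarray) -> np.ndarray:
    """Evaluate the learned density at each sampled candidate pose's feature vector."""
    return model(candidate_features.T)
```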
[0134] The system implements SceneGen using an environmental dataset such as Matterport3D, a dataset composed of 3D scans of lived-in rooms, in order to understand object relationships in a real-world setting, but can also use other scans of real-world scenes. The features that SceneGen extracts to build the Scene Graph are assessed through an ablation study, identifying how each feature contributes to the model's ability to predict realistic object placements. User studies also demonstrate that SceneGen is able to augment scenes in a much more plausible way than systems that place objects randomly or in open spaces. The inventors also found that different users have their own preferences for where an object should be placed. Suggesting multiple high-probability possibilities through a heat map gives users an intuitive visualization of the augmentation process.
[0135] Moreover, SceneGen is a framework that naturally fits into spatial computing applications. This has been demonstrated in an augmented reality application that augments a scene with a virtual object using SceneGen. Contextual scene augmentation can be useful in augmenting collaborative mixed reality environments or in other design applications, and using this framework allows for fast and realistic scene and content generation. Further modifications plan to improve the framework by providing the option to contextually augment non-standard objects by parameterizing topological relationships, a feature that would facilitate content creation for future spatial computing workflows.
[0136] It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the embodiments.

Claims

WHAT IS CLAIMED IS:
1. A method of augmenting scenes with virtual objects, comprising:
accessing a target scene;
extracting features from the target scene into a spatial scene graph representation of the target scene;
generating one or more datasets from the spatial scene graph representation;
using a machine learning system to iteratively operate on the one or more datasets by sampling positions and orientations in the target scene to create a probability map for placement of a virtual object in the scene; and
predicting a viable placement for the virtual object in the target scene, producing a final scene.
2. The method as claimed in claim 1, further comprising rendering the target scene containing the virtual object on a display.
3. The method as claimed in claim 1, wherein accessing a target scene comprises scanning a real-world scene with one or more augmented reality sensors.
4. The method as claimed in claim 3, wherein scanning a real-world scene with one or more augmented reality sensors comprises scanning a real-world scene with depth sensors in the room.
5. The method as claimed in claim 3, wherein scanning a real-world scene with one or more augmented reality sensors comprises scanning a real-world scene with sensors in an augmented reality headset.
6. The method as claimed in claim 1, wherein accessing a target scene comprises using data from one of either a synthetic or real-world dataset.
7. The method as claimed in claim 1, wherein extracting features from the scene comprises capturing position and orientational relationships of objects to other objects, to object groups, and to the scene.
8. The method as claimed in claim 7, wherein capturing position and orientational relationships comprises identifying a position of each object in the scene, a distance between each object and other objects, and any support relationships.
9. The method as claimed in claim 8, wherein capturing position and orientational relationships comprises identifying a support relationship between two objects when an object is either underneath or on top of another object.
10. The method as claimed in claim 7, wherein capturing position and orientational relationships comprises grouping objects into one of the groups comprising Asymmetric, Symmetric, and Inside Facing.
11. The method as claimed in claim 10, wherein grouping objects further comprises creating data vectors for each object, wherein the data vectors for Asymmetric objects also include orientational features.
12. The method as claimed in claim 1, further comprising receiving annotations from an annotation tool to indicate a pose for an object in the target scene.
13. The method as claimed in claim 1, wherein using the machine learning system further comprises:
developing a knowledge model using either the one or more datasets or manual inputs of the features by a user;
training the knowledge model using the one or more datasets;
using the trained knowledge model to estimate a probability distribution function as the probability map; and
using the probability distribution function to predict the viable placement for the virtual object.
14. The method as claimed in claim 13, further comprising performing data reduction.
15. The method as claimed in claim 13, wherein training the knowledge model comprises evaluating model performance using ablation studies.
PCT/US2021/038948 2020-06-25 2021-06-24 Contextual augmentation using scene graphs WO2021263018A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063043904P 2020-06-25 2020-06-25
US63/043,904 2020-06-25

Publications (1)

Publication Number Publication Date
WO2021263018A1 true WO2021263018A1 (en) 2021-12-30

Family

ID=79281853

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/038948 WO2021263018A1 (en) 2020-06-25 2021-06-24 Contextual augmentation using scene graphs

Country Status (1)

Country Link
WO (1) WO2021263018A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040003042A1 (en) * 2001-06-28 2004-01-01 Horvitz Eric J. Methods and architecture for cross-device activity monitoring, reasoning, and visualization for providing status and forecasts of a users' presence and availability
US20090128564A1 (en) * 2007-11-15 2009-05-21 Canon Kabushiki Kaisha Image processing apparatus and image processing method
US20150302655A1 (en) * 2014-04-18 2015-10-22 Magic Leap, Inc. Using a map of the world for augmented or virtual reality systems
US20150350563A1 (en) * 2000-06-19 2015-12-03 Comcast Ip Holdings I, Llc Method and Apparatus for Targeting of Interactive Virtual Objects
US20150356774A1 (en) * 2014-06-09 2015-12-10 Microsoft Corporation Layout design using locally satisfiable proposals
US20180045963A1 (en) * 2016-08-11 2018-02-15 Magic Leap, Inc. Automatic placement of a virtual object in a three-dimensional space
US20190188915A1 (en) * 2007-09-25 2019-06-20 Apple Inc. Method and apparatus for representing a virtual object in a real environment



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21829547

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21829547

Country of ref document: EP

Kind code of ref document: A1