EP3867883A1 - 3d scene synthesis techniques using neural network architectures - Google Patents
Info
- Publication number
- EP3867883A1 (application EP19873600.1A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- scene
- neural network
- objects
- information
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/10—Geometric effects
- G06T15/20—Perspective computation
- G06T15/205—Image-based rendering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/04—Architectural design, interior design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/61—Scene description
Definitions
- This disclosure is related to techniques for synthesizing three- dimensional (3D) scenes from two-dimensional (2D) images utilizing neural networks and/or artificial intelligence (Al) algorithms.
- The task of synthesizing a 3D scene from a 2D image is very difficult and complex. This is due, at least in part, to the fact that a significant amount of visual information is lost when a 3D environment is captured in a 2D image.
- One potential technique for addressing this loss of visual information involves the use of depth images to synthesize a 3D scene.
- The depth images contain distance information relating to objects captured in the 2D image that can be used to synthesize the 3D scene.
- In many scenarios, however, depth images are not available and, therefore, cannot be used to synthesize the 3D scene.
- Figure 1 is a block diagram of a system in accordance with certain embodiments.
- Figure 2 is a block diagram of an exemplary scene generation system in accordance with certain embodiments.
- Figure 3 is a diagram illustrating an exemplary architecture for a scene generation system in accordance with certain embodiments.
- Figure 4 is a flow chart illustrating an exemplary method or technique in accordance with certain embodiments.
- A scene generation system is configured to receive two-dimensional (2D) images and to synthesize 3D scenes corresponding to the 2D images.
- A neural network architecture comprising one or more neural networks can be configured to analyze the 2D images to detect objects, classify scenes and objects, and determine degree of freedom (DOF) information for objects in the 2D images.
- The neural network architecture can perform these tasks, at least in part, by utilizing inter-object and object-scene dependency information that is learned by the neural network architecture.
- The inter-object and object-scene dependency information captures, inter alia, the spatial correlations and dependencies among objects in the 2D images, as well as the correlations and relationships of objects to the scenes associated with the 2D images.
- A 3D scene synthesizer can utilize the knowledge from the neural network architecture to synthesize 3D scenes without using depth images for the 2D images.
- The neural network architecture utilized by the scene generation system includes at least two neural networks, each of which is trained using a set of training images that are augmented or annotated with ground truth information.
- A first neural network can be trained to capture the inter-object and object-scene dependency information and to perform functions associated with object detection and scene classification.
- A second neural network can be trained to perform functions associated with calculating or estimating DOF information for the objects in the scenes of the 2D images.
- Each of the two neural networks may be implemented as a convolutional neural network (CNN) or other similar type of neural network.
- In certain embodiments, the first neural network can be implemented as a convolutional long short-term memory network (Conv LSTM) and the second neural network can be implemented as a regression convolutional neural network (regression ConvNet).
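The gate arithmetic underlying such an LSTM can be sketched as follows. This is an illustrative NumPy sketch, not the patent's implementation: in a Conv LSTM the dense matrix products below would be replaced by 2D convolutions over feature maps, and all names here are hypothetical.

```python
import numpy as np

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One step of an LSTM cell. In a Conv LSTM, the dense products
    W @ x and U @ h_prev become 2D convolutions over feature maps."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ x + U @ h_prev + b       # stacked gate pre-activations
    i, f, o, g = np.split(z, 4)      # input, forget, output, candidate gates
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # long-term cell state
    h = sigmoid(o) * np.tanh(c)      # short-term hidden state
    return h, c
```

The cell state `c` carries the long-term memory that lets the network accumulate dependency information across the sequence of inputs, while `h` is the short-term output passed to the next layer.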
- The information extracted by the first neural network and the second neural network can be used jointly to provide inferences related to the 3D scenes being synthesized.
- The extracted information can be used to make inferences that describe poses, positions, dimensions, and/or other parameters for the objects within the 3D scenes being synthesized. This information can then be utilized by the 3D scene synthesizer to generate 3D scenes corresponding to the 2D images.
- The technologies discussed herein can be used in a variety of different environments.
- For example, the technologies can be used to synthesize 3D scenes for use in applications and/or devices associated with virtual reality, intelligent robot navigation (e.g., intelligent vacuum devices), interior design, computer vision, surveillance, and/or visibility analysis. While some of the examples described herein pertain to synthesizing 3D scenes for indoor environments, it should be recognized that the techniques can be used to generate 3D scenes for any environment, including both indoor and outdoor environments.
- The embodiments described herein provide a variety of advantages over conventional scene generation techniques.
- One significant advantage is the ability to generate 3D scenes corresponding to 2D images without the use of depth information pertaining to the objects in the 2D images.
- Another significant advantage is that the scene generation techniques described herein are able to detect objects, classify objects and scenes, and determine DOF information with greater accuracy in comparison to conventional techniques.
- A further advantage is that the scene generation techniques described herein are able to synthesize the 3D scenes in a manner that significantly reduces errors in comparison to conventional techniques.
- The 3D scene synthesis techniques set forth in this disclosure are rooted in computer technologies that overcome existing problems in known scene generation systems, including problems dealing with the loss of information associated with capturing real-world 3D environments in 2D images. These techniques describe a technical solution (e.g., one that utilizes various AI-based and/or neural network-based techniques) for overcoming such limitations.
- The scene generation system described herein can take advantage of novel AI and machine learning techniques to train neural networks to extract and infer information from 2D images that can be used to synthesize corresponding 3D environments associated with the 2D images.
- This technology-based solution marks an improvement over existing computing capabilities and functionalities related to 3D scene generation, and does so in a manner that improves the accuracy of the synthesized scenes.
- In one embodiment, a system for synthesizing a 3D scene comprises: one or more computing devices comprising one or more processors and one or more non-transitory storage devices for storing instructions, wherein execution of the instructions by the one or more processors causes the one or more computing devices to: access a 2D image comprising a scene; execute a first neural network that is configured to determine a first set of inferences associated with detecting objects in the scene and classifying the scene; execute a second neural network that is configured to determine a second set of inferences associated with determining degree of freedom information for the objects in the scene; and synthesize a 3D scene that corresponds to the scene included in the 2D image using the first set of inferences provided by the first neural network and the second set of inferences provided by the second neural network.
- In another embodiment, a method for synthesizing a 3D scene comprises: accessing a 2D image comprising a scene; executing a first neural network that is configured to determine a first set of inferences associated with detecting objects in the scene and classifying the scene; executing a second neural network that is configured to determine a second set of inferences associated with determining degree of freedom information for the objects in the scene; and synthesizing a 3D scene that corresponds to the scene included in the 2D image using the first set of inferences provided by the first neural network and the second set of inferences provided by the second neural network.
- In a further embodiment, a computer program product for synthesizing a 3D scene comprises a non-transitory computer-readable medium including codes for causing a computer to: access a 2D image comprising a scene; execute a first neural network that is configured to determine a first set of inferences associated with detecting objects in the scene and classifying the scene; execute a second neural network that is configured to determine a second set of inferences associated with determining degree of freedom information for the objects in the scene; and synthesize a 3D scene that corresponds to the scene included in the 2D image using the first set of inferences provided by the first neural network and the second set of inferences provided by the second neural network.
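The three embodiments above describe the same underlying pipeline. A minimal sketch of that flow, with the two trained networks and the synthesizer passed in as hypothetical callables (none of these names come from the patent):

```python
def synthesize_3d_scene(image_2d, first_network, second_network, scene_builder):
    """Claimed pipeline: run both networks on the 2D image, then hand
    their joint inferences to the scene synthesizer. All callables are
    placeholders for the trained components."""
    # First set of inferences: detected objects plus a scene classification.
    first_inferences = first_network(image_2d)
    # Second set of inferences: degree-of-freedom info for each detected object.
    second_inferences = second_network(image_2d, first_inferences["objects"])
    # Joint use of both inference sets to synthesize the 3D scene.
    return scene_builder(first_inferences, second_inferences)
```

Note that depth images appear nowhere in this flow; the pipeline operates on the 2D image alone.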
- Any aspect or feature that is described for one embodiment can be incorporated into any other embodiment mentioned in this disclosure.
- Any of the embodiments described herein may be hardware-based, may be software-based, or may comprise a mixture of both hardware and software elements.
- Although the description herein may describe certain embodiments, features, or components as being implemented in software or hardware, it should be recognized that any embodiment, feature, or component described in the present disclosure may be implemented in hardware and/or software.
- Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
- A computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device.
- The medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
- The medium may include a computer-readable storage medium, such as a semiconductor or solid-state memory, a magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, an optical disk, etc.
- A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus.
- The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories that provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution.
- I/O devices including, but not limited to, keyboards, displays, pointing devices, etc. may be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.
- Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
- Figure 1 illustrates an exemplary system 100 according to certain embodiments.
- A scene generation system 150 can be stored on, and executed by, one or more servers 120.
- The scene generation system 150 comprises one or more images 130, an image analysis component 140, a 3D scene synthesizer 160, one or more synthesized 3D scenes 170, and one or more 3D applications 180.
- The scene generation system 150 and/or the one or more servers 120 are in communication with one or more computing devices 110 over a network 190.
- The scene generation system 150 can be configured to perform any and all functions described herein with respect to analyzing images 130, extracting information from images 130, generating synthesized 3D scenes 170, and/or incorporating synthesized 3D scenes into 3D applications 180.
- The scene generation system 150 can be configured to receive images 130 (e.g., which may correspond to monocular images and/or 2D images), and to generate synthesized 3D scenes 170 corresponding to the images 130.
- The scene generation system 150 generates the synthesized 3D scenes 170 without relying on, or otherwise utilizing, depth information and/or depth images.
- The synthesized 3D scenes 170 can be integrated into various types of 3D applications 180 (e.g., applications associated with virtual reality, robot navigation, interior design, computer vision, etc.).
- The scene generation system 150 can be configured to generate synthesized 3D scenes 170 for any type of indoor scene. This can include, but is not limited to, generating synthesized 3D scenes 170 corresponding to rooms located within a residential building, a commercial building, an industrial building, and/or other types of buildings or structures. For example, in response to receiving an image 130 corresponding to a bedroom that includes various furniture items (e.g., a bed, a dresser, a nightstand, etc.), the scene generation system 150 can generate a synthesized 3D scene 170 corresponding to the image 130 of the bedroom (e.g., that includes a 3D rendering of the bedroom showing placement of the furniture items according to the layout or floorplan in the image).
- Likewise, in response to receiving an image 130 corresponding to a bathroom that includes various fixtures, the scene generation system 150 can generate a synthesized 3D scene 170 corresponding to the image 130 of the bathroom (e.g., that includes a 3D rendering of the bathroom showing placement of the fixtures according to the layout or floorplan in the image).
- Similarly, in response to receiving an image 130 corresponding to a room that includes equipment and/or office items, the scene generation system 150 can generate a synthesized 3D scene 170 corresponding to the image 130 of the room (e.g., that includes a 3D rendering of the room showing placement of the equipment and/or office items according to the layout or floorplan in the image).
- The scene generation system 150 can additionally, or alternatively, be configured to generate synthesized 3D scenes 170 for various outdoor scenes.
- For example, the images 130 received by the scene generation system 150 can be utilized to generate synthesized 3D scenes 170 corresponding to parks, parking lots, driveways, yards, streets, buildings (e.g., residential, commercial, and/or industrial buildings), etc.
- The images 130 can be transmitted to and/or retrieved by the scene generation system 150.
- The scene generation system 150 can store the images 130 (e.g., in a database).
- The images 130 may represent digital representations corresponding to photographs, pictures, sketches, drawings, and/or the like.
- The images 130 may initially be captured by recording light or other electromagnetic radiation electronically (e.g., using one or more image sensors) or chemically (e.g., using light-sensitive materials or films). Any images 130 that are not originally created in a digital format can be converted to a digital format using appropriate conversion devices (e.g., image scanners and optical scanners).
- The images 130 can represent 2D images and/or monocular RGB (red-green-blue) images. Because the techniques described herein do not require depth information to generate synthesized 3D scenes 170, the images 130 are not required to include depth information and/or comprise depth images. However, in certain embodiments, such depth information and/or depth images, if available, may be utilized to supplement the techniques described herein.
- The image analysis component 140 can be configured to perform functions associated with analyzing the images 130 and/or extracting information from the images 130. Generally speaking, the image analysis component 140 can extract any type of information from the images 130 that can be used to generate synthesized 3D scenes 170 from the images 130. For example, the image analysis component 140 can be configured to extract information for identifying and detecting objects (e.g., desks, beds, toilets, manufacturing equipment, persons, animals, pets, household appliances, fixtures, vehicles, and/or any other objects) in the images 130, identifying scenes, classifying objects and scenes captured in the images 130, and determining degree of freedom (DOF) information for objects captured in the images 130.
- The image analysis component 140 can further be configured to extract and/or determine inter-object and object-scene dependency information (e.g., such as the dependency information 254 in Figure 2).
- This dependency information extracted from training images can include data associated with spatial dependencies between objects in the images 130, as well as relationships between the objects and the scene. As explained in further detail below, this dependency information can be very useful for analyzing a scene captured in an image 130 and generating a synthesized 3D scene 170.
- The dependency information can initially be extracted by the image analysis component 140 during a training phase that enables the image analysis component 140 to learn and use the inter-object and object-scene dependency information.
- The dependency information can also be extracted from images 130 during testing and operational phases (e.g., to enable the image analysis component 140 to refine and/or update the learned dependency information).
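As a rough illustration of what inter-object dependency statistics can look like, the sketch below counts object-category co-occurrences across annotated training scenes. The data layout is assumed for illustration and is not taken from the patent; the learned dependency information in the actual system is captured by the neural network rather than by explicit counting.

```python
from collections import Counter
from itertools import combinations

def object_cooccurrence(annotated_scenes):
    """Count how often pairs of object categories appear in the same
    scene -- a crude stand-in for learned inter-object dependencies."""
    pair_counts = Counter()
    for scene in annotated_scenes:
        # Deduplicate and sort labels so each unordered pair is counted once.
        labels = sorted({obj["label"] for obj in scene["objects"]})
        for pair in combinations(labels, 2):
            pair_counts[pair] += 1
    return pair_counts
```

Statistics of this kind reflect, for example, that beds and nightstands tend to appear together in bedroom scenes, which is the sort of regularity the dependency information captures.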
- The image analysis component 140 can utilize a neural network architecture comprising one or more neural networks to analyze, extract, and/or infer information from the images 130.
- The image analysis component 140 can include a first neural network that is configured to extract information from the images 130 (e.g., including, inter alia, the dependency information), detect objects in the scenes captured in the images 130, and classify both the objects and the scenes.
- The image analysis component 140 can also include a second neural network that is configured to estimate and/or calculate DOF information for the objects in the scenes.
- Both the first and second neural networks can be trained, at least in part, using a set of training images (e.g., which may be included in the images 130) that are annotated or associated with ground truth information (e.g., which can include information identifying objects in an image, a scene associated with an image, DOF information associated with objects in an image, etc.).
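The patent does not specify the regression network's training objective; a common choice for this kind of DOF regression is a mean squared error between predicted and annotated DOF vectors, sketched here purely for illustration.

```python
def dof_regression_loss(predicted_dof, ground_truth_dof):
    """Mean squared error between a predicted DOF vector and its
    ground-truth annotation (an assumed objective, for illustration)."""
    if len(predicted_dof) != len(ground_truth_dof):
        raise ValueError("DOF vectors must have the same length")
    return sum((p - g) ** 2
               for p, g in zip(predicted_dof, ground_truth_dof)) / len(predicted_dof)
```

During training, minimizing this quantity over the annotated training images drives the regression network's DOF estimates toward the ground truth.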
- The 3D scene synthesizer 160 can be configured to perform any functions associated with generating, synthesizing, and/or creating the synthesized 3D scenes 170.
- The 3D scene synthesizer 160 may utilize any information produced by the image analysis component 140 to create the synthesized 3D scenes 170.
- The information and inferences produced and/or generated by the first and second neural networks can be utilized jointly to create the synthesized 3D scenes 170.
- The manner in which the synthesized 3D scenes 170 are created and/or rendered can vary.
- The 3D scene synthesizer 160 can utilize one or more 3D models (e.g., 3D models of a room or other location having specific dimensions) to assist with the creation of the synthesized 3D scenes 170.
- Objects can be inserted into the 3D models according to the object parameters (e.g., dimensions, location, pose, DOF values, object labels, etc.) that are derived by the image analysis component 140.
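One way to realize such an insertion is to turn each object's derived parameters into an oriented bounding box in the 3D model's coordinate frame. The parameterization below (center position, width/height/depth dimensions, and a yaw angle about the vertical axis) is a hypothetical example of such object parameters, not the patent's actual scheme.

```python
import numpy as np

def object_box_corners(center, dims, yaw):
    """Eight world-space corners of an object's oriented bounding box,
    built from derived parameters: center position, (w, h, d) dimensions,
    and a yaw rotation about the vertical (y) axis."""
    w, h, d = dims
    # Corners of an axis-aligned box centered at the origin.
    local = np.array([[sx * w / 2, sy * h / 2, sz * d / 2]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # yaw about the y axis
    return local @ rot_y.T + np.asarray(center, dtype=float)
```

A scene builder would place a mesh of the classified object category at this box, which is the kind of step the Three.js/WebGL tooling mentioned below can then render.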
- The 3D scene synthesizer 160 can utilize the Three.js JavaScript library and/or WebGL (Web Graphics Library) to assist with generating the synthesized 3D scenes 170.
- Other types of 3D creation tools can also be used to create the synthesized 3D scenes 170.
- The synthesized 3D scenes 170 can generally represent any type of 3D representation, 3D rendering, 3D image, digital 3D environment, and/or the like.
- The synthesized 3D scenes 170 can be derived from, and correspond to, scenes that are captured in the images 130 (e.g., 2D images and/or monocular images).
- The type and content of the synthesized 3D scenes 170 can vary greatly.
- The synthesized 3D scenes 170 can correspond to any indoor and/or outdoor environment.
- The synthesized 3D scenes 170 can include 3D representations corresponding to rooms or locations included inside of enclosed structures (e.g., houses, restaurants, offices, manufacturing plants, residential buildings, commercial buildings, industrial buildings, garages, sheds, etc.). In certain embodiments, the synthesized 3D scenes 170 can also include 3D representations corresponding to outdoor locations (e.g., parks, streets, landmarks, backyards, playgrounds, etc.).
- The synthesized 3D scenes 170 can be generated to include one or more objects corresponding to objects that are captured in the images 130.
- Generally, any type of object may be inserted into the synthesized 3D scenes 170.
- For example, a synthesized 3D scene 170 for a bedroom may include objects corresponding to a bed, dresser, and/or other bedroom objects captured in a corresponding image 130 utilized to create the synthesized 3D scene 170.
- Likewise, a synthesized 3D scene 170 for a playground may include objects corresponding to a swing set, a basketball hoop, sports equipment, etc.
- The types of objects inserted into a synthesized 3D scene 170 can vary greatly based on the scene associated with a corresponding image 130 and/or based on the objects detected in the corresponding image 130.
- Any synthesized 3D scenes 170 created, or otherwise used, by the scene generation system 150 can be stored in one or more databases (e.g., database 210 in Figure 2).
- the synthesized 3D scenes 170, and functionality provided by the scene generation system 150 can be utilized in connection with, and/or integrated with, various types of 3D applications 180.
- the 3D applications 180 can be executed by, or otherwise associated with, various types of devices and apparatuses (e.g., computing devices 1 10, specialized robots, virtual reality equipment, etc.).
- the 3D applications 180 can include applications associated with providing virtual reality experiences.
- the synthesized 3D scenes 170 can be utilized to create virtual reality content for games, training, education, entertainment, and/or other purposes.
- the 3D applications 180 can include applications for intelligent robots or devices.
- intelligent robots can be configured to perform various functions (e.g., vacuum functions, manufacturing functions, assembly line functions, etc.), and integrating the synthesized 3D scenes 170 into these intelligent robot systems can assist the robots with navigating around a particular room, location, and/or environment.
- the 3D applications 180 can include applications that assist with interior design.
- the synthesized 3D scenes 170 can be integrated into a digital design application that assists designers or customers with planning a layout for a room or location (e.g., that enables designers to visualize placement of furniture, wall-mounted picture frames, etc.).
- the 3D applications 180 can include applications associated with surveillance systems and/or visibility analysis systems.
- the synthesized 3D scenes 170 can be integrated into a surveillance system or visibility analysis system to determine an optimal placement of surveillance cameras and/or other objects.
- the 3D applications 180 can include applications associated with providing holograms.
- the synthesized 3D scenes 170 can be utilized to generate and/or render holograms.
- the 3D applications 180 can include many other types of applications, and can be utilized by various devices, apparatuses, equipment, robots, and/or systems.
- the queries and/or requests to generate synthesized 3D scenes 170 from the images 130 can be submitted directly to the scene generation system 150 (e.g., via one or more input devices attached to the one or more servers 120 hosting the scene generation system 150).
- the requests can additionally, or alternatively, be submitted by one or more computing devices 110.
- a plurality of computing devices 110 may be connected to the scene generation system 150 to enable remote individuals to access the scene generation system 150 over a network 190.
- the servers 120 and/or computing devices 110 may present information, functions, and/or interfaces (e.g., graphical user interfaces) that enable individuals to provide the images 130, submit requests to generate synthesized 3D scenes 170, view the synthesized 3D scenes 170, edit the synthesized 3D scenes 170 (e.g., adding, editing, deleting, and/or moving objects in the scenes), store the synthesized 3D scenes 170, manage access to the synthesized 3D scenes 170, integrate the synthesized 3D scenes 170 into 3D applications 180, and/or perform other related functions.
- the computing devices 110 may represent desktop computers, laptop computers, mobile devices (e.g., smart phones, personal digital assistants, tablet devices, vehicular computing devices, or any other device that is mobile in nature), and/or other types of computing devices.
- the scene generation system 150 is stored on one or more servers 120.
- the one or more servers 120 may generally represent any type of computing device, including any of the computing devices 110 mentioned above.
- the one or more servers 120 comprise one or more mainframe computing devices that execute web servers capable of communicating with the computing devices 110 and/or other devices over the network 190.
- the network 190 may represent any type of communication network, such as one that comprises a local area network (e.g., a Wi-Fi network), a personal area network (e.g., a Bluetooth network), a wide area network, an intranet, the Internet, a cellular network, a television network, and/or other types of networks.
- All the components illustrated in Figure 1 can be configured to communicate directly with each other and/or over the network 190 via wired or wireless communication links, or a combination of the two.
- the computing devices 110 and servers 120 can also be equipped with one or more transceiver devices, one or more computer storage devices (e.g., RAM, ROM, PROM, SRAM, etc.), and one or more processing devices (e.g., a central processing unit) that are capable of executing computer program instructions.
- the computer storage devices can be physical, non-transitory mediums.
- FIG. 2 is a block diagram of an exemplary scene generation system 150 in accordance with certain embodiments of the present invention.
- the scene generation system 150 includes one or more storage devices 201 that are in communication with one or more processors 202.
- the one or more storage devices 201 can include: (i) non-volatile memory, such as, for example, read-only memory (ROM) or programmable read-only memory (PROM); and/or (ii) volatile memory, such as, for example, random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), etc.
- storage devices 201 can comprise (i) non-transitory memory and/or (ii) transitory memory.
- the one or more processors 202 can include one or more central processing units (CPUs), controllers, microprocessors, digital signal processors, and/or computational circuits.
- the one or more storage devices 201 store, inter alia, data and instructions associated with one or more databases 210, an image analysis component 140, a first neural network 220, a second neural network 230, a 3D scene synthesizer 160, and one or more 3D applications 180.
- the one or more processors 202 are configured to execute instructions associated with these components, and any other components, that are stored on the one or more storage devices 201. Each of these components is described in further detail below.
- the database 210 stores a plurality of images 130, extracted information 250, and/or ground truth information 260.
- the images 130 can include both training images (e.g., that are annotated or augmented with ground truth information 260) and images that are used during testing or operational phases.
- the extracted information 250 may generally include any information and/or data that can be extracted from the images 130 including, but not limited to, scene information 251, object information 252, degree of freedom (DOF) information 253, and dependency information 254.
- the scene information 251 can include any data associated with detecting, identifying, and/or classifying a scene captured in an image 130.
- the scene information 251 may indicate that a scene captured in an image 130 corresponds to a particular type of indoor location or outdoor location.
- exemplary scene information 251 may indicate that an image 130 corresponds to a scene associated with a bedroom, bathroom, yard, garage, etc.
- the object information 252 can include any data associated with detecting, identifying, and/or classifying an object captured in an image 130.
- the object information 252 may indicate that a scene captured in an image 130 includes a specific type of object. Nearly any type of object can be included in the scene captured in an image 130.
- exemplary object information 252 may indicate that an image 130 includes objects corresponding to various types of inanimate articles (e.g., beds, desks, windows, tools, appliances, industrial equipment, curtains, sporting equipment, fixtures, vehicles, etc.), living things (e.g., human beings, animals, etc.), structures (e.g., buildings, landmarks, etc.), and/or the like.
- the object information 252 can also include data indicating parameters (e.g., dimensions, coordinates, locations, etc.) associated with the objects.
- the DOF information 253 can include any data associated with detecting, determining, estimating, and/or measuring degree of freedom data associated with objects in images 130 and/or any other data associated with the movement capabilities of objects captured in images 130.
- the DOF information 253 can include six degrees of freedom (6-DOF) information, or a portion thereof, associated with the objects. This 6-DOF information can be used to indicate the freedom of movement of an object in a 3D space.
- the DOF information 253 can be useful for placing the objects in a 3D space during the generation of the synthesized 3D scenes 170.
- the dependency information 254 can include any data associated with detecting, determining, estimating, and/or indicating the relationships among objects in the images 130 and/or relationships among objects and scenes.
- the dependency information 254 can include inter-object dependency information that indicates the spatial dependencies or relationships between objects included in an image 130.
- the dependency information 254 can also include object-scene dependency information that indicates relationships between objects in a scene and the scene itself.
- the dependency information 254 can be learned during a training phase (e.g., using training images that include ground truth information 260).
- the dependency information 254 can be very useful for analyzing a scene captured in an image 130. For example, given an image of a bedroom, the detection of an object corresponding to a “bed” can increase the probability that a “nightstand” will be detected in neighboring locations adjacent to the bed. Likewise, the fact that a scene captured in an image 130 corresponds to a bedroom increases the probability that the scene will include objects corresponding to a bed, nightstand, dresser, and/or other common bedroom items.
- the inter-object and object-scene relations captured in the dependency information 254 can be used in various ways to analyze the image 130 (e.g., to detect and classify both objects and scenes, to ascertain object parameters, etc.) and to generate a synthesized 3D scene 170 corresponding to the image 130.
- the image analysis component 140 can be configured to perform various functions associated with analyzing the images 130 and/or extracting information from the images 130. This can include functions associated with generating, identifying, and/or detecting the aforementioned extracted information 250, such as the scene information 251, object information 252, DOF information 253, and dependency information 254.
- the image analysis component 140 can include one or more neural networks (e.g., such as the first neural network 220 and the second neural network 230) for determining and/or utilizing the extracted information 250.
- a first neural network 220 can be configured to provide inferences and/or determinations on various scene understanding tasks, including tasks associated with detecting objects and classifying both objects and scenes.
- the first neural network 220 can be configured to perform any or all of the following tasks: detecting objects and scenes, classifying objects and scenes, capturing dependencies and relationships among objects and scenes (e.g., such as inter-object and object-scene dependency information), mapping objects in a scene to semantic object labels and semantic scene labels, and/or other related functions. Performance of some or all of these tasks and/or other tasks can involve extracting and/or utilizing the scene information 251, object information 252, and/or dependency information 254.
- a second neural network 230 can be configured to provide inferences and/or determinations for various scene understanding tasks, including tasks associated with estimating and/or determining DOF information 253 associated with objects detected in the images 130.
- the inferences and/or determinations provided by the first neural network 220 and the second neural network 230 can be jointly used to synthesize and/or generate the synthesized 3D scenes 170.
- the configurations of the first neural network 220 and the second neural network 230 can vary.
- both the first neural network 220 and the second neural network 230 can be implemented, at least in part, using a convolutional neural network (CNN).
- the first neural network 220 can be implemented as a convolutional long short-term memory (Conv LSTM).
- the Conv LSTM can include an LSTM structure that integrates the CNN with a recurrent neural network.
- the second neural network 230 can be implemented as a regression convolutional neural network (regression ConvNet).
- the first neural network 220 and the second neural network 230 can be trained utilizing images 130 that include, or are associated with, ground truth information 260.
- the ground truth information 260 can generally include any information that can assist the first neural network 220 and/or the second neural network 230 with performing scene understanding tasks including, but not limited to, tasks such as detecting objects and scenes, classifying objects and scenes, and/or determining DOF information for objects.
- the ground truth information 260 can include annotations, information, and/or data that identify object segments in the scenes that are captured in a set of training images included in the images 130 stored on the scene generation system.
- the object segments can be used to identify the objects in the scenes, and can be augmented with various data pertaining to the objects (e.g., dimensions, DOF information, position information, semantic labels, etc.).
- the ground truth information 260 can also include information that identifies the scenes and/or provides other information relating to the scenes (e.g., semantic labels, and dimensions of a room, floorplan, or layout corresponding to a scene).
- the type of ground truth information 260 utilized to train the first neural network 220 and second neural network 230 can vary.
- the ground truth information 260 can include any or all of the following: DOF values for objects in images 130; dimensions for objects in images 130; bounding boxes that identify boundaries of the objects in images 130; location information for objects in images 130 (e.g., which may identify the 3D coordinates of the objects in a 3D space or the location of the object on a floorplan or layout); yaw, pitch, and roll data for objects in the images 130; information indicating poses, positions, and orientation of objects in the images 130; object and scene identifiers and/or classifications; semantic labels associated with objects and scenes; dimension and contour information for scenes; and/or other types of data describing the objects and/or scenes.
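For illustration only, the kinds of ground truth fields listed above might be organized as simple annotation records. Every class and field name below is hypothetical and not part of the disclosure:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of training annotation records; field names are
# illustrative and do not appear in the original disclosure.
@dataclass
class ObjectAnnotation:
    label: str            # semantic object label, e.g. "bed"
    bbox: tuple           # 2D bounding box (x, y, w, h) in pixels
    location_3d: tuple    # (x, y, z) coordinates in a 3D space
    dimensions: tuple     # (d_x, d_y, d_z) object dimensions
    yaw_pitch_roll: tuple # rotation data for the object

@dataclass
class SceneAnnotation:
    scene_label: str                     # semantic scene label, e.g. "bedroom"
    floorplan_dims: tuple                # dimensions of the room layout
    objects: list = field(default_factory=list)

scene = SceneAnnotation("bedroom", (4.0, 3.0))
scene.objects.append(ObjectAnnotation("bed", (10, 20, 120, 80),
                                      (1.0, 0.0, 1.5), (2.0, 1.6, 0.5),
                                      (0.0, 0.0, 0.0)))
```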
- the ground truth information 260 permits the dependency information 254 to be learned by the image analysis component 140 during a training phase or stage.
- the first neural network 220 of the image analysis component 140 can include an LSTM structure that captures the dependency information 254 during training. This knowledge can then be utilized by the first neural network 220, or other component of the image analysis component 140, to perform scene understanding tasks (e.g., relating to classifying/detecting objects and scenes).
- the first neural network 220 and the second neural network 230 can be trained separately.
- the training of the first neural network 220 can include, inter alia, mapping pixels of object segments identified in the ground truth information 260 to both semantic object labels and semantic scene labels, and utilizing a recurrent unit of the first neural network 220 to capture and learn dependency information 254 (including inter-object dependency information and object-scene dependency information).
- the training of the second neural network 230 can include, inter alia, mapping pixels of object segments identified in the ground truth information 260 to values that describe an object’s pose, position, and dimension within a 3D space. Further details regarding exemplary training techniques for both the first neural network 220 and the second neural network 230 are discussed below with respect to Figure 3.
- Exemplary embodiments of the scene generation system 150 and the aforementioned sub-components are described in further detail below. While the sub-components of the scene generation system 150 may be depicted in Figures 1 and 2 as being distinct or separate from one another, it should be recognized that this distinction may be a logical distinction rather than a physical distinction. Any or all of the sub-components can be combined with one another to perform the functions described herein, and any aspect or feature that is described as being performed by one sub-component can be performed by any or all of the other sub-components. Also, while the sub-components of the scene generation system 150 may be illustrated as being implemented in software in certain portions of this disclosure, it should be recognized that the sub-components described herein may be implemented in hardware and/or software.
- FIG. 3 is a diagram illustrating an exemplary architecture 300 for a scene generation system 150 in accordance with certain embodiments.
- the exemplary architecture 300 illustrates, inter alia, underlying configurations for the first neural network 220 (e.g., which may be implemented as a Conv LSTM 310) and the second neural network 230 (e.g., which may be implemented as a regression ConvNet 320), and how the first neural network 220 and the second neural network 230 cooperate to produce joint inferences 340 that can be utilized to construct a synthesized 3D scene 170.
- the images 130 on the left side of the figure can include a set of training images that are annotated with ground truth information 260, which is utilized for training the Conv LSTM network 310 and the regression ConvNet network 320.
- the Conv LSTM network 310 includes a CNN that is connected to an LSTM structure 330 that comprises two LSTM modules.
- the two LSTM modules can compute a Softmax output for each LSTM hidden layer value in order to obtain a semantic scene label loss 350 and a semantic object label loss 360.
- the regression ConvNet 320 can be trained using a geometric loss 370, which measures the consistency between ground truth DOF values and the regression values obtained by the regression ConvNet 320.
- the Conv LSTM 310 and regression ConvNet 320 jointly provide inferences 340 for objects’ poses, positions, and dimensions in a 3D space. These inferences 340 are utilized to generate a 3D scene 170 that agrees with the floor plan of a query image 130 (e.g., a 2D and/or monocular image).
- the Conv LSTM 310 and regression ConvNet 320 are configured to provide inferences 340 on images 130 without ground truth information 260.
- the inferences 340 from the Conv LSTM 310 and regression ConvNet 320 can be utilized for implementing various scene understanding tasks, including scene/object classification, object detection, and object DOF estimation.
- the Conv LSTM 310 and regression ConvNet 320 can be configured to perform these scene understanding tasks for indoor scenes (e.g., corresponding to bedrooms, bathrooms, and/or other indoor scenes).
- the Conv LSTM 310 and regression ConvNet 320 can additionally, or alternatively, be configured to perform these scene understanding tasks for outdoor scenes in certain embodiments.
- the Conv LSTM network 310 integrates a CNN with a recurrent neural network having the LSTM structure 330.
- the CNN can take pixels of image regions as inputs at a low level, and pass high dimensional vectors that represent either an object within a scene image or a holistic scene image to the LSTM structure 330.
- the memory unit of LSTM structure 330 can capture dependency information 254, including information related to inter-object spatial context and object-scene dependencies.
- Both semantic object label loss 360 and semantic scene label loss 350 that measure the label consistencies of both scenes and objects can be utilized for governing the optimization of Conv LSTM 310.
- the 3D geometric loss 370 is utilized for training the regression ConvNet 320, which maps pixels from ground truth object regions to continuous values that describe an object’s pose, position, and dimension within a 3D space.
- This section describes an exemplary technique that may be utilized for implementing the Conv LSTM 310.
- Let (G_1, G_2, …, G_K) be the K ground truth objects within a scene image I. These object regions can be processed using a typical region-based CNN (R-CNN).
- a CNN architecture can be used that follows AlexNet, pre-trained on the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012) dataset, and the pre-trained network weights up to the seventh fully connected layer are used as initial weights for training the network.
- When directly fine-tuning the CNN with ground truth objects and scenes, the network is capable of providing segment-level inferences to query object proposals within a scene image.
- the LSTM structure 330 can be built on the top of ground truth object segments so that the objects’ spatial context along with object-scene dependencies can be learned.
- the LSTM structure 330 receives and considers CNN features of all objects X within the same scene as inputs that span various “time-steps” of the recurrent unit. Specifically, a memory cell of the LSTM structure 330 contains four main components: an input gate, a self-recurrent neuron, a forget gate, and an output gate. When each ground truth object region reaches the LSTM structure 330 through the R-CNN, the activation i_t at the LSTM’s input gate, the candidate value C̃_t, and the activation f_t at the memory cell’s forget gate can be computed as:
  i_t = σ(W_in x_t + U_i h_{t−1} + b_i)
  C̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
  f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
  where:
- σ(·) is a sigmoid function
- W_in denotes the weight parameters of the input gate
- x_t is the input vector to the LSTM unit or structure
- U_i denotes the weight parameters for the hidden state of the previous time-step at the input gate
- b_i denotes the bias vector parameters of the input gate
- C̃_t denotes the candidate value
- tanh(·) is the hyperbolic tangent function
- W_c denotes the weight parameters of the cell state
- x_t is the signal vector of the current state
- U_c denotes the weight parameters for the cell state
- b_c denotes the bias vector parameters of the cell state
- W_f denotes the weight parameters of the forget gate
- U_f denotes the weight parameters for the cell state of the previous time-step
- b_f denotes the bias vector parameters of the forget gate.
- σ(·) can represent a sigmoid layer that determines how much information passes through this layer and outputs values O_s ∈ (0, 1], and the tanh(·) layer outputs values O_tanh ∈ (−1, 1).
- the forget gate determines the new cell state C_t by deciding how much information from another segment S_{t−1} should be forgotten. Given the values of the input gate activation i_t, the forget gate activation f_t, and the candidate value C̃_t, the new cell state C_t can be obtained using:
  C_t = f_t ∘ C_{t−1} + i_t ∘ C̃_t
  where ∘ denotes element-wise multiplication, and:
- C_t denotes the cell state vector
- f_t is the activation vector at the forget gate
- the output gate value o_t can then be obtained based on the CNN end output x_t, the hidden layer value h_{t−1} obtained from another segment entry, and the updated cell state value C_t through:
  o_t = σ(W_o x_t + U_o h_{t−1} + V_o C_t + b_o)
  where:
- o_t is the activation vector of the output gate
- σ(·) is a sigmoid function
- W_o denotes the weight parameters of the output gate
- x_t is an input vector to the LSTM unit or structure
- U_o denotes the weight parameters for the output state
- V_o denotes the weight parameters of the output signals
- b_o denotes the bias vector parameters of the output gate.
- the new hidden layer value h_t can be computed using:
  h_t = o_t ∘ tanh(C_t)
  where:
- h_t is the hidden state at time-step t
- o_t is the activation vector of the output gate
- tanh(·) is the hyperbolic tangent function
- C_t denotes the cell state vector.
- W_in, W_c, W_f, W_o, U_i, U_c, U_f, U_o, and V_o are weight parameters of the model
- b_i, b_f, b_c, and b_o are bias vectors.
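As an illustrative sketch (not the disclosed implementation), the gate and state equations above can be exercised for a single LSTM time-step in NumPy, with the convolutional front end omitted and all dimensions chosen arbitrarily:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time-step following the gate equations above.
    p is a dict holding the weight matrices W_*, U_*, V_o and biases b_*."""
    i_t = sigmoid(p["W_in"] @ x_t + p["U_i"] @ h_prev + p["b_i"])     # input gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate value
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate
    c_t = f_t * c_prev + i_t * c_tilde                                # new cell state
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev
                  + p["V_o"] @ c_t + p["b_o"])                        # output gate
    h_t = o_t * np.tanh(c_t)                                          # new hidden state
    return h_t, c_t

# Toy dimensions: 4-d CNN feature input, 3-d hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
p = {k: rng.standard_normal((d_h, d_in)) for k in ("W_in", "W_c", "W_f", "W_o")}
p.update({k: rng.standard_normal((d_h, d_h)) for k in ("U_i", "U_c", "U_f", "U_o", "V_o")})
p.update({k: np.zeros(d_h) for k in ("b_i", "b_c", "b_f", "b_o")})
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), p)
```

Because h_t is the product of a sigmoid output and a tanh output, each hidden-state entry stays strictly inside (−1, 1).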
- the learning of the Conv LSTM 310 also considers the object-scene relation.
- the scene image in which the objects are located can be considered as an additional image region, and it can be added to the last “time-step” entry of the LSTM structure 330.
- the output of each LSTM “time-step” can be used for computing the semantic object label loss and the semantic scene label loss.
- the last “time-step” output can be considered as the scene representation because it includes information on both local object regions and the holistic scene. Both object and scene label consistencies can be measured with the cross entropy losses:
  ℓ_total = ℓ_object(p_o, p̂_o) + ℓ_scene(p_s, p̂_s)
  where:
- ℓ_object denotes the object loss
- p_o denotes the ground truth object category
- p̂_o denotes the predicted object category
- ℓ_scene denotes the scene loss
- p_s denotes the ground truth scene category
- p̂_s denotes the predicted scene category.
- each predicted distribution p̂ can be obtained by passing the network output y through a Softmax function, and p is the one-hot vector that represents the ground truth label of an instance.
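A minimal sketch of the combined cross-entropy loss described above, assuming toy class counts and illustrative network outputs y:

```python
import numpy as np

def softmax(y):
    # Numerically stable Softmax over the network output y.
    e = np.exp(y - np.max(y))
    return e / e.sum()

def cross_entropy(p, p_hat):
    # p is the one-hot ground truth vector, p_hat the Softmax output.
    return -float(np.sum(p * np.log(p_hat + 1e-12)))

# Toy example: 3 object classes and 2 scene classes (sizes are illustrative).
y_obj, y_scene = np.array([2.0, 0.5, -1.0]), np.array([1.0, 0.0])
p_obj, p_scene = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0])

# Total loss = object label loss + scene label loss.
loss_total = cross_entropy(p_obj, softmax(y_obj)) + cross_entropy(p_scene, softmax(y_scene))
```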
- objects within the same scene image can be fed into the same unit of the LSTM structure 330 at different “time-step” entries.
- Benefiting from the internal state module of LSTM, spatial dependencies between objects and dependencies between objects and the scene category can be fully explored when optimizing the network.
- a set of segment proposals can be generated following the same strategy as in the training stage. These segment proposals along with the scene image can be fed into the Conv LSTM network 310.
- object detection can be achieved by ranking segment proposals’ output scores.
- This section describes an exemplary technique that may be utilized for implementing the regression ConvNet 320.
- the degrees of freedom for each object can be parametrized as (p_x, p_y, p_z, d_x, d_y, d_z, roll, pitch, yaw), where the first three variables indicate the object’s 3D translations from a zero point, the middle three variables are the object’s 3D dimensions, and the last three variables indicate the object’s rotations about each axis.
- Reasonable constraints can be applied to simplify the problem for certain object categories (e.g., bed and table).
- the remaining parameters that need to be estimated for determining an object’s placement into a 3D space form a 6-dimensional vector (p_x, p_y, d_x, d_y, d_z, yaw), among which p_x, d_x, and d_y can be inferred from 2D object detection results if it is assumed that the image plane is parallel to the x-z plane.
- the regression ConvNet 320 can be used to estimate (p_y, d_z, yaw).
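The split described above, in which p_x, d_x, and d_y come from the 2D detection while (p_y, d_z, yaw) come from the regression ConvNet 320, can be sketched as follows. The pixel-to-meter scaling used to convert the 2D box is an illustrative assumption, not part of the disclosure:

```python
# Sketch of assembling the 6-dimensional placement vector
# (p_x, p_y, d_x, d_y, d_z, yaw). The mapping from a 2D box to p_x, d_x, d_y
# shown here (simple scaling by a known pixels-per-meter factor) is a
# hypothetical simplification for illustration.
def assemble_placement(box_2d, regressed, px_per_meter=100.0):
    x, y, w, h = box_2d                 # 2D detection in pixels
    p_y, d_z, yaw = regressed           # values estimated by the regression network
    p_x = (x + w / 2.0) / px_per_meter  # inferred from the 2D detection
    d_x = w / px_per_meter
    d_y = h / px_per_meter
    return (p_x, p_y, d_x, d_y, d_z, yaw)

placement = assemble_placement((150, 40, 200, 100), (0.8, 0.5, 0.25))
```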
- the regression ConvNet 320 can be trained using annotations on both ground truth object segments and segment proposals obtained from an unsupervised object segmentation method, constrained parametric min-cut (CPMC). For the latter, corresponding depth channels can be used to compute the annotations in the training stage. In the testing or operational stage, only RGB channels of object segment proposals may be utilized as inputs to the regression ConvNet 320 for pose and position estimation. Empirically, the commonly used least square loss can be highly sensitive to outliers, and in this case outliers can be easily observed. Therefore, a robust square loss can instead be chosen for training the regression ConvNet:
- e is the L2-distance based loss on the pose.
- the regression ConvNet 320 can be built by applying some modifications to the AlexNet architecture, and the resulting ConvNet can follow the stream: C(11, 96, 4) → ReLU → P(2, 2) → C(5, 256, 1) → ReLU → C(3, 384, 1) → P(2, 2) → F(512) → ReLU → F(128) → ReLU.
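Assuming C(k, n, s) denotes a convolution with kernel size k, n filters, and stride s, and P(k, s) denotes pooling with kernel k and stride s, the spatial size implied by the stream above can be traced with a small helper. Zero padding is assumed here, which is a simplification of the actual AlexNet-style layers:

```python
import math

# Hypothetical helper that traces the spatial size of a square input through
# the convolution/pooling portion of the stream above. Fully connected F(n)
# layers do not affect spatial size and are omitted. Padding is assumed 0.
def trace_spatial_size(size, stream):
    for kind, k, s in stream:
        if kind in ("C", "P"):
            size = math.floor((size - k) / s) + 1
    return size

# C(11, 96, 4) -> P(2, 2) -> C(5, 256, 1) -> C(3, 384, 1) -> P(2, 2)
stream = [("C", 11, 4), ("P", 2, 2), ("C", 5, 1), ("C", 3, 1), ("P", 2, 2)]
final = trace_spatial_size(227, stream)  # 227 is a typical AlexNet input size
```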
- Experimental results may suggest that fine-tuning a pre-trained ConvNet usually leads to a higher loss when training the regression ConvNet 320. Therefore, the ConvNet can be trained from scratch using random initial weights. This can start with an initial learning rate of 0.01 and decay the learning rate by 0.96 after every 2,000 iterations. The training can be stopped at 20,000 iterations.
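The stepwise schedule described above (initial rate 0.01, decayed by 0.96 after every 2,000 iterations, stopped at 20,000 iterations) can be sketched directly:

```python
# Stepwise learning-rate schedule: multiply the base rate by the decay factor
# once per completed 2,000-iteration step.
def learning_rate(iteration, base_lr=0.01, decay=0.96, step=2000):
    return base_lr * decay ** (iteration // step)

# One value per 2,000-iteration step up to the 20,000-iteration stopping point.
schedule = [learning_rate(it) for it in range(0, 20000, 2000)]
```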
- This section describes an exemplary technique that may be utilized for generating inferences that can be utilized to create the synthesized 3D scene 170.
- the inference stage of this approach can start with obtaining a set of figure-ground object segments from each scene image.
- CPMC can be utilized on monocular test images to obtain independent figure-ground overlapping image partitions by solving a sequence of constrained parametric min-cut problems, while not requiring any prior information on objects’ ground truth labels or locations.
- S = {S_1, S_2, …, S_T} is used to define T object segment proposals that are generated from an image.
- Each segment S_i can be a binary matrix, where 0 denotes the background and 1 denotes the foreground.
- the candidate image segment proposals can fit the R-CNN framework, and use the network’s fully-connected layer output for training object detectors.
- the Conv LSTM 310 can be defined to handle a fixed number of K object segments in both training and testing/operational stages.
- the remaining K − 1 segment proposals that jointly affect the representation of S_i can be expected to be contextually meaningful.
- T is a subset of object segment proposals
- V denotes the object segment hypotheses with edges
- S denotes the object segment proposals generated from an image
- w_ij denotes the chi-squared distance between a pair of object segments
- p_j is the cost of selecting a new segment proposal
- N_T denotes the number of selected object segments
- K is the constraint of the number of selected object segments.
- w_ij denotes the chi-squared distance between a group element v_i (considered as clients) and a potential group center vertex v_j (considered as facilities), and the cost p_j of opening a facility is fixed to d. Submodularity of the overall profit function has been proven.
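As a hedged sketch of the selection idea above, a simple greedy rule over chi-squared distances between segment feature histograms is shown below. Greedy selection is a common approximation for submodular facility-location-style objectives, but it is an assumption here and may differ from the exact optimization used:

```python
import numpy as np

def chi2_distance(a, b, eps=1e-10):
    # Chi-squared distance between two feature histograms.
    return 0.5 * float(np.sum((a - b) ** 2 / (a + b + eps)))

def select_context_segments(features, query_idx, k):
    """Greedily pick the k segments closest (in chi-squared distance) to the
    query segment. A simple stand-in for facility-location style selection."""
    dists = [(chi2_distance(features[query_idx], f), i)
             for i, f in enumerate(features) if i != query_idx]
    dists.sort()
    return [i for _, i in dists[:k]]

# Toy 2-bin histograms for four segment proposals; values are illustrative.
feats = np.array([[0.2, 0.8], [0.25, 0.75], [0.9, 0.1], [0.3, 0.7]])
context = select_context_segments(feats, query_idx=0, k=2)
```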
- the representation of each candidate object segment may depend on the K − 1 selected object regions.
- the query segment can be placed at the last sequential order when feeding the set of segments into the Conv LSTM network 310.
- For segments that are selected as one of the K − 1 segments, their existence can be ignored in earlier entries of the sequential data, and the same procedure for extracting their representations can be followed.
- This section describes an exemplary technique that may be utilized for detecting objects.
- Detecting objects from cluttered scenes with occlusions is a long-term challenging problem.
- One aim can be to determine whether each candidate image segment contains the object of interest among all image segment proposals.
- the CPMC can be configured to generate up to 200 object segment proposals, where each segment proposal is a binary mask with an irregular shape instead of a rectangle.
- An object segment S_i can also be considered as recalled if its intersection-over-union (IoU) score O(S_i, S_gt) between S_i and the ground-truth object region S_gt is above 0.5, where O(S_i, S_gt) can be computed as:
  O(S_i, S_gt) = |S_i ∩ S_gt| / |S_i ∪ S_gt|
  where:
- O(·) is the intersection-over-union (IoU) score
- S_i is the mask of a candidate object segment
- S_gt is the mask of the ground-truth object region.
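The IoU recall test above can be sketched on binary segment masks:

```python
import numpy as np

# Intersection-over-union between a candidate segment mask and the
# ground-truth mask; a segment counts as recalled when IoU exceeds 0.5.
def iou(seg, seg_gt):
    inter = np.logical_and(seg, seg_gt).sum()
    union = np.logical_or(seg, seg_gt).sum()
    return inter / union if union else 0.0

gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:4] = True      # ground-truth region: 8 pixels
cand = np.zeros((4, 4), dtype=bool)
cand[0:2, 0:3] = True    # candidate region: 6 pixels, fully inside gt
recalled = iou(cand, gt) > 0.5
```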
- each object image region can be zero-padded with the binary mask before feeding these image regions to Conv LSTM 310.
- the hidden layer values h can be extracted from the LSTM structure 330 for all candidate object regions, and a binary classifier (e.g., a binary lib-svm classifier) can be trained for each object category using the hidden layer features.
- the hidden layer dimension h of both LSTM modules is empirically set to 256.
- T “sequences” of objects can be generated, where each “sequence” contains K − 1 segments and the segment of interest is placed at the last “time-step.”
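The per-category binary classifier trained on the hidden-layer features (a lib-svm classifier in the text) can be approximated for illustration with a minimal logistic-regression stand-in. The feature dimension and data below are toy values, not the 256-d features described above:

```python
import numpy as np

# Minimal logistic-regression stand-in for the per-category binary classifier
# trained on LSTM hidden-layer features. The disclosure uses a lib-svm
# classifier; this substitute is for illustration only.
def train_binary_classifier(H, y, lr=0.5, iters=500):
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of the log loss
        w -= lr * H.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(1)
# Toy stand-in for hidden-layer features: positives shifted from negatives.
H = np.vstack([rng.normal(1.0, 0.3, (20, 8)), rng.normal(-1.0, 0.3, (20, 8))])
y = np.array([1.0] * 20 + [0.0] * 20)
w, b = train_binary_classifier(H, y)
scores = H @ w + b  # ranking scores, analogous to ranking segment proposals
```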
- the Conv LSTM 310 learned inter-object spatial context and object-scene dependencies with the recurrent unit using both semantic object label loss and semantic scene label loss, and the regression ConvNet was learned by mapping local object patches within scene images to parametrized object pose, position, and dimension variables so as to provide continuous-form 3D inferences.
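The joint use of semantic scene label loss, semantic object label loss, and 3D geometrical loss can be illustrated with a simple combined objective; the weighting scheme is an assumption, since the text does not specify how the terms are balanced:

```python
# Illustrative combination of the three training losses mentioned above.
def total_loss(scene_loss, object_loss, geom_loss,
               w_scene=1.0, w_object=1.0, w_geom=1.0):
    """Weighted sum of scene label, object label, and 3D geometrical losses."""
    return w_scene * scene_loss + w_object * object_loss + w_geom * geom_loss
```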
- Experiments on the NYU-v2 dataset demonstrate the effectiveness of introducing the LSTM recurrent unit into a pure ConvNet framework by showing consistent improvements over a directly fine-tuned CNN. Also, it was demonstrated that training the regression ConvNet from scratch can achieve a significantly lower error rate than a fine-tuned CNN approach.
- Figure 4 is a flow chart of a method 400 for implementing an exemplary technique in accordance with certain embodiments.
- the exemplary method 400 may be executed in whole or in part by the scene generation system 150 in certain embodiments.
- one or more storage devices 201 can store instructions for performing the steps of method 400, and one or more processors 202 can be configured to execute performance of the steps of method 400.
- a first neural network 220 and a second neural network 230 are trained to perform scene understanding tasks.
- the scene understanding tasks can include tasks associated with detecting objects and scenes, classifying objects and scenes, and/or determining DOF information for objects in scenes.
- the first neural network 220 may represent a Conv LSTM 310, and the first neural network 220 can be trained to learn dependency information 254 (including inter-object dependency information and object-scene dependency information).
- the second neural network 230 may represent a regression ConvNet 320. Training the first neural network 220 and the second neural network 230 can include using various loss functions (e.g., semantic scene label loss function 350, semantic object label loss function 360, and/or 3D geometrical loss function 370) to optimize the networks.
- a two-dimensional (2D) image 130 comprising a scene is accessed.
- the image 130 may be accessed by a scene generation system 150 in various ways. For example, in certain embodiments the image 130 is retrieved from a database 210.
- the image 130 may also be provided to the scene generation system 150 by a computing device 110 over a network 190.
- the first neural network 220 is executed to determine a first set of inferences associated with detecting objects in the scene and classifying the scene. Executing the first neural network 220 can include using the first neural network 220 to analyze the image 130 that was accessed. For example, the first neural network 220 can utilize the dependency information 254 and/or other data that was learned during training to analyze the image 130.
- at step 440, the second neural network 230 is executed to determine a second set of inferences associated with determining degree of freedom information for the objects in the scene. Executing the second neural network 230 can include using the second neural network 230 to analyze the image 130 that was accessed.
- a 3D scene 170 is synthesized that corresponds to the scene included in the 2D image 130 using the first set of inferences provided by the first neural network 220 and the second set of inferences provided by the second neural network 230.
- Joint inferences 340 can be provided by the first neural network 220 and the second neural network 230 to the 3D scene synthesizer 160 to generate the 3D scene 170.
- the 3D scene synthesizer 160 may utilize one or more 3D models to generate the 3D scene 170.
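The steps of method 400 can be sketched at a high level as follows; the network and synthesizer objects are stand-in callables, not the actual Conv LSTM 310, regression ConvNet 320, or 3D scene synthesizer 160:

```python
# High-level sketch of method 400: access a 2D image, run the first
# network (object detection / scene classification inferences), run the
# second network (degree-of-freedom inferences), then synthesize the 3D
# scene from both inference sets.
def method_400(image, first_network, second_network, synthesizer):
    """Steps 420-450: two inference passes, then 3D scene synthesis."""
    first_inferences = first_network(image)    # objects + scene class
    second_inferences = second_network(image)  # DOF info per object
    return synthesizer(first_inferences, second_inferences)
```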
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/161,704 US10297070B1 (en) | 2018-10-16 | 2018-10-16 | 3D scene synthesis techniques using neural network architectures |
PCT/IB2019/053197 WO2020079494A1 (en) | 2018-10-16 | 2019-04-17 | 3d scene synthesis techniques using neural network architectures |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3867883A1 true EP3867883A1 (en) | 2021-08-25 |
EP3867883A4 EP3867883A4 (en) | 2022-07-20 |
Family
ID=66541186
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19873600.1A Withdrawn EP3867883A4 (en) | 2018-10-16 | 2019-04-17 | 3d scene synthesis techniques using neural network architectures |
Country Status (3)
Country | Link |
---|---|
US (1) | US10297070B1 (en) |
EP (1) | EP3867883A4 (en) |
WO (1) | WO2020079494A1 (en) |
Families Citing this family (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9373059B1 (en) * | 2014-05-05 | 2016-06-21 | Atomwise Inc. | Systems and methods for applying a convolutional network to spatial data |
KR102061408B1 (en) * | 2017-03-24 | 2019-12-31 | (주)제이엘케이인스펙션 | Apparatus and method for analyzing images using semi 3d deep neural network |
US11273553B2 (en) | 2017-06-05 | 2022-03-15 | Autodesk, Inc. | Adapting simulation data to real-world conditions encountered by physical processes |
US10776662B2 (en) * | 2017-11-09 | 2020-09-15 | Disney Enterprises, Inc. | Weakly-supervised spatial context networks to recognize features within an image |
US11334612B2 (en) * | 2018-02-06 | 2022-05-17 | Microsoft Technology Licensing, Llc | Multilevel representation learning for computer content quality |
WO2019153245A1 (en) * | 2018-02-09 | 2019-08-15 | Baidu.Com Times Technology (Beijing) Co., Ltd. | Systems and methods for deep localization and segmentation with 3d semantic map |
US10832437B2 (en) * | 2018-09-05 | 2020-11-10 | Rakuten, Inc. | Method and apparatus for assigning image location and direction to a floorplan diagram based on artificial intelligence |
US11055866B2 (en) * | 2018-10-29 | 2021-07-06 | Samsung Electronics Co., Ltd | System and method for disparity estimation using cameras with different fields of view |
US10970599B2 (en) * | 2018-11-15 | 2021-04-06 | Adobe Inc. | Learning copy space using regression and segmentation neural networks |
US10957099B2 (en) * | 2018-11-16 | 2021-03-23 | Honda Motor Co., Ltd. | System and method for display of visual representations of vehicle associated information based on three dimensional model |
US10963757B2 (en) * | 2018-12-14 | 2021-03-30 | Industrial Technology Research Institute | Neural network model fusion method and electronic device using the same |
US10635938B1 (en) * | 2019-01-30 | 2020-04-28 | StradVision, Inc. | Learning method and learning device for allowing CNN having trained in virtual world to be used in real world by runtime input transformation using photo style transformation, and testing method and testing device using the same |
US10860878B2 (en) * | 2019-02-16 | 2020-12-08 | Wipro Limited | Method and system for synthesizing three-dimensional data |
US11100917B2 (en) * | 2019-03-27 | 2021-08-24 | Adobe Inc. | Generating ground truth annotations corresponding to digital image editing dialogues for training state tracking models |
US11126890B2 (en) * | 2019-04-18 | 2021-09-21 | Adobe Inc. | Robust training of large-scale object detectors with a noisy dataset |
US10909349B1 (en) * | 2019-06-24 | 2021-02-02 | Amazon Technologies, Inc. | Generation of synthetic image data using three-dimensional models |
US20210055933A1 (en) * | 2019-08-21 | 2021-02-25 | International Business Machines Corporation | Compliance policy management and scheduling |
KR20210030147A (en) * | 2019-09-09 | 2021-03-17 | 삼성전자주식회사 | 3d rendering method and 3d rendering apparatus |
US11521026B2 (en) * | 2019-10-21 | 2022-12-06 | Bentley Systems, Incorporated | Classifying individual elements of an infrastructure model |
CN112862944B (en) * | 2019-11-09 | 2024-04-12 | 无锡祥生医疗科技股份有限公司 | Human tissue ultrasonic modeling method, ultrasonic equipment and storage medium |
KR20210060779A (en) * | 2019-11-19 | 2021-05-27 | 현대자동차주식회사 | Apparatus for diagnosing abnormality of vehicle sensor and method thereof |
CN111127476B (en) * | 2019-12-06 | 2024-01-26 | Oppo广东移动通信有限公司 | Image processing method, device, equipment and storage medium |
US11816790B2 (en) * | 2020-03-06 | 2023-11-14 | Nvidia Corporation | Unsupervised learning of scene structure for synthetic data generation |
CN111833358A (en) * | 2020-06-26 | 2020-10-27 | 中国人民解放军32802部队 | Semantic segmentation method and system based on 3D-YOLO |
CN112734881B (en) * | 2020-12-01 | 2023-09-22 | 北京交通大学 | Text synthesized image method and system based on saliency scene graph analysis |
CN112581598B (en) * | 2020-12-04 | 2022-08-30 | 深圳市慧鲤科技有限公司 | Three-dimensional model construction method, device, equipment and storage medium |
CN112950760B (en) * | 2021-01-29 | 2023-08-11 | 杭州群核信息技术有限公司 | Three-dimensional synthetic scene data generation system and method |
US11899749B2 (en) * | 2021-03-15 | 2024-02-13 | Nvidia Corporation | Automatic labeling and segmentation using machine learning models |
US20230081641A1 (en) * | 2021-09-10 | 2023-03-16 | Nvidia Corporation | Single-image inverse rendering |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6539354B1 (en) * | 2000-03-24 | 2003-03-25 | Fluent Speech Technologies, Inc. | Methods and devices for producing and using synthetic visual speech based on natural coarticulation |
EP2030171A1 (en) * | 2006-04-10 | 2009-03-04 | Avaworks Incorporated | Do-it-yourself photo realistic talking head creation system and method |
WO2018111920A1 (en) * | 2016-12-12 | 2018-06-21 | The Charles Stark Draper Laboratory, Inc. | System and method for semantic simultaneous localization and mapping of static and dynamic objects |
US10836379B2 (en) * | 2018-03-23 | 2020-11-17 | Sf Motors, Inc. | Multi-network-based path generation for vehicle parking |
-
2018
- 2018-10-16 US US16/161,704 patent/US10297070B1/en active Active
-
2019
- 2019-04-17 EP EP19873600.1A patent/EP3867883A4/en not_active Withdrawn
- 2019-04-17 WO PCT/IB2019/053197 patent/WO2020079494A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
US10297070B1 (en) | 2019-05-21 |
EP3867883A4 (en) | 2022-07-20 |
WO2020079494A1 (en) | 2020-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10297070B1 (en) | 3D scene synthesis techniques using neural network architectures | |
US11816907B2 (en) | Systems and methods for extracting information about objects from scene information | |
JP6765487B2 (en) | Computer implementation methods using artificial intelligence, AI systems, and programs | |
CN110310175B (en) | System and method for mobile augmented reality | |
Whelan et al. | ElasticFusion: Real-time dense SLAM and light source estimation | |
US20220414911A1 (en) | Three-dimensional reconstruction method and three-dimensional reconstruction apparatus | |
CN114254417A (en) | Automatic identification and use of building floor plan information | |
Liu et al. | 3D Point cloud analysis | |
Koch et al. | Real estate image analysis: A literature review | |
CN115222896B (en) | Three-dimensional reconstruction method, three-dimensional reconstruction device, electronic equipment and computer readable storage medium | |
CN116385660A (en) | Indoor single view scene semantic reconstruction method and system | |
Mura et al. | Walk2map: Extracting floor plans from indoor walk trajectories | |
Mohan et al. | Room layout estimation in indoor environment: a review | |
Ruan et al. | A semantic octomap mapping method based on cbam-pspnet | |
Anjanappa | Deep learning on 3D point clouds for safety-related asset management in buildings | |
Zrira et al. | A novel incremental topological mapping using global visual features | |
Jahromi et al. | Geometric context and orientation map combination for indoor corridor modeling using a single image | |
Wei | Detecting as-built information model errors using unstructured images | |
EP4328868A1 (en) | Automated generation and use of building information from analysis of floor plans and acquired building images | |
Kang et al. | Near-real-time stereo matching method using temporal and spatial propagation of reliable disparity | |
Hueting et al. | Scene structure inference through scene map estimation | |
Chen et al. | Indoor point cloud semantic segmentation based on direction perception and hole sampling | |
Xu | Automated and Real-time Scan-to-BIM through Deep Learning-based Object Detection | |
Liu et al. | Transformer enhanced hierarchical 3D point cloud semantic segmentation | |
Hyeon et al. | Photo-realistic 3D model based accurate visual positioning system for large-scale indoor spaces |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20210301 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
A4 | Supplementary search report drawn up and despatched |
Effective date: 20220621 |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06N 3/04 20060101ALI20220614BHEP Ipc: G06T 7/70 20170101ALI20220614BHEP Ipc: G06T 15/20 20110101AFI20220614BHEP |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20230119 |